Connect to wxa-skill-eval Evaluation
wxa-skill-eval is an official WeChat end-to-end evaluation tool for Mini Program AI Skills. It automatically simulates real user conversations to comprehensively assess a Skill's intent understanding, trajectory generation, and final answer quality, and outputs a multi-dimensional evaluation report.
The tool does not include a built-in large model service; developers must supply their own model configuration. CloudBase large models are compatible with the OpenAI Chat Completions protocol and can be used directly with wxa-skill-eval, with no need to register additional model providers.
Prerequisites
- A CloudBase environment with its Environment ID (ENV_ID)
- A Token Resource Pack purchased
- Enable the required models in Console → AI → Text Models (recommended: hy3-preview or another high-capability model for more accurate evaluation)
- A CloudBase API Key (Console → Environment Settings → API Key)
Install wxa-skill-eval
Clone the ai-mode-skills repository, then navigate to the wxa-skills-eval directory and install dependencies:
cd wxa-skills-eval
pnpm install
Configure .env
Create a .env file in the wxa-skills-eval directory and fill in your CloudBase model configuration:
BASE_URL=https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase
API_KEY=<YOUR_CLOUDBASE_API_KEY>
MODEL=hy3-preview
Replace <ENV_ID> with your CloudBase environment ID and <YOUR_CLOUDBASE_API_KEY> with your API Key.
Set MODEL to the name of any model you have enabled in the console. Because the evaluation tool uses this model to drive simulated user conversations, choose a high-capability model with a large parameter count for the most accurate results.
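Before starting an evaluation, it can help to confirm that the .env file is complete. The following is a minimal sketch, not part of wxa-skill-eval: `check_env_file` is a hypothetical helper that checks only that the three variables shown above are set and that no `<...>` placeholder text remains.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with wxa-skill-eval): verify that a .env
# file defines BASE_URL, API_KEY, and MODEL, and that no <PLACEHOLDER>
# text from the template is left in their values.
check_env_file() {
  local file="${1:-.env}" key value status=0
  [ -f "$file" ] || { echo "missing file: $file" >&2; return 1; }
  for key in BASE_URL API_KEY MODEL; do
    # Print the value part of every "KEY=value" line and keep the last one.
    value="$(sed -n "s/^${key}=//p" "$file" | tail -n 1)"
    if [ -z "$value" ]; then
      echo "$key is not set" >&2
      status=1
    elif printf '%s' "$value" | grep -q '[<>]'; then
      echo "$key still contains a placeholder: $value" >&2
      status=1
    fi
  done
  return "$status"
}
```

Running `check_env_file .env` from the wxa-skills-eval directory returns a non-zero status and prints the offending variable names if anything is missing or still templated.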
The following models are currently available through the CloudBase Token Resource Pack:
| Model ID | Provider |
|---|---|
| hy3-preview | Tencent Hunyuan |
| deepseek-v4-flash-202605 | DeepSeek (official) |
| deepseek-v4-pro-202606 | DeepSeek (official) |
| deepseek-v4-flash | DeepSeek |
| deepseek-v4-pro | DeepSeek |
| deepseek-v3.2 | DeepSeek |
| glm-5.1 | Zhipu AI |
| glm-5v-turbo | Zhipu AI |
| glm-5-turbo | Zhipu AI |
| glm-5 | Zhipu AI |
| kimi-k2.6 | Moonshot |
| kimi-k2.5 | Moonshot |
| minimax-m3 | MiniMax |
| minimax-m2.7 | MiniMax |
| minimax-m2.5 | MiniMax |
| qwen3.5-flash | Alibaba |
| qwen3.5-plus | Alibaba |
Each model must be enabled in the console before use, and a Token Resource Pack must be purchased.
The cloudbase segment in the URL is CloudBase's unified provider, compatible with all models purchased via the Token Resource Pack (DeepSeek, Hunyuan, Kimi, GLM, etc.).
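To see how the URL fits together, the sketch below builds BASE_URL from an environment ID and sends a single test message. It rests on assumptions not confirmed by this page: that the chat endpoint is BASE_URL followed by `/chat/completions` and that the API Key is sent as a Bearer token, as is conventional for OpenAI-compatible services. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Build BASE_URL from an environment ID, following the pattern shown in the
# .env example (the trailing "cloudbase" segment is the unified provider).
build_base_url() {
  printf 'https://%s.api.tcloudbasegateway.com/v1/ai/cloudbase\n' "$1"
}

# Assumption: the endpoint path (/chat/completions) and the Bearer
# Authorization header follow the usual OpenAI-compatible convention;
# neither is stated in this document.
smoke_test_model() {
  local base_url="$1" api_key="$2" model="$3"
  curl -sS "${base_url}/chat/completions" \
    -H "Authorization: Bearer ${api_key}" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]}"
}
```

For example, `smoke_test_model "$(build_base_url <ENV_ID>)" "<YOUR_CLOUDBASE_API_KEY>" hy3-preview` should return a JSON response if the model is enabled and the key is valid. Switching to another model from the table only requires changing the last argument, since the provider segment stays the same.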
Run the Evaluation
Choose either mode to start the evaluation:
Web UI mode (recommended, visual interface):
pnpm dev:web
CLI mode:
pnpm dev
Evaluation Report
After the evaluation completes, the tool generates an eval_report.html report covering the following dimensions:
| Dimension | Description |
|---|---|
| Intent Understanding | Accuracy of the Skill's interpretation of user instructions |
| Trajectory Generation | Reasonableness and completeness of the operation path |
| Final Answer Quality | Correctness and quality of the output |
| Interface Coverage | Test coverage of atomic interfaces and components |
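When the CLI mode is run from a script, a small check can confirm that the report was actually produced. This is a sketch under one assumption: the directory that receives eval_report.html is not stated here, so the hypothetical `report_ready` helper takes it as an argument and defaults to the current directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if eval_report.html exists and is non-empty.
# Assumption: the report is written to the directory passed as $1
# (default: the current directory); adjust this to wherever the tool writes it.
report_ready() {
  local dir="${1:-.}"
  [ -s "${dir}/eval_report.html" ]
}
```

A typical use would be `pnpm dev && report_ready . && echo "report generated"`.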
Run at least 30 test cases per Skill to ensure adequate coverage.
wxa-skill-eval is intended for development-stage self-testing only. Evaluation results do not constitute a basis for WeChat Mini Program review. Official submission evaluation standards will be announced by WeChat separately.