Connect to wxa-skill-eval Evaluation
wxa-skill-eval is an official WeChat end-to-end evaluation tool for Mini Program AI Skills. It automatically simulates real user conversations to comprehensively assess a Skill's intent understanding, trajectory generation, and final answer quality, and outputs a multi-dimensional evaluation report.
The tool does not come with a built-in large model service — developers must supply their own model configuration. CloudBase large models are compatible with the OpenAI Chat Completions protocol and can be used directly with wxa-skill-eval, with no need to register additional model providers.
Prerequisites
- A CloudBase environment (older plans can be upgraded), with its Environment ID (
ENV_ID) - Enable the required models in Console → AI → Text Models (recommended:
hy3-previewor another high-capability model for more accurate evaluation) - A CloudBase API Key (Console → Environment Settings → API Key)
Install wxa-skill-eval
Clone the ai-mode-skills repository, then navigate to the wxa-skills-eval directory and install dependencies:
cd wxa-skills-eval
pnpm install
Configure .env
Create a .env file in the wxa-skills-eval directory and fill in your CloudBase model configuration:
BASE_URL=https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase
API_KEY=<YOUR_CLOUDBASE_API_KEY>
MODEL=hy3-preview
Replace <ENV_ID> with your CloudBase environment ID and <YOUR_CLOUDBASE_API_KEY> with your API Key.
Set MODEL to the name of any model you have enabled in the console. Because the evaluation tool drives simulated user conversations, choose a model with high intelligence and a large parameter count for the most accurate results.
The following models are currently available through the CloudBase Resource-Point Plan:
| Model ID | Provider |
|---|---|
hy3-preview | Tencent Hunyuan |
deepseek-v4-flash-202605 | DeepSeek (official) |
deepseek-v4-pro-202606 | DeepSeek (official) |
deepseek-v4-flash | DeepSeek |
deepseek-v4-pro | DeepSeek |
deepseek-v3.2 | DeepSeek |
glm-5.1 | Zhipu AI |
glm-5v-turbo | Zhipu AI |
glm-5-turbo | Zhipu AI |
glm-5 | Zhipu AI |
kimi-k2.6 | Moonshot |
kimi-k2.5 | Moonshot |
minimax-m3 | MiniMax |
minimax-m2.7 | MiniMax |
minimax-m2.5 | MiniMax |
qwen3.5-flash | Alibaba |
qwen3.5-plus | Alibaba |
Each model must be enabled in the console before use, and a Resource-Point Plan must be activated.
The cloudbase segment in the URL is CloudBase's unified provider, compatible with all models supported via the Resource-Point Plan (DeepSeek, Hunyuan, Kimi, GLM, etc.).
Run the Evaluation
Choose either mode to start the evaluation:
Web UI mode (recommended, visual interface):
pnpm dev:web
CLI mode:
pnpm dev
Evaluation Report
After the evaluation completes, the tool generates an eval_report.html report covering the following dimensions:
| Dimension | Description |
|---|---|
| Intent Understanding | Accuracy of the Skill's interpretation of user instructions |
| Trajectory Generation | Reasonableness and completeness of the operation path |
| Final Answer Quality | Correctness and quality of the output |
| Interface Coverage | Test coverage of atomic interfaces and components |
It is recommended to run at least 30 test cases per Skill to ensure adequate coverage.
wxa-skill-eval is intended for development-stage self-testing only. Evaluation results do not constitute a basis for WeChat Mini Program review. Official submission evaluation standards will be announced by WeChat separately.