Skip to main content

Connect to wxa-skill-eval Evaluation

wxa-skill-eval is an official WeChat end-to-end evaluation tool for Mini Program AI Skills. It automatically simulates real user conversations to comprehensively assess a Skill's intent understanding, trajectory generation, and final answer quality, and outputs a multi-dimensional evaluation report.

The tool does not come with a built-in large model service — developers must supply their own model configuration. CloudBase large models are compatible with the OpenAI Chat Completions protocol and can be used directly with wxa-skill-eval, with no need to register additional model providers.

Prerequisites

  1. A CloudBase environment with its Environment ID (ENV_ID)
  2. A Token Resource Pack purchased
  3. Enable the required models in Console → AI → Text Models (recommended: hy3-preview or another high-capability model for more accurate evaluation)
  4. A CloudBase API Key (Console → Environment Settings → API Key)

Install wxa-skill-eval

Clone the ai-mode-skills repository, then navigate to the wxa-skills-eval directory and install dependencies:

cd wxa-skills-eval
pnpm install

Configure .env

Create a .env file in the wxa-skills-eval directory and fill in your CloudBase model configuration:

BASE_URL=https://<ENV_ID>.api.tcloudbasegateway.com/v1/ai/cloudbase
API_KEY=<YOUR_CLOUDBASE_API_KEY>
MODEL=hy3-preview

Replace <ENV_ID> with your CloudBase environment ID and <YOUR_CLOUDBASE_API_KEY> with your API Key.

Model Selection

Set MODEL to the name of any model you have enabled in the console. Because the evaluation tool drives simulated user conversations, choose a model with high intelligence and a large parameter count for the most accurate results.

The following models are currently available through the CloudBase Token Resource Pack:

Model IDProvider
hy3-previewTencent Hunyuan
deepseek-v4-flash-202605DeepSeek (official)
deepseek-v4-pro-202606DeepSeek (official)
deepseek-v4-flashDeepSeek
deepseek-v4-proDeepSeek
deepseek-v3.2DeepSeek
glm-5.1Zhipu AI
glm-5v-turboZhipu AI
glm-5-turboZhipu AI
glm-5Zhipu AI
kimi-k2.6Moonshot
kimi-k2.5Moonshot
minimax-m3MiniMax
minimax-m2.7MiniMax
minimax-m2.5MiniMax
qwen3.5-flashAlibaba
qwen3.5-plusAlibaba

Each model must be enabled in the console before use, and a Token Resource Pack must be purchased.

About BASE_URL

The cloudbase segment in the URL is CloudBase's unified provider, compatible with all models purchased via the Token Resource Pack (DeepSeek, Hunyuan, Kimi, GLM, etc.).

Run the Evaluation

Choose either mode to start the evaluation:

Web UI mode (recommended, visual interface):

pnpm dev:web

CLI mode:

pnpm dev

Evaluation Report

After the evaluation completes, the tool generates an eval_report.html report covering the following dimensions:

DimensionDescription
Intent UnderstandingAccuracy of the Skill's interpretation of user instructions
Trajectory GenerationReasonableness and completeness of the operation path
Final Answer QualityCorrectness and quality of the output
Interface CoverageTest coverage of atomic interfaces and components

It is recommended to run at least 30 test cases per Skill to ensure adequate coverage.

note

wxa-skill-eval is intended for development-stage self-testing only. Evaluation results do not constitute a basis for WeChat Mini Program review. Official submission evaluation standards will be announced by WeChat separately.