eval-skills
AI Agent Skill unit testing framework. A framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on AI skills. Use this skill to assess skill quality before production, compare candidate skills on the same benchmark, enforce quality gates in CI/CD, and generate human-readable evaluation reports.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/islinxu/eval-skills
This skill fills the L1 (Skill Unit Test) gap that LangSmith / DeepEval leave open: while those platforms focus on agent-level and trajectory-level evaluation (L2-L3), eval-skills targets the individual skill level, ensuring each building block meets quality standards before it ever enters an agent pipeline.
When to Use This Skill
- Before deploying a new skill to production: run `eval` to verify it meets your quality gate.
- When choosing between multiple candidate skills: run `select` to rank them on the same benchmark.
- When a skill is upgraded: run `report diff` to detect regressions.
- In CI/CD: use `--exit-on-fail` to block merges that degrade skill quality.
- When bootstrapping a new skill: run `create` to generate a ready-to-fill skeleton.
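The CI/CD case relies on a general convention: a check that exits with a non-zero status blocks the merge. The sketch below illustrates that idea only. It is not how eval-skills implements `--exit-on-fail`; the function name `gate_exit_code`, the 0.8 threshold, and the sample rates are all hypothetical.

```python
def gate_exit_code(completion_rates, threshold=0.8):
    """Return 0 if every skill meets the threshold, 1 otherwise.

    Hypothetical helper: illustrates a quality gate in general,
    not the eval-skills implementation.
    """
    failing = {
        name: rate
        for name, rate in completion_rates.items()
        if rate < threshold
    }
    for name, rate in sorted(failing.items()):
        print(f"FAIL {name}: completion {rate:.2f} < {threshold:.2f}")
    return 1 if failing else 0


# Made-up sample data for illustration.
code = gate_exit_code({"web_search": 0.92, "news_fetch": 0.75})
print("exit code:", code)
```

In a real pipeline the returned value would be passed to `sys.exit` so the CI runner sees the failure.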
Capabilities
1. Find Skills
Search for existing skills by keyword, tag, or adapter type.
eval-skills find \
  --query "web search" \
  --tag retrieval api \
  --adapter http \
  --min-completion 0.8 \
  --skills-dir ./skills \
  --limit 10
| Option | Description | Default |
|---|---|---|
| `-q, --query <string>` | Keyword search (matches name, description, tags) | none |
| `-t, --tag <tags...>` | Filter by tags (intersection: a skill must have ALL specified tags) | none |
| `-a, --adapter <type>` | Filter by adapter type (`http`, `subprocess`, `mcp`) | none |
| `--min-completion <rate>` | Minimum historical completion rate (0.0 to 1.0) | none |
| `--skills-dir <dir>` | Directory to scan for `skill.json` files | `./skills` |
| `--limit <n>` | Maximum number of results | 20 |
Results are ranked by search relevance (when --query is provided) or by historical completion rate (descending).
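The filter and ranking rules above can be sketched in a few lines of Python. This is an illustration of the documented semantics for the case without `--query`, not the tool's actual code: `SkillRecord` and `filter_skills` are hypothetical names, the record fields are not the real `skill.json` schema, and treating `--min-completion` as an inclusive bound is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SkillRecord:
    # Hypothetical in-memory view of a skill; not the real skill.json schema.
    name: str
    tags: set = field(default_factory=set)
    adapter: str = "http"
    completion_rate: float = 0.0


def filter_skills(skills, tags=(), adapter=None, min_completion=None, limit=20):
    """Apply the documented filters, then rank by completion rate (descending)."""
    wanted = set(tags)
    matches = [
        s for s in skills
        if wanted <= s.tags  # intersection: the skill must carry ALL tags
        and (adapter is None or s.adapter == adapter)
        # Assumption: the minimum is inclusive.
        and (min_completion is None or s.completion_rate >= min_completion)
    ]
    matches.sort(key=lambda s: s.completion_rate, reverse=True)
    return matches[:limit]


# Made-up sample data for illustration.
skills = [
    SkillRecord("web_search", {"retrieval", "api"}, "http", 0.92),
    SkillRecord("doc_lookup", {"retrieval"}, "http", 0.97),
    SkillRecord("news_fetch", {"retrieval", "api"}, "http", 0.85),
    SkillRecord("local_grep", {"retrieval", "api"}, "subprocess", 0.99),
]
result = filter_skills(
    skills, tags=["retrieval", "api"], adapter="http", min_completion=0.8
)
print([s.name for s in result])  # ['web_search', 'news_fetch']
```

Here `doc_lookup` is dropped because it lacks the `api` tag, and `local_grep` is dropped because its adapter is `subprocess`, even though both have higher completion rates than the two results.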
2. Create Skills
Generate a skill skeleton from a template to bootstrap development.
eval-skills create \
  --name my_api_skill \
  --from-template http_request \
  --output-dir ./skills \
  --description "Fetches weather data from OpenWeather API"
| Option | Description | Default |
|---|---|---|
| `--name <name>` | Required. Skill name | none |
| `--from-template <tpl>` | Template type: `http_request`, `python_script`, `mcp_tool` | `http_request` |
| `--output-dir <dir>` | Output directory | `./skills` |
| `--description <text>` | Human-readable description embedded in `skill.json` | none |
Generated file structure:
skills/my_api_skill/
  skill.json             # Skill metadata (id, schemas, adapter config)
  adapter.config.json    # Adapter-specific configuration
  tests/
    basic.eval.json      # A starter benchmark with one sample task
  skill.py               # (python_script template only) JSON-RPC entrypoint
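After running `create`, a quick sanity check can confirm that the listed files exist. The following is a minimal sketch, assuming `basic.eval.json` sits inside `tests/` and `skill.py` sits at the skill root; the helper name `missing_scaffold_files` is hypothetical and is not part of eval-skills.

```python
from pathlib import Path

# Relative paths taken from the listing above. Assumption: basic.eval.json
# lives under tests/, and skill.py lives at the skill root.
COMMON_FILES = ["skill.json", "adapter.config.json", "tests/basic.eval.json"]


def missing_scaffold_files(skill_dir, template="http_request"):
    """Return the expected scaffold files that are absent from skill_dir."""
    expected = list(COMMON_FILES)
    if template == "python_script":
        # skill.py is only generated for the python_script template.
        expected.append("skill.py")
    root = Path(skill_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


# Prints an empty list when the scaffold is complete.
print(missing_scaffold_files("./skills/my_api_skill"))
```

An empty list means every expected file is present; otherwise the list names what is missing.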
3. Evaluate Skills
Run benchmark evaluations against one or more skills. This is the core command.
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-islinxu-eval-skills": {
      "enabled": true,
      "auto_update": true
    }
  }
}