eval-skills

A unit testing framework for AI agent skills: a framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on skills.

This skill fills the L1 (Skill Unit Test) gap that LangSmith and DeepEval leave open. Those platforms focus on agent-level and trajectory-level evaluation (L2-L3); eval-skills targets the individual skill, ensuring each building block meets quality standards before it enters an agent pipeline.

When to Use This Skill

Before deploying a new skill to production — run eval to verify it meets your quality gate.
When choosing between multiple candidate skills — run select to rank them on the same benchmark.
When a skill is upgraded — run report diff to detect regressions.
In CI/CD — use --exit-on-fail to block merges that degrade skill quality.
When bootstrapping a new skill — run create to generate a ready-to-fill skeleton.

Capabilities

1. Find Skills

Search for existing skills by keyword, tag, or adapter type.

eval-skills find \
  --query "web search" \
  --tag retrieval api \
  --adapter http \
  --min-completion 0.8 \
  --skills-dir ./skills \
  --limit 10

Option	Description	Default
`-q, --query <string>`	Keyword search (matches name, description, tags)	—
`-t, --tag <tags...>`	Filter by tags (intersection: skill must have ALL specified tags)	—
`-a, --adapter <type>`	Filter by adapter type (`http`, `subprocess`, `mcp`)	—
`--min-completion <rate>`	Minimum historical completion rate (0.0 to 1.0)	—
`--skills-dir <dir>`	Directory to scan for `skill.json` files	`./skills`
`--limit <n>`	Maximum number of results	`20`

Results are ranked by search relevance (when --query is provided) or by historical completion rate (descending).
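
The filter and ranking rules above can be modeled in a few lines of Python. This is an illustrative sketch, not the tool's implementation: the `Skill` record, its `completion_rate` field, and the `find_skills` helper are hypothetical names, and keyword relevance scoring for `--query` is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Skill:
    # Hypothetical in-memory record; the real skill.json schema may differ.
    name: str
    adapter: str
    tags: set[str] = field(default_factory=set)
    completion_rate: float = 0.0


def find_skills(skills, tags=(), adapter=None, min_completion=None, limit=20):
    """Apply the documented filters, then rank by completion rate (descending)."""
    required = set(tags)
    matches = [
        s for s in skills
        if required <= s.tags  # intersection: the skill must carry ALL tags
        and (adapter is None or s.adapter == adapter)
        and (min_completion is None or s.completion_rate >= min_completion)
    ]
    matches.sort(key=lambda s: s.completion_rate, reverse=True)
    return matches[:limit]


catalog = [
    Skill("web_search", "http", {"retrieval", "api"}, 0.92),
    Skill("doc_lookup", "http", {"retrieval"}, 0.95),
    Skill("news_fetch", "http", {"retrieval", "api"}, 0.85),
    Skill("flaky_api", "http", {"retrieval", "api"}, 0.60),
]

ranked = find_skills(catalog, tags=["retrieval", "api"], adapter="http", min_completion=0.8)
print([s.name for s in ranked])  # ['web_search', 'news_fetch']
```

Here `doc_lookup` is excluded despite its higher completion rate because it lacks the `api` tag, and `flaky_api` falls below the 0.8 threshold.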

2. Create Skills

Generate a skill skeleton from a template to bootstrap development.

eval-skills create \
  --name my_api_skill \
  --from-template http_request \
  --output-dir ./skills \
  --description "Fetches weather data from OpenWeather API"

Option	Description	Default
`--name <name>`	Required. Skill name	—
`--from-template <tpl>`	Template type: `http_request`, `python_script`, `mcp_tool`	`http_request`
`--output-dir <dir>`	Output directory	`./skills`
`--description <text>`	Human-readable description embedded in `skill.json`	—

Generated file structure:

skills/my_api_skill/
  skill.json            # Skill metadata (id, schemas, adapter config)
  adapter.config.json   # Adapter-specific configuration
  tests/
    basic.eval.json     # A starter benchmark with one sample task
  skill.py              # (python_script template only) JSON-RPC entrypoint
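
After running create, it can be useful to confirm the scaffold is complete before filling it in. The following is a minimal sketch that checks only for the files listed above; the helper name `missing_scaffold_files` is hypothetical and is not part of eval-skills.

```python
from pathlib import Path

# Files listed in the generated structure above.
EXPECTED_FILES = ["skill.json", "adapter.config.json", "tests/basic.eval.json"]


def missing_scaffold_files(skill_dir, template="http_request"):
    """Return the expected scaffold files that are absent from skill_dir."""
    expected = list(EXPECTED_FILES)
    if template == "python_script":
        expected.append("skill.py")  # only this template ships an entrypoint
    root = Path(skill_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_scaffold_files("./skills/my_api_skill")
    print("scaffold complete" if not missing else f"missing: {missing}")
```

The check is purely structural: it does not validate the contents of `skill.json` or the starter benchmark.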

3. Evaluate Skills

Run benchmark evaluations against one or more skills. This is the core command.
