eval-skills

AI Agent Skill unit testing framework. A framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on AI skills. Use this skill to assess skill quality before production, compare candidate skills on the same benchmark, enforce quality gates in CI/CD, and generate human-readable evaluation reports.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/islinxu/eval-skills

This skill fills the L1 (Skill Unit Test) gap that LangSmith / DeepEval leave open: while those platforms focus on agent-level and trajectory-level evaluation (L2-L3), eval-skills targets the individual skill level, ensuring each building block meets quality standards before it ever enters an agent pipeline.

When to Use This Skill

  • Before deploying a new skill to production — run eval to verify it meets your quality gate.
  • When choosing between multiple candidate skills — run select to rank them on the same benchmark.
  • When a skill is upgraded — run report diff to detect regressions.
  • In CI/CD — use --exit-on-fail to block merges that degrade skill quality.
  • When bootstrapping a new skill — run create to generate a ready-to-fill skeleton.
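The CI/CD case above can be wrapped in a small gate script. This is a minimal sketch, assuming the CLI exits with a non-zero status when --exit-on-fail is set and a quality gate fails; run_quality_gate is a hypothetical helper, not part of eval-skills.

```python
import subprocess
from typing import Sequence


def run_quality_gate(command: Sequence[str]) -> bool:
    """Run a gate command and report whether it exited with status 0.

    Assumption: a failing quality gate surfaces as a non-zero exit status.
    A missing executable is treated as a failed gate rather than a crash.
    """
    try:
        completed = subprocess.run(list(command), check=False)
    except FileNotFoundError:
        return False
    return completed.returncode == 0
```

A CI step could call run_quality_gate(["eval-skills", "eval", "--exit-on-fail"]) and fail the job when it returns False. Only the eval command and the --exit-on-fail flag come from this page; any further options the command needs are not shown here.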

Capabilities

1. Find Skills

Search for existing skills by keyword, tag, or adapter type.

eval-skills find \
  --query "web search" \
  --tag retrieval api \
  --adapter http \
  --min-completion 0.8 \
  --skills-dir ./skills \
  --limit 10

Option | Description | Default
-q, --query <string> | Keyword search (matches name, description, tags) | -
-t, --tag <tags...> | Filter by tags (intersection: skill must have ALL specified tags) | -
-a, --adapter <type> | Filter by adapter type (http, subprocess, mcp) | -
--min-completion <rate> | Minimum historical completion rate (0.0 to 1.0) | -
--skills-dir <dir> | Directory to scan for skill.json files | ./skills
--limit <n> | Maximum number of results | 20

Results are ranked by search relevance (when --query is provided) or by historical completion rate (descending).
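The filtering and ranking rules above can be illustrated in a few lines. This is a sketch of the described semantics only, not the tool's implementation: SkillRecord and find_skills are hypothetical names, the threshold is assumed to be inclusive, and keyword relevance ranking for --query is left out.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class SkillRecord:
    # Hypothetical in-memory record; the real skill.json schema may differ.
    name: str
    tags: List[str] = field(default_factory=list)
    adapter: str = "http"
    completion_rate: float = 0.0


def find_skills(
    skills: Iterable[SkillRecord],
    tags: Iterable[str] = (),
    adapter: Optional[str] = None,
    min_completion: float = 0.0,
    limit: int = 20,
) -> List[SkillRecord]:
    """Filter skills, then rank by historical completion rate (descending)."""
    required = set(tags)
    matches = [
        s
        for s in skills
        # Tag intersection: the skill must carry ALL requested tags.
        if required.issubset(s.tags)
        and (adapter is None or s.adapter == adapter)
        # Assumption: the minimum completion rate is inclusive.
        and s.completion_rate >= min_completion
    ]
    matches.sort(key=lambda s: s.completion_rate, reverse=True)
    return matches[:limit]
```

Note that a skill tagged only retrieval is excluded by --tag retrieval api, even if its completion rate is the highest, because both tags are required.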

2. Create Skills

Generate a skill skeleton from a template to bootstrap development.

eval-skills create \
  --name my_api_skill \
  --from-template http_request \
  --output-dir ./skills \
  --description "Fetches weather data from OpenWeather API"

Option | Description | Default
--name <name> | Required. Skill name | -
--from-template <tpl> | Template type: http_request, python_script, mcp_tool | http_request
--output-dir <dir> | Output directory | ./skills
--description <text> | Human-readable description embedded in skill.json | -

Generated file structure:

skills/my_api_skill/
  skill.json            # Skill metadata (id, schemas, adapter config)
  adapter.config.json   # Adapter-specific configuration
  tests/
    basic.eval.json     # A starter benchmark with one sample task
  skill.py              # (python_script template only) JSON-RPC entrypoint
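
After running create, it can be useful to confirm that the scaffold matches the layout listed above. The following is a minimal sketch; missing_scaffold_files is a hypothetical helper, and the expected file list is taken only from the structure shown here.

```python
from pathlib import Path
from typing import List, Union

# Files listed in the generated structure above for every template.
BASE_FILES = ["skill.json", "adapter.config.json", "tests/basic.eval.json"]


def missing_scaffold_files(
    skill_dir: Union[str, Path], template: str = "http_request"
) -> List[str]:
    """Return the expected scaffold files that are absent from skill_dir."""
    expected = list(BASE_FILES)
    if template == "python_script":
        # skill.py is generated for the python_script template only.
        expected.append("skill.py")
    root = Path(skill_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

For example, missing_scaffold_files("skills/my_api_skill") returns an empty list when all three base files are present.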

3. Evaluate Skills

Run benchmark evaluations against one or more skills. This is the core command.

Metadata

Author: @islinxu
Stars: 2287
Views: 1
Updated: 2026-03-09

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-islinxu-eval-skills": {
      "enabled": true,
      "auto_update": true
    }
  }
}
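
If clawhub.json already lists other plugins, pasting the fragment by hand risks overwriting them. The sketch below merges the entry instead; it assumes clawhub.json is plain JSON with a top-level "plugins" object, as the fragment above suggests, and enable_plugin is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union

PLUGIN_ID = "official-islinxu-eval-skills"


def enable_plugin(
    config_path: Union[str, Path], plugin_id: str = PLUGIN_ID
) -> Dict[str, Any]:
    """Add or update one plugin entry while keeping existing entries."""
    path = Path(config_path)
    # Start from the existing config if there is one, else from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    plugins = config.setdefault("plugins", {})
    plugins[plugin_id] = {"enabled": True, "auto_update": True}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```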
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.