ClawKit Reliability Toolkit
Official Verified

skill-quality-eval

Skill Quality Evaluator - Score any skill on 6 dimensions. Catch the 30% of skills that look good but fail silently. Based on Tessl Research 2026 findings.


Install via CLI (Recommended)

clawhub install openclaw/skills/skills/aptratcn/xiaobai-skill-quality-eval

Skill Quality Evaluator 📊

Score any skill on 6 dimensions. Catch the 30% of skills that look good but fail silently.

Why This Matters

Tessl Research (April 2026) found:

  • 20% accuracy gain when using a good skill vs. no skill
  • 3x cost savings when a small model with the right skill matches a large model
  • 40% activation rate: agents often fail to use available skills
  • 30% of evaluation tasks have leakage, which makes skills seem great when they aren't

This skill helps you evaluate and improve your skills systematically.

6-Dimension Evaluation

1. Activation Reliability (0-100)

Can the agent find and activate this skill when needed?

Checklist:

  • Trigger words are specific and unambiguous
  • Description matches actual functionality
  • No conflicting skills with similar triggers
  • Skill is discovered when user asks relevant questions

Common Issues:

  • Vague description → agent doesn't know when to use it
  • Missing trigger words → skill never activates
  • Too broad → activates when it shouldn't

Score Guide:

  • 90+: Agent activates correctly 95%+ of the time
  • 70-89: Activates in most relevant contexts
  • 50-69: Sometimes activates, sometimes misses
  • <50: Agent rarely finds/uses this skill
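
One way to put a number on activation is to replay a labeled set of prompts and count how often the skill fires when it should and when it should not. The sketch below is illustrative only: `Trial`, `activation_stats`, and `score_band` are hypothetical names, not part of this skill, and the bands simply mirror the Score Guide above.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One replayed prompt (hypothetical record, not a ClawKit type)."""
    should_activate: bool  # was the skill relevant to this prompt?
    did_activate: bool     # did the agent actually use the skill?


def activation_stats(trials):
    """Return (hit_rate, spurious_rate) over a list of Trial records.

    hit_rate: share of relevant prompts where the skill activated.
    spurious_rate: share of irrelevant prompts where it activated anyway.
    """
    relevant = [t for t in trials if t.should_activate]
    irrelevant = [t for t in trials if not t.should_activate]
    hit_rate = sum(t.did_activate for t in relevant) / len(relevant) if relevant else 0.0
    spurious_rate = (
        sum(t.did_activate for t in irrelevant) / len(irrelevant) if irrelevant else 0.0
    )
    return hit_rate, spurious_rate


def score_band(score):
    """Map a 0-100 score to the band labels used in the Score Guide."""
    if score >= 90:
        return "90+"
    if score >= 70:
        return "70-89"
    if score >= 50:
        return "50-69"
    return "<50"
```

A low hit rate points to vague descriptions or missing trigger words; a high spurious rate points to triggers that are too broad. How the two rates combine into a single 0-100 score is left to the evaluator.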

2. Task Coverage (0-100)

Does the skill handle the tasks it claims to cover?

Checklist:

  • Each claimed capability has a usage example
  • Edge cases are documented
  • Known limitations are stated
  • Failure modes are explained

Common Issues:

  • Claims broad coverage but only handles happy path
  • No examples for secondary features
  • Undocumented prerequisites

Score Guide:

  • 90+: All claimed tasks have working examples
  • 70-89: Main tasks covered, some gaps in secondary features
  • 50-69: Core functionality works but incomplete
  • <50: Major claims unsupported
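
The first checklist item (each claimed capability has a usage example) can be checked mechanically. The following is a minimal sketch, assuming you have already extracted the claimed capabilities and their examples by hand; `coverage_report` and the capability names are hypothetical.

```python
def coverage_report(claimed, examples):
    """Compare claimed capabilities against documented examples.

    claimed: list of capability names the skill says it supports.
    examples: dict mapping capability name -> list of usage examples.
    Returns (ratio_covered, missing_capabilities).
    """
    # A capability counts as covered only if it has at least one example.
    missing = [c for c in claimed if not examples.get(c)]
    covered = len(claimed) - len(missing)
    ratio = covered / len(claimed) if claimed else 0.0
    return ratio, missing


if __name__ == "__main__":
    claimed = ["parse", "validate", "export", "diff"]
    examples = {"parse": ["parse a.yaml"], "validate": ["validate a.yaml"], "export": []}
    ratio, missing = coverage_report(claimed, examples)
    print(f"covered: {ratio:.0%}, missing examples for: {missing}")
```

This only measures whether examples exist, not whether they work, so it is a floor on the coverage score rather than the score itself.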

3. Instruction Clarity (0-100)

Can the agent follow the instructions without confusion?

Checklist:

  • Instructions are step-by-step, not vague guidelines
  • Decision points have clear criteria
  • Output format is specified
  • Anti-patterns are listed

Common Issues:

  • "Do X when appropriate" → when is appropriate?
  • Missing priority/precedence rules
  • Contradictory instructions

Score Guide:

  • 90+: Agent follows instructions correctly 95%+ of the time
  • 70-89: Mostly clear, occasional confusion
  • 50-69: Agent frequently asks for clarification
  • <50: Instructions are ambiguous or contradictory
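
Phrases like "when appropriate" are easy to flag automatically before a human review. The sketch below scans skill instructions for a small list of vague qualifiers; the phrase list and the `find_vague_instructions` helper are assumptions for illustration, not something this skill ships.

```python
import re

# Hypothetical starter list; extend it with phrases that cause trouble in your skills.
VAGUE_PHRASES = [
    "when appropriate",
    "as appropriate",
    "as needed",
    "if necessary",
    "where possible",
]

_VAGUE_RE = re.compile("|".join(re.escape(p) for p in VAGUE_PHRASES), re.IGNORECASE)


def find_vague_instructions(text):
    """Return a list of (line_number, phrase) for each vague qualifier found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _VAGUE_RE.finditer(line):
            hits.append((lineno, match.group(0).lower()))
    return hits


if __name__ == "__main__":
    sample = "Do X when appropriate.\nAlways return JSON.\nRetry as needed."
    for lineno, phrase in find_vague_instructions(sample):
        print(f"line {lineno}: replace '{phrase}' with an explicit criterion")
```

A hit is a prompt to add a concrete decision rule. It cannot detect contradictory instructions or missing precedence rules, which still need a manual read.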

4. Leakage Resistance (0-100)

Does the evaluation actually test the skill, or does it leak answers?

Checklist:

  • Examples don't contain verbatim solutions
  • Test tasks require genuine skill application
  • No shortcut paths that bypass skill content
  • Evaluation criteria measure real capability

Common Issues (from Tessl Research):

  • Example tasks are too similar to skill content
  • Skill contains answers verbatim
  • Test can be solved by pattern matching without understanding

Metadata

Author: @aptratcn
Stars: 4473
Views: 1
Updated: 2026-05-01

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-aptratcn-xiaobai-skill-quality-eval": {
      "enabled": true,
      "auto_update": true
    }
  }
}
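
If you would rather not edit the file by hand, the entry above can be merged into an existing config with a few lines of Python. This is a sketch under assumptions: the location of clawhub.json and the `enable_plugin` / `update_config_file` helpers are hypothetical, and only the plugin ID and the two keys come from the snippet above.

```python
import json
from pathlib import Path

PLUGIN_ID = "official-aptratcn-xiaobai-skill-quality-eval"


def enable_plugin(config):
    """Add or overwrite the plugin entry, leaving other plugins untouched."""
    plugins = config.setdefault("plugins", {})
    plugins[PLUGIN_ID] = {"enabled": True, "auto_update": True}
    return config


def update_config_file(path):
    """Load the JSON config at path (or start empty), enable the plugin, write it back."""
    path = Path(path)
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(enable_plugin(config), indent=2) + "\n")


if __name__ == "__main__":
    # Assumes clawhub.json lives in the current directory; adjust as needed.
    update_config_file("clawhub.json")
```

Merging instead of overwriting keeps any plugins you have already configured.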

Tags

#evaluation #quality #skill #testing #reliability #ai-agent
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.