skill-quality-eval
Skill Quality Evaluator - Score any skill on 6 dimensions. Catch the 30% of skills that look good but fail silently. Based on Tessl Research 2026 findings.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/aptratcn/xiaobai-skill-quality-eval
Skill Quality Evaluator 📊
Why This Matters
Tessl Research (April 2026) found:
- 20% accuracy gain when using a good skill versus no skill
- 3x cost savings when a small model paired with the right skill matches a large model
- 40% activation rate: agents often fail to use the skills available to them
- 30% of evaluation tasks have leakage: skills that seem great but aren't
This skill helps you evaluate and improve your skills systematically.
6-Dimension Evaluation
1. Activation Reliability (0-100)
Can the agent find and activate this skill when needed?
Checklist:
- Trigger words are specific and unambiguous
- Description matches actual functionality
- No conflicting skills with similar triggers
- Skill is discovered when user asks relevant questions
Common Issues:
- Vague description → agent doesn't know when to use it
- Missing trigger words → skill never activates
- Too broad → activates when it shouldn't
Score Guide:
- 90+: Agent activates correctly 95%+ of the time
- 70-89: Activates in most relevant contexts
- 50-69: Sometimes activates, sometimes misses
- <50: Agent rarely finds/uses this skill
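One way to put a number behind this dimension is to replay a set of labeled prompts and count how often the skill activates when it should, and how often it fires when it should not. The sketch below is a minimal illustration under assumptions: the `Trial` record, the `activation_stats` helper, and the way trials are collected are all hypothetical and not part of this skill. Only the 95% threshold comes from the score guide above.

```python
from dataclasses import dataclass

# Threshold taken from the score guide: a 90+ score means the agent
# activates correctly 95%+ of the time.
TOP_BAND_HIT_RATE = 0.95


@dataclass
class Trial:
    """One replayed prompt (hypothetical record, for illustration only)."""
    should_activate: bool  # was the skill relevant to this prompt?
    did_activate: bool     # did the agent actually use the skill?


def activation_stats(trials):
    """Return (hit_rate, false_activation_rate) for a list of trials."""
    relevant = [t for t in trials if t.should_activate]
    irrelevant = [t for t in trials if not t.should_activate]
    hit_rate = (
        sum(t.did_activate for t in relevant) / len(relevant) if relevant else 0.0
    )
    false_rate = (
        sum(t.did_activate for t in irrelevant) / len(irrelevant)
        if irrelevant
        else 0.0
    )
    return hit_rate, false_rate


def reaches_top_band(hit_rate):
    """True when the hit rate meets the 95% bar for a 90+ score."""
    return hit_rate >= TOP_BAND_HIT_RATE
```

A low hit rate points at vague descriptions or missing trigger words; a high false-activation rate points at a skill that is too broad.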
2. Task Coverage (0-100)
Does the skill handle the tasks it claims to cover?
Checklist:
- Each claimed capability has a usage example
- Edge cases are documented
- Known limitations are stated
- Failure modes are explained
Common Issues:
- Claims broad coverage but only handles happy path
- No examples for secondary features
- Undocumented prerequisites
Score Guide:
- 90+: All claimed tasks have working examples
- 70-89: Main tasks covered, some gaps in secondary features
- 50-69: Core functionality works but incomplete
- <50: Major claims unsupported
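The first checklist item (each claimed capability has a usage example) can be checked mechanically. The following is a rough sketch, assuming you have already extracted the list of claimed capabilities and a mapping from capability to its examples; both inputs and the `coverage_report` name are hypothetical.

```python
def coverage_report(claimed, examples):
    """Return (covered_ratio, missing) for a skill's claimed capabilities.

    claimed:  list of capability names the skill says it handles
    examples: dict mapping capability name -> list of usage examples
    """
    # A capability counts as uncovered if it has no entry or an empty list.
    missing = [c for c in claimed if not examples.get(c)]
    covered = len(claimed) - len(missing)
    ratio = covered / len(claimed) if claimed else 0.0
    return ratio, missing


if __name__ == "__main__":
    ratio, missing = coverage_report(
        ["score skill", "suggest fixes", "batch mode"],
        {"score skill": ["example 1"], "suggest fixes": []},
    )
    print(f"covered: {ratio:.0%}, missing examples for: {missing}")
```

The ratio only says whether examples exist, not whether they work, so it is a lower bar than the "working examples" the 90+ band asks for.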
3. Instruction Clarity (0-100)
Can the agent follow the instructions without confusion?
Checklist:
- Instructions are step-by-step, not vague guidelines
- Decision points have clear criteria
- Output format is specified
- Anti-patterns are listed
Common Issues:
- "Do X when appropriate" → when is appropriate?
- Missing priority/precedence rules
- Contradictory instructions
Score Guide:
- 90+: Agent follows instructions correctly 95%+ of the time
- 70-89: Mostly clear, occasional confusion
- 50-69: Agent frequently asks for clarification
- <50: Instructions are ambiguous or contradictory
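Vague conditions like "when appropriate" are easy to flag automatically. Here is a small sketch of such a lint; the phrase list is an illustrative assumption, not an official list from this skill, and should be extended for your own instructions.

```python
import re

# Hypothetical starter list of phrases that leave the decision criteria open.
VAGUE_PHRASES = ["when appropriate", "as needed", "if necessary", "where applicable"]
_PATTERN = re.compile(
    "|".join(re.escape(p) for p in VAGUE_PHRASES), re.IGNORECASE
)


def find_vague_phrases(text):
    """Return a list of (line_number, phrase) for each vague phrase found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(0).lower()))
    return hits
```

Each hit is a place where the instruction should state an explicit criterion instead.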
4. Leakage Resistance (0-100)
Does the evaluation actually test the skill, or does it leak answers?
Checklist:
- Examples don't contain verbatim solutions
- Test tasks require genuine skill application
- No shortcut paths that bypass skill content
- Evaluation criteria measure real capability
Common Issues (from Tessl Research):
- Example tasks are too similar to skill content
- Skill contains answers verbatim
- Test can be solved by pattern matching without understanding
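A simple proxy for the first two issues is word n-gram overlap between the skill text and a test task: if most of the task's n-grams appear verbatim in the skill, the task may be answerable by copying. This is a sketch under assumptions, not the method used by Tessl Research; the function names and the default `n=5` are illustrative choices.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leakage_ratio(skill_text, task_text, n=5):
    """Fraction of the task's n-grams that also appear verbatim in the skill.

    Returns 0.0 when the task is shorter than n words.
    """
    task_grams = ngrams(task_text, n)
    if not task_grams:
        return 0.0
    shared = task_grams & ngrams(skill_text, n)
    return len(shared) / len(task_grams)
```

A high ratio is a prompt to rewrite the task, not proof of leakage; paraphrased answers and pattern-matching shortcuts will not show up in an exact-match check like this one.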
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-aptratcn-xiaobai-skill-quality-eval": {
      "enabled": true,
      "auto_update": true
    }
  }
}