llm-benchmark-analyst
search and analyze llm benchmark results within a fixed benchmark universe, then produce evidence-based model strength and weakness reports or domain-leader summaries. use when comparing a model across benchmarks, ranking the best models by domain, explaining what a benchmark measures, checking predecessor-vs-current progress, or writing benchmark reports that must prioritize exact model version, evaluation date, benchmark variant, score semantics, sub-scores, and benchmark defect warnings. works with browser, web, and multimodal extraction for text, table, canvas, or image-only leaderboards.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/chekhovin/llm-benchmark-analystLLM Benchmark Analyst
Overview
Use this skill to research benchmark evidence and write structured reports about:
- a single model's strengths and weaknesses
- best models in a capability domain
- what a benchmark measures and how trustworthy it is
- predecessor vs current-model progress
Default to the user's language. Never invent scores, ranks, dates, benchmark variants, or missing table values.
Core constraints
- Restrict the benchmark universe to
references/benchmark-source.md. If a benchmark is not in that file, exclude it. - Use
references/core-dimensions.mdto collapse scattered benchmarks into a small set of report dimensions. - Follow
references/search-playbook.mdfor routing, overlap expansion, evidence gathering, and comparison anchors. - Follow
references/report-template.mdfor output structure. - Apply
references/data-defect-warnings.mdbenchmark by benchmark, inline and again in the limitations section. - Prefer official benchmark or benchmark-author pages. Use aggregators mainly to discover links and context.
- Record the evaluation mode exactly: benchmark version, split, difficulty, public/private, verified/original, with-tools/without-tools, pass@k, and any visible sub-score names.
- Keep score units exact. Do not average incompatible metrics into a fake composite.
Required workflow
-
Normalize the model identity before searching
- Resolve exact provider, family, generation, version suffix, and release label.
- Put time and version first. Reject ambiguous aliases like
claude,gemini pro,gpt latest, orqwen maxuntil you have the exact currently relevant model string for the searched leaderboard rows. - Capture the evaluation time point or access date for every key score.
-
Route the request through core dimensions before web crawling
- Start with
references/core-dimensions.mdto select the primary dimension(s). - Then list candidate benchmarks inside those dimensions.
- Only then start website-by-website retrieval.
- Keep the first pass narrow and token-efficient: start from the best 3-6 benchmarks for the asked domain, then expand only if needed.
- Start with
-
Expand beyond section labels
- Do not let the source document's headings blind you.
- After selecting the primary dimension, inspect benchmark descriptions and overlap tags to find relevant benchmarks that live in other sections.
- Example: a coding analysis may need coding benchmarks, agentic coding benchmarks, general benchmarks with coding components, and research/math benchmarks with strong code components.
- Example: a multimodal analysis may need vision benchmarks, OCR, GUI/computer-use, multimodal deep-research, and omni/video/audio benchmarks.
Metadata
Not sure this is the right skill?
Describe what you want to build — we'll match you to the best skill from 16,000+ options.
Find the right skillPaste this into your clawhub.json to enable this plugin.
{
"plugins": {
"official-chekhovin-llm-benchmark-analyst": {
"enabled": true,
"auto_update": true
}
}
}