llm-benchmark-analyst

Search and analyze LLM benchmark results within a fixed benchmark universe, then produce evidence-based model strength-and-weakness reports or domain-leader summaries. Use when comparing a model across benchmarks, ranking the best models by domain, explaining what a benchmark measures, checking predecessor-vs-current progress, or writing benchmark reports that must prioritize exact model version, evaluation date, benchmark variant, score semantics, sub-scores, and benchmark defect warnings. Works with browser, web, and multimodal extraction for text, table, canvas, or image-only leaderboards.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/chekhovin/llm-benchmark-analyst

LLM Benchmark Analyst

Overview

Use this skill to research benchmark evidence and write structured reports about:

- a single model's strengths and weaknesses
- the best models in a capability domain
- what a benchmark measures and how trustworthy it is
- predecessor vs current-model progress

Default to the user's language. Never invent scores, ranks, dates, benchmark variants, or missing table values.

Core constraints

- Restrict the benchmark universe to references/benchmark-source.md. If a benchmark is not in that file, exclude it.
- Use references/core-dimensions.md to collapse scattered benchmarks into a small set of report dimensions.
- Follow references/search-playbook.md for routing, overlap expansion, evidence gathering, and comparison anchors.
- Follow references/report-template.md for output structure.
- Apply references/data-defect-warnings.md benchmark by benchmark, inline and again in the limitations section.
- Prefer official benchmark or benchmark-author pages. Use aggregators mainly to discover links and context.
- Record the evaluation mode exactly: benchmark version, split, difficulty, public/private, verified/original, with-tools/without-tools, pass@k, and any visible sub-score names.
- Keep score units exact. Do not average incompatible metrics into a fake composite.
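
The last two constraints can be illustrated with a minimal Python sketch. It is not part of the skill: the `ScoreRecord` fields and the `comparable_mean` helper are hypothetical names chosen for illustration, and the field set simply mirrors the evaluation-mode items listed above. The idea is that every score carries its evaluation mode, and averaging is refused unless benchmark, variant, metric, unit, and tool mode all match.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ScoreRecord:
    """One reported score plus the evaluation mode it was reported under.

    Hypothetical schema for illustration; the skill does not define one.
    """
    model: str          # exact model string, not an alias
    benchmark: str
    variant: str        # version, split, verified/original, etc.
    metric: str         # e.g. "pass@1" or "accuracy"
    unit: str           # e.g. "%" or "elo"
    value: float
    access_date: str    # ISO date on which the score was read
    with_tools: bool = False


def comparable_mean(records):
    """Average scores only if they share benchmark, variant, metric, unit and tool mode."""
    if not records:
        raise ValueError("no records to average")
    modes = {(r.benchmark, r.variant, r.metric, r.unit, r.with_tools) for r in records}
    if len(modes) > 1:
        # Mixing modes would produce the "fake composite" the constraints forbid.
        raise ValueError(f"refusing to average incompatible scores: {sorted(modes)}")
    return mean(r.value for r in records)
```

With placeholder records (not real results), two `pass@1` percentages on the same benchmark variant can be averaged, while mixing a percentage with an Elo-style rating raises `ValueError` instead of returning a misleading number.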

Required workflow

1. Normalize the model identity before searching
   - Resolve exact provider, family, generation, version suffix, and release label.
   - Put time and version first. Reject ambiguous aliases such as claude, gemini pro, gpt latest, or qwen max until you have the exact, currently relevant model string as it appears in the leaderboard rows being searched.
   - Capture the evaluation time point or access date for every key score.
2. Route the request through core dimensions before web crawling
   - Start with references/core-dimensions.md to select the primary dimension(s).
   - Then list candidate benchmarks inside those dimensions.
   - Only then start website-by-website retrieval.
   - Keep the first pass narrow and token-efficient: start from the best 3-6 benchmarks for the asked domain, then expand only if needed.
3. Expand beyond section labels
   - Do not let the source document's headings blind you.
   - After selecting the primary dimension, inspect benchmark descriptions and overlap tags to find relevant benchmarks that live in other sections.
   - Example: a coding analysis may need coding benchmarks, agentic coding benchmarks, general benchmarks with coding components, and research/math benchmarks with strong code components.
   - Example: a multimodal analysis may need vision benchmarks, OCR, GUI/computer-use, multimodal deep-research, and omni/video/audio benchmarks.
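
The workflow steps above can be sketched in Python under some loud assumptions: the `Benchmark` record, its `section` and `overlap_tags` fields, and both helper functions are hypothetical, and "best" is approximated here by input order because the skill does not define a ranking. The sketch rejects the ambiguous aliases named above, then builds a capped first pass that also pulls in benchmarks filed under other sections but tagged with the requested dimension.

```python
from dataclasses import dataclass

# Aliases the workflow names as too ambiguous to search with (illustrative list).
AMBIGUOUS_ALIASES = {"claude", "gemini pro", "gpt latest", "qwen max"}


def require_exact_model(name: str) -> str:
    """Return the trimmed model string, or raise if it is a known ambiguous alias."""
    cleaned = " ".join(name.lower().split())
    if cleaned in AMBIGUOUS_ALIASES:
        raise ValueError(f"ambiguous model alias {name!r}: resolve the exact version first")
    return name.strip()


@dataclass(frozen=True)
class Benchmark:
    name: str
    section: str                 # heading it is filed under in the source list
    overlap_tags: frozenset      # other dimensions it also covers


def route(benchmarks, dimension, limit=6):
    """Split candidates into a narrow first pass and a reserve for later expansion.

    Benchmarks filed under the dimension come first, followed by benchmarks
    from other sections whose overlap tags mention it.
    """
    primary = [b for b in benchmarks if b.section == dimension]
    overlap = [b for b in benchmarks
               if b.section != dimension and dimension in b.overlap_tags]
    candidates = primary + overlap
    return candidates[:limit], candidates[limit:]
```

Returning the reserve separately keeps the first pass small while making the "expand only if needed" step explicit rather than a second search from scratch.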

Metadata

Author: @chekhovin

Stars: 3875

Updated: 2026-04-07

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-chekhovin-llm-benchmark-analyst": {
      "enabled": true,
      "auto_update": true
    }
  }
}
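
If the entry has to be merged into an existing file rather than pasted by hand, a minimal Python sketch could look like the following. It assumes only what the snippet above shows (a top-level `plugins` object keyed by plugin name); the helper names and the file path argument are hypothetical, and nothing here is an official ClawHub API.

```python
import json
from pathlib import Path

PLUGIN_KEY = "official-chekhovin-llm-benchmark-analyst"


def enable_plugin(config: dict) -> dict:
    """Return a copy of the config with the plugin entry added, keeping other plugins."""
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))
    plugins[PLUGIN_KEY] = {"enabled": True, "auto_update": True}
    merged["plugins"] = plugins
    return merged


def update_config_file(path) -> None:
    """Read the JSON file at `path` (if it exists), add the plugin entry, and write it back."""
    p = Path(path)
    config = json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}
    p.write_text(json.dumps(enable_plugin(config), indent=2) + "\n", encoding="utf-8")
```

Working on a copy means existing plugin entries and unrelated settings are preserved.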

Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.
