ClawKit Reliability Toolkit
Official Verified

llm-benchmark-analyst

Search and analyze LLM benchmark results within a fixed benchmark universe, then produce evidence-based model strength and weakness reports or domain-leader summaries. Use when comparing a model across benchmarks, ranking the best models by domain, explaining what a benchmark measures, checking predecessor-vs-current progress, or writing benchmark reports that must prioritize exact model version, evaluation date, benchmark variant, score semantics, sub-scores, and benchmark defect warnings. Works with browser, web, and multimodal extraction for text, table, canvas, or image-only leaderboards.


Install via CLI (Recommended)

clawhub install openclaw/skills/skills/chekhovin/llm-benchmark-analyst

LLM Benchmark Analyst

Overview

Use this skill to research benchmark evidence and write structured reports about:

  1. a single model's strengths and weaknesses
  2. best models in a capability domain
  3. what a benchmark measures and how trustworthy it is
  4. predecessor vs current-model progress

Default to the user's language. Never invent scores, ranks, dates, benchmark variants, or missing table values.

Core constraints

  • Restrict the benchmark universe to references/benchmark-source.md. If a benchmark is not in that file, exclude it.
  • Use references/core-dimensions.md to collapse scattered benchmarks into a small set of report dimensions.
  • Follow references/search-playbook.md for routing, overlap expansion, evidence gathering, and comparison anchors.
  • Follow references/report-template.md for output structure.
  • Apply references/data-defect-warnings.md benchmark by benchmark, inline and again in the limitations section.
  • Prefer official benchmark or benchmark-author pages. Use aggregators mainly to discover links and context.
  • Record the evaluation mode exactly: benchmark version, split, difficulty, public/private, verified/original, with-tools/without-tools, pass@k, and any visible sub-score names.
  • Keep score units exact. Do not average incompatible metrics into a fake composite.
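The last two constraints above (record the evaluation mode exactly, never average incompatible metrics) can be sketched as a small data structure. This is a minimal illustration and not part of the skill itself: `ScoreRecord`, its field names, and `mean_score` are hypothetical names chosen here, and any values passed in are the caller's own evidence, never defaults to be guessed.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class ScoreRecord:
    """One benchmark score plus the evaluation mode it was measured under.

    Hypothetical schema for illustration; the skill does not prescribe one.
    """
    model: str        # exact model string, not an alias
    benchmark: str
    variant: str      # version, split, difficulty, verified/original, ...
    metric: str       # e.g. "pass@1" or "accuracy"
    unit: str         # e.g. "percent" or "elo"
    value: float
    with_tools: Optional[bool] = None  # None when the source does not say
    access_date: str = ""              # ISO date the score was read


def mean_score(records):
    """Average values only when every record shares one evaluation mode."""
    if not records:
        raise ValueError("no records to average")
    modes = {
        (r.benchmark, r.variant, r.metric, r.unit, r.with_tools)
        for r in records
    }
    if len(modes) > 1:
        # Mixed benchmarks, variants, metrics, units, or tool settings:
        # refuse rather than produce a fake composite.
        raise ValueError(
            "refusing to average incompatible metrics: "
            + "; ".join(sorted(map(str, modes)))
        )
    return mean(r.value for r in records)
```

Raising an error instead of silently averaging keeps the decision visible: a report can still list mixed-mode scores side by side, but any aggregate has to be requested per evaluation mode.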

Required workflow

  1. Normalize the model identity before searching

    • Resolve exact provider, family, generation, version suffix, and release label.
    • Put time and version first. Reject ambiguous aliases like claude, gemini pro, gpt latest, or qwen max until you have resolved the exact model string that appears in the leaderboard rows being searched.
    • Capture the evaluation time point or access date for every key score.
  2. Route the request through core dimensions before web crawling

    • Start with references/core-dimensions.md to select the primary dimension(s).
    • Then list candidate benchmarks inside those dimensions.
    • Only then start website-by-website retrieval.
    • Keep the first pass narrow and token-efficient: start from the best 3-6 benchmarks for the asked domain, then expand only if needed.
  3. Expand beyond section labels

    • Do not let the source document's headings blind you.
    • After selecting the primary dimension, inspect benchmark descriptions and overlap tags to find relevant benchmarks that live in other sections.
    • Example: a coding analysis may need coding benchmarks, agentic coding benchmarks, general benchmarks with coding components, and research/math benchmarks with strong code components.
    • Example: a multimodal analysis may need vision benchmarks, OCR, GUI/computer-use, multimodal deep-research, and omni/video/audio benchmarks.
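Steps 1 to 3 above can be sketched in a few lines. This is a rough illustration under stated assumptions: the function names, the "no digits means ambiguous" heuristic, and the candidate dictionaries with `dimension` and `overlap_tags` keys are all hypothetical, since the actual layout of references/core-dimensions.md is not shown here.

```python
import re

# Vague aliases named in the workflow text; extend as needed.
AMBIGUOUS_ALIASES = {"claude", "gemini pro", "gpt latest", "qwen max"}


def is_ambiguous(model_name: str) -> bool:
    """Return True when the name is a known vague alias or has no digits."""
    name = " ".join(model_name.lower().split())
    if name in AMBIGUOUS_ALIASES:
        return True
    # Heuristic (an assumption, not a rule from the skill): an exact
    # model string usually carries a version number or date.
    return re.search(r"\d", name) is None


def first_pass(candidates, domain, limit=6):
    """Select up to `limit` benchmarks for a narrow first pass.

    Benchmarks whose primary dimension matches come first, followed by
    benchmarks from other sections whose overlap tags name the domain.
    """
    primary = [c["name"] for c in candidates if c["dimension"] == domain]
    overlap = [
        c["name"]
        for c in candidates
        if c["dimension"] != domain and domain in c.get("overlap_tags", ())
    ]
    return (primary + overlap)[:limit]
```

A caller would check `is_ambiguous` before any retrieval and ask for the exact model string when it returns True, then pass the routed candidates to `first_pass` and expand the limit only if the first pass leaves the question unanswered.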

Metadata

Author: @chekhovin
Stars: 3875
Views: 0
Updated: 2026-04-07
Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-chekhovin-llm-benchmark-analyst": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.