Official Verified

agentbench

Benchmark your OpenClaw agent across 40 real-world tasks. The suite tests file creation, research, data analysis, multi-step workflows, memory, error handling, and tool efficiency. It is not a coding benchmark; it measures your agent setup and configuration.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/exe215/agentbench

AgentBench for OpenClaw

Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains.

Commands

When the user says any of these, follow the corresponding instructions:

  • /benchmark — Run the full benchmark suite (all 40 tasks)
  • /benchmark --fast — Run only easy+medium tasks (19 tasks)
  • /benchmark --suite <name> — Run one domain only
  • /benchmark --task <id> — Run a single task
  • /benchmark --strict — Tag results as externally verified scoring
  • /benchmark-list — List all tasks grouped by domain
  • /benchmark-results — Show results from previous runs
  • /benchmark-compare — Compare two runs side-by-side

Flags are combinable: /benchmark --fast --suite research
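
As a rough illustration of how the combinable flags above might be interpreted, here is a minimal Python sketch. The names BenchmarkOptions and parse_benchmark_command are hypothetical and not part of the skill; the sketch covers only the /benchmark command and the four flags listed above, and it rejects anything else.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkOptions:
    """Hypothetical container for the flags listed in the Commands section."""
    fast: bool = False
    strict: bool = False
    suite: Optional[str] = None
    task: Optional[str] = None


def parse_benchmark_command(message: str) -> BenchmarkOptions:
    """Parse a '/benchmark ...' message into options (illustrative only)."""
    tokens = shlex.split(message)
    if not tokens or tokens[0] != "/benchmark":
        raise ValueError("not a /benchmark command")

    opts = BenchmarkOptions()
    i = 1
    while i < len(tokens):
        flag = tokens[i]
        if flag == "--fast":
            opts.fast = True
        elif flag == "--strict":
            opts.strict = True
        elif flag in ("--suite", "--task"):
            # These two flags take a value: <name> or <id>.
            if i + 1 >= len(tokens):
                raise ValueError(f"{flag} requires a value")
            setattr(opts, flag[2:], tokens[i + 1])
            i += 1
        else:
            raise ValueError(f"unknown flag: {flag}")
        i += 1
    return opts
```

Calling parse_benchmark_command("/benchmark --fast --suite research") would yield options with fast set and suite equal to "research", leaving task unset.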

Running a Benchmark

Step 1: Discover Tasks

Read task.yaml files from the tasks/ directory in this skill:

tasks/{suite-name}/{task-name}/task.yaml

Each task.yaml contains: name, id, suite, difficulty, mode, user_message, input_files, expected_outputs, expected_metrics, scoring weights.

Filter by --suite or --task if specified. If --fast is set and --task is not, filter to only tasks where difficulty is "easy" or "medium".

Profile is "fast" if --fast was specified, otherwise "full".

List discovered tasks with count and suites.
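
The discovery and filtering rules in Step 1 could be sketched roughly as follows. This is an assumption-laden illustration: find_task_files, select_tasks, and summarize are hypothetical helper names, and each task is assumed to be already parsed from its task.yaml into a dict with the id, suite, and difficulty fields (YAML parsing itself is left out, since it needs a third-party library).

```python
from pathlib import Path
from typing import Dict, Iterable, List, Optional


def find_task_files(tasks_root: Path) -> List[Path]:
    """Return every tasks/{suite-name}/{task-name}/task.yaml path, sorted."""
    return sorted(tasks_root.glob("*/*/task.yaml"))


def select_tasks(
    tasks: Iterable[Dict[str, str]],
    suite: Optional[str] = None,
    task_id: Optional[str] = None,
    fast: bool = False,
) -> List[Dict[str, str]]:
    """Apply the --suite, --task, and --fast filters to parsed tasks."""
    selected = []
    for task in tasks:
        if suite is not None and task["suite"] != suite:
            continue
        if task_id is not None and task["id"] != task_id:
            continue
        # --fast narrows by difficulty only when --task was not given.
        if fast and task_id is None and task["difficulty"] not in ("easy", "medium"):
            continue
        selected.append(task)
    return selected


def summarize(tasks: List[Dict[str, str]]) -> str:
    """Build a one-line listing of the task count and the suites covered."""
    suites = sorted({task["suite"] for task in tasks})
    return f"Discovered {len(tasks)} tasks in suites: {', '.join(suites)}"
```

Note that a task requested explicitly with --task is kept even under --fast, whatever its difficulty, which mirrors the "--fast is set and --task is not" condition above.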

Step 2: Set Up Run Directory

Generate a run ID from the current timestamp: YYYYMMDD-HHmmss

Read suite_version from skill.json in this skill directory.

Create the results directory:

agentbench-results/{run-id}/

Announce: Starting AgentBench run {run-id} | Profile: {profile} | Suite version: {suite_version} | Tasks: {count}
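
A minimal sketch of Step 2, under a few assumptions: the helper names (make_run_id, read_suite_version, start_run) are hypothetical, suite_version is assumed to be a top-level key in skill.json, and the run ID uses local time because the text does not specify a time zone.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def make_run_id(now: Optional[datetime] = None) -> str:
    """Format a timestamp as YYYYMMDD-HHmmss (local time is an assumption)."""
    now = now or datetime.now()
    return now.strftime("%Y%m%d-%H%M%S")


def read_suite_version(skill_dir: Path) -> str:
    """Read suite_version from skill.json (assumed to be a top-level key)."""
    with open(skill_dir / "skill.json", encoding="utf-8") as fh:
        return str(json.load(fh)["suite_version"])


def start_run(results_root: Path, run_id: str, fast: bool,
              suite_version: str, count: int) -> str:
    """Create {results_root}/{run-id}/ and return the announcement line."""
    profile = "fast" if fast else "full"
    # exist_ok=False: a second run in the same second raises instead of
    # silently reusing an existing results directory.
    (results_root / run_id).mkdir(parents=True, exist_ok=False)
    return (
        f"Starting AgentBench run {run_id} | Profile: {profile} | "
        f"Suite version: {suite_version} | Tasks: {count}"
    )
```

Here results_root would be the agentbench-results directory named above.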

Step 3: Execute Each Task

For each task:

  1. Set up workspace:

    • Create /tmp/agentbench-task-{task-id}/ as workspace
    • Copy input files from tasks/{suite}/{task}/inputs/ to the workspace (if inputs/ exists)
    • If the task directory contains a setup.sh: run bash tasks/{suite}/{task}/setup.sh {workspace-path}
    • For file-unchanged validators: compute checksums of specified files after setup, before task execution
  2. Announce: Running: {task.name} [{task.suite}] (difficulty: {task.difficulty})

  3. Record start time (milliseconds): date +%s%3N (the %N specifier requires GNU date; BSD/macOS date does not support it)

  4. Execute the task yourself directly:

    • Read the task's user_message and execute it as if a real user sent you the request
    • Work ONLY within the workspace directory
    • If input files are listed, read them from the workspace
    • Execute naturally — use the appropriate tools (read, write, edit, exec, web_search, web_fetch, etc.)
    • Create any output files in the workspace directory
    • When done, write a brief execution-trace.md to the workspace:
      • What you understood the task to be
      • What approach you took
      • What files you created or modified
      • Any difficulties or decisions you made
  5. Record end time and compute duration
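
The bookkeeping around task execution (workspace setup, checksums for file-unchanged validators, and timing) might look roughly like the sketch below. All function names are hypothetical. SHA-256 is an assumption, since the steps above say only "checksums", and the timing helper is a portable stand-in for date +%s%3N.

```python
import hashlib
import shutil
import subprocess
import time
from pathlib import Path
from typing import Dict, Iterable


def workspace_for(task_id: str) -> Path:
    """Workspace path for one task: /tmp/agentbench-task-{task-id}/."""
    return Path("/tmp") / f"agentbench-task-{task_id}"


def prepare_workspace(task_dir: Path, workspace: Path) -> None:
    """Create the workspace, copy inputs/ if present, run setup.sh if present."""
    workspace.mkdir(parents=True, exist_ok=True)
    inputs = task_dir / "inputs"
    if inputs.is_dir():
        shutil.copytree(inputs, workspace, dirs_exist_ok=True)
    setup = task_dir / "setup.sh"
    if setup.is_file():
        subprocess.run(["bash", str(setup), str(workspace)], check=True)


def checksum_files(workspace: Path, relative_paths: Iterable[str]) -> Dict[str, str]:
    """Hash the given files after setup, before the task runs (SHA-256 assumed)."""
    return {
        rel: hashlib.sha256((workspace / rel).read_bytes()).hexdigest()
        for rel in relative_paths
    }


def now_ms() -> int:
    """Current Unix time in milliseconds."""
    return time.time_ns() // 1_000_000
```

A caller would take checksums once after prepare_workspace, record start = now_ms(), execute the task, and then compute the duration as now_ms() - start. Comparing a second checksum_files result against the first shows whether a protected file was modified.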

Metadata

Author: @exe215
Stars: 2387
Views: 0
Updated: 2026-03-09

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-exe215-agentbench": {
      "enabled": true,
      "auto_update": true
    }
  }
}
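
If clawhub.json already holds other settings, pasting the fragment by hand risks overwriting them. The following is a hedged sketch of merging the entry programmatically. The function name enable_plugin is hypothetical, and the only structure assumed is the "plugins" object shown above.

```python
import json
from pathlib import Path


def enable_plugin(config_path: Path,
                  plugin_id: str = "official-exe215-agentbench") -> dict:
    """Add the plugin entry to clawhub.json, keeping any existing keys."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    # Only the "plugins" object shown in the snippet above is assumed here.
    plugins = config.setdefault("plugins", {})
    plugins[plugin_id] = {"enabled": True, "auto_update": True}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```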
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.