AgentBench for OpenClaw

Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains.

Commands

When the user says any of these, follow the corresponding instructions:

Flags are combinable: /benchmark --fast --suite research

Read task.yaml files from the tasks/ directory in this skill:

tasks/{suite-name}/{task-name}/task.yaml

Each task.yaml contains: name, id, suite, difficulty, mode, user_message, input_files, expected_outputs, expected_metrics, scoring weights.

Filter by --suite or --task if specified. If --fast is set and --task is not, keep only tasks whose difficulty is "easy" or "medium".
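The filtering rules above can be sketched roughly as follows. This is an illustration only, not the skill's actual implementation: `select_tasks` is a hypothetical helper, the tasks are assumed to be already parsed from their task.yaml files into dicts, and `--task` is assumed to match the `id` field.

```python
def select_tasks(tasks, suite=None, task=None, fast=False):
    """Apply --suite, --task, and --fast to a list of parsed task dicts."""
    selected = list(tasks)
    if suite is not None:
        selected = [t for t in selected if t["suite"] == suite]
    if task is not None:
        # Assumption: --task is matched against the task's id field.
        selected = [t for t in selected if t["id"] == task]
    elif fast:
        # --fast only narrows the set when --task was not given.
        selected = [t for t in selected if t["difficulty"] in ("easy", "medium")]
    return selected
```

Note that an explicit --task always wins over --fast, so a single hard task can still be run under the fast profile.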

Profile is "fast" if --fast was specified, otherwise "full".

List discovered tasks with count and suites.

Generate a run ID from the current timestamp: YYYYMMDD-HHmmss
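One possible way to produce a run ID in this format is shown below. `make_run_id` is a hypothetical helper, and local time is an assumption, since the time zone is not specified here.

```python
from datetime import datetime


def make_run_id(now=None):
    """Format a timestamp as YYYYMMDD-HHmmss (local time assumed)."""
    if now is None:
        now = datetime.now()
    return now.strftime("%Y%m%d-%H%M%S")
```

Passing `now` explicitly keeps the helper deterministic, which is convenient if the same run ID must be reused across several steps.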

Read suite_version from skill.json in this skill directory.

Create the results directory:

agentbench-results/{run-id}/

Announce: Starting AgentBench run {run-id} | Profile: {profile} | Suite version: {suite_version} | Tasks: {count}
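The run initialization above (read suite_version, create the results directory, build the announcement) might look like this sketch. The function names and the `base_dir` parameter are hypothetical, and `suite_version` is assumed to be a top-level key in skill.json.

```python
import json
from pathlib import Path


def read_suite_version(skill_dir):
    """Read suite_version from skill.json (assumed to be a top-level key)."""
    data = json.loads((Path(skill_dir) / "skill.json").read_text())
    return data["suite_version"]


def start_run(base_dir, run_id, profile, suite_version, count):
    """Create agentbench-results/{run-id}/ and return it with the banner."""
    results_dir = Path(base_dir) / "agentbench-results" / run_id
    results_dir.mkdir(parents=True, exist_ok=True)
    banner = (
        f"Starting AgentBench run {run_id} | Profile: {profile} | "
        f"Suite version: {suite_version} | Tasks: {count}"
    )
    return results_dir, banner
```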

For each task:

Set up workspace:
- Create /tmp/agentbench-task-{task-id}/ as workspace
- Copy input files from tasks/{suite}/{task}/inputs/ to the workspace (if inputs/ exists)
- If the task directory contains a setup.sh: run bash tasks/{suite}/{task}/setup.sh {workspace-path}
- For file-unchanged validators: compute checksums of specified files after setup, before task execution
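As a rough sketch, the workspace setup steps above could be scripted as follows. `setup_workspace`, `sha256_of`, `watched_files`, and `tmp_root` are hypothetical names, and SHA-256 is an assumption: the instructions only say "checksums" without naming an algorithm.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def setup_workspace(task_dir, task_id, watched_files=(), tmp_root="/tmp"):
    """Prepare the task workspace and return it with baseline checksums."""
    task_dir = Path(task_dir)
    workspace = Path(tmp_root) / f"agentbench-task-{task_id}"
    workspace.mkdir(parents=True, exist_ok=True)

    # Copy input files only if the task ships an inputs/ directory.
    inputs = task_dir / "inputs"
    if inputs.is_dir():
        shutil.copytree(inputs, workspace, dirs_exist_ok=True)  # Python 3.8+

    # Run setup.sh with the workspace path as its only argument.
    setup_script = task_dir / "setup.sh"
    if setup_script.is_file():
        subprocess.run(["bash", str(setup_script), str(workspace)], check=True)

    # Baseline for file-unchanged validators: taken after setup,
    # before the task itself runs.
    checksums = {name: sha256_of(workspace / name) for name in watched_files}
    return workspace, checksums
```

After the task finishes, recomputing the digests for the same files and comparing them with the baseline shows whether any of them changed.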
Announce: Running: {task.name} [{task.suite}] (difficulty: {task.difficulty})
Record start time in milliseconds: date +%s%3N (the %3N specifier requires GNU date; BSD/macOS date does not support it)
Execute the task yourself directly:
- Read the task's user_message and execute it as if a real user sent you the request
- Work ONLY within the workspace directory
- If input files are listed, read them from the workspace
- Execute naturally — use the appropriate tools (read, write, edit, exec, web_search, web_fetch, etc.)
- Create any output files in the workspace directory
- When done, write a brief execution-trace.md to the workspace:
  - What you understood the task to be
  - What approach you took
  - What files you created or modified
  - Any difficulties or decisions you made
Record end time the same way and compute duration in milliseconds (end time minus start time)
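Where the date command is unavailable or lacks millisecond support, the start/end timing can be approximated along these lines. This is a sketch only: `now_ms` and `timed` are hypothetical helpers, not part of the skill.

```python
import time


def now_ms():
    """Current wall-clock time in whole milliseconds since the epoch."""
    return time.time_ns() // 1_000_000


def timed(fn):
    """Run fn() and return (result, duration in milliseconds)."""
    start = now_ms()
    result = fn()
    end = now_ms()
    return result, end - start
```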