agentbench
Benchmark your OpenClaw agent across 40 real-world tasks. Tests file creation, research, data analysis, multi-step workflows, memory, error handling, and tool efficiency. Not a coding benchmark — measures your agent setup and config.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/exe215/agentbench
AgentBench for OpenClaw
Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains.
Commands
When the user says any of these, follow the corresponding instructions:
- /benchmark: Run the full benchmark suite (all 40 tasks)
- /benchmark --fast: Run only easy+medium tasks (19 tasks)
- /benchmark --suite <name>: Run one domain only
- /benchmark --task <id>: Run a single task
- /benchmark --strict: Tag results as externally verified scoring
- /benchmark-list: List all tasks grouped by domain
- /benchmark-results: Show results from previous runs
- /benchmark-compare: Compare two runs side-by-side
Flags are combinable: /benchmark --fast --suite research
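To illustrate how the combinable flags might be read, here is a minimal Python sketch using the standard argparse module. The function name parse_benchmark_command is hypothetical, and the skill itself does not prescribe any parser; this only shows that the four /benchmark flags are independent of one another.

```python
import argparse
import shlex


def parse_benchmark_command(command: str) -> argparse.Namespace:
    """Split a '/benchmark ...' message into its flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="/benchmark")
    parser.add_argument("--fast", action="store_true")    # easy+medium tasks only
    parser.add_argument("--suite")                         # restrict to one domain
    parser.add_argument("--task")                          # restrict to one task id
    parser.add_argument("--strict", action="store_true")  # tag results as strict

    tokens = shlex.split(command)
    if not tokens or tokens[0] != "/benchmark":
        raise ValueError("not a /benchmark command")
    return parser.parse_args(tokens[1:])


# The combined example from the text above.
args = parse_benchmark_command("/benchmark --fast --suite research")
```

Here args.fast is True and args.suite is "research", while args.task stays None and args.strict stays False.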
Running a Benchmark
Step 1: Discover Tasks
Read task.yaml files from the tasks/ directory in this skill:
tasks/{suite-name}/{task-name}/task.yaml
Each task.yaml contains: name, id, suite, difficulty, mode, user_message, input_files, expected_outputs, expected_metrics, scoring weights.
Filter by --suite or --task if specified. If --fast is set and --task is not, filter to only tasks where difficulty is "easy" or "medium".
Profile is "fast" if --fast was specified, otherwise "full".
List discovered tasks with count and suites.
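The discovery and filtering rules in this step can be sketched in Python as follows. This is an illustration under stated assumptions, not the skill's own code: the helper names are hypothetical, and each task is assumed to have already been loaded from its task.yaml into a dict with the id, suite, and difficulty fields listed above (YAML parsing is left out because it needs a third-party library).

```python
from pathlib import Path


def find_task_files(skill_dir):
    """Return every tasks/{suite-name}/{task-name}/task.yaml under the skill directory."""
    return sorted(Path(skill_dir).glob("tasks/*/*/task.yaml"))


def select_tasks(tasks, suite=None, task_id=None, fast=False):
    """Apply the --suite, --task, and --fast filters to already-loaded task dicts."""
    selected = [
        t for t in tasks
        if (suite is None or t["suite"] == suite)
        and (task_id is None or t["id"] == task_id)
    ]
    # --fast narrows by difficulty only when no single task was requested.
    if fast and task_id is None:
        selected = [t for t in selected if t["difficulty"] in ("easy", "medium")]
    # The profile depends only on whether --fast was given.
    profile = "fast" if fast else "full"
    return selected, profile


def summarize(selected):
    """Build the 'count and suites' line announced at the end of this step."""
    suites = sorted({t["suite"] for t in selected})
    return f"Discovered {len(selected)} task(s) in suites: {', '.join(suites)}"
```

Note that a run with both --fast and --task keeps the requested task regardless of its difficulty, but is still reported with the "fast" profile.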
Step 2: Set Up Run Directory
Generate a run ID from the current timestamp: YYYYMMDD-HHmmss
Read suite_version from skill.json in this skill directory.
Create the results directory:
agentbench-results/{run-id}/
Announce: Starting AgentBench run {run-id} | Profile: {profile} | Suite version: {suite_version} | Tasks: {count}
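As a rough sketch of this step, the Python below builds the run ID, reads suite_version, creates the results directory, and formats the announcement. The function name start_run and its parameters are hypothetical. It also assumes that suite_version is a top-level key in skill.json and that the timestamp uses local time; the text above specifies neither.

```python
import json
from datetime import datetime
from pathlib import Path


def start_run(skill_dir, task_count, profile,
              results_root="agentbench-results", now=None):
    """Create agentbench-results/{run-id}/ and return (run_id, run_dir, banner)."""
    now = now or datetime.now()  # assumption: local time
    run_id = now.strftime("%Y%m%d-%H%M%S")  # YYYYMMDD-HHmmss

    # Assumption: skill.json stores suite_version as a top-level key.
    skill_meta = json.loads((Path(skill_dir) / "skill.json").read_text())
    suite_version = skill_meta["suite_version"]

    run_dir = Path(results_root) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)

    banner = (
        f"Starting AgentBench run {run_id} | Profile: {profile} | "
        f"Suite version: {suite_version} | Tasks: {task_count}"
    )
    return run_id, run_dir, banner
```

Passing now explicitly makes the run ID reproducible, which is mainly useful when testing.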
Step 3: Execute Each Task
For each task:
- Set up workspace:
  - Create /tmp/agentbench-task-{task-id}/ as workspace
  - Copy input files from tasks/{suite}/{task}/inputs/ to the workspace (if inputs/ exists)
  - If the task directory contains a setup.sh: run bash tasks/{suite}/{task}/setup.sh {workspace-path}
  - For file-unchanged validators: compute checksums of specified files after setup, before task execution
- Announce: Running: {task.name} [{task.suite}] (difficulty: {task.difficulty})
- Record start time (milliseconds): date +%s%3N
- Execute the task yourself directly:
  - Read the task's user_message and execute it as if a real user sent you the request
  - Work ONLY within the workspace directory
  - If input files are listed, read them from the workspace
  - Execute naturally, using the appropriate tools (read, write, edit, exec, web_search, web_fetch, etc.)
  - Create any output files in the workspace directory
  - When done, write a brief execution-trace.md to the workspace:
    - What you understood the task to be
    - What approach you took
    - What files you created or modified
    - Any difficulties or decisions you made
- Record end time and compute duration
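The workspace setup, checksum, and timing parts of this step could look roughly like the Python sketch below. All function names are hypothetical. SHA-256 is an assumption, since the text does not name a checksum algorithm, and an existing workspace is reused rather than cleaned, which the text also leaves open. The millisecond clock stands in for date +%s%3N, whose %3N specifier is a GNU date extension and may not be available in other date implementations.

```python
import hashlib
import shutil
import subprocess
import time
from pathlib import Path


def sha256_of(path):
    """Checksum of one file (SHA-256 is an assumed choice of algorithm)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def prepare_workspace(task_dir, task_id, unchanged_files=(), workspace_root="/tmp"):
    """Create the workspace, copy inputs, run setup.sh, and record checksums."""
    task_dir = Path(task_dir)
    workspace = Path(workspace_root) / f"agentbench-task-{task_id}"
    workspace.mkdir(parents=True, exist_ok=True)  # assumption: reuse if present

    inputs = task_dir / "inputs"
    if inputs.is_dir():
        shutil.copytree(inputs, workspace, dirs_exist_ok=True)

    setup = task_dir / "setup.sh"
    if setup.is_file():
        subprocess.run(["bash", str(setup), str(workspace)], check=True)

    # Checksums are taken after setup and before the task runs, so that a
    # file-unchanged validator can compare against them afterwards.
    checksums = {name: sha256_of(workspace / name) for name in unchanged_files}
    return workspace, checksums


def now_ms():
    """Current time in whole milliseconds since the epoch."""
    return time.time_ns() // 1_000_000


def timed(run_task):
    """Run a zero-argument callable and return its duration in milliseconds."""
    start = now_ms()
    run_task()
    return now_ms() - start
```

After the task finishes, recomputing sha256_of for each recorded file and comparing it with the stored value shows whether the file was left untouched.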
Paste this into your clawhub.json to enable this plugin.
{
"plugins": {
"official-exe215-agentbench": {
"enabled": true,
"auto_update": true
}
}
}