Official Verified developer tools Safety 4/5

memory-bench-pioneer

Be one of the first to benchmark your agent's memory — and help shape how AI remembers. Runs a peer-review-grade evaluation suite (LLM-as-judge, nDCG/MAP/MRR with 95% CIs, ablation studies) against your live memory system and submits anonymized results to the ENGRAM/CORTEX research papers. Your data stays private; only aggregate stats leave. Works with agent-memory-ultimate. For the bold few who believe AI memory should be measured, not guessed at.

Why use this skill?

Benchmark your OpenClaw agent's memory with peer-review-grade evaluation tools. Measure, analyze, and submit anonymized data to the ENGRAM and CORTEX research papers.

What This Skill Does

The memory-bench-pioneer skill provides a peer-review-grade evaluation suite for agent memory systems, specifically designed to function alongside the agent-memory-ultimate architecture. It serves as a rigorous testing framework to measure, analyze, and report on the efficacy of AI memory. By running standardized benchmarks, including LLM-as-a-judge assessments, nDCG, MAP, and MRR metrics with 95% confidence intervals, developers can gain deep insights into their agent's retrieval performance. The skill also performs ablation studies to isolate the impact of mechanisms like spreading activation, ensuring that memory improvements are data-driven rather than anecdotal. All results are anonymized and can be submitted to the ENGRAM and CORTEX research initiatives, fostering a collaborative, evidence-based approach to AI memory design.
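The retrieval metrics named above can be illustrated with a minimal Python sketch. This is not the skill's own code. The function names, the graded-relevance input format, and the percentile bootstrap used for the 95% interval are all assumptions made for illustration. Both average precision and nDCG are computed over the returned list only, which is a simplification when relevant items are missing from the results.

```python
import math
import random
from statistics import mean

def reciprocal_rank(relevances):
    """1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def average_precision(relevances):
    """Mean of precision@k over the ranks k that hold a relevant result."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            hits += 1
            precisions.append(hits / rank)
    return mean(precisions) if precisions else 0.0

def ndcg(relevances, k=10):
    """DCG of the returned order divided by DCG of the best possible order."""
    def dcg(rels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-query scores."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Each inner list is the graded relevance of one query's ranked results.
runs = [[3, 0, 1], [0, 2, 0], [0, 0, 0], [1, 1, 0]]
mrr = mean(reciprocal_rank(r) for r in runs)
map_score = mean(average_precision(r) for r in runs)
ndcg_ci = bootstrap_ci([ndcg(r) for r in runs])
```

MRR and MAP are the means of the per-query values, and the same bootstrap helper can be applied to any per-query score list.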

Installation

To add this capability to your agent, execute the following command in your terminal:

clawhub install openclaw/skills/skills/globalcaos/memory-bench-pioneer

If you intend to use the automated PR submission workflow, make sure the gh CLI is installed. If you prefer the higher-accuracy GPT-4o-mini judge, verify that your OpenAI API key is configured.
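A quick preflight check for these two prerequisites might look like the sketch below. The function name is hypothetical, and the environment variable OPENAI_API_KEY is an assumption (it is the name the OpenAI SDK reads by default); the skill may look for the key elsewhere.

```python
import os
import shutil

def preflight(env=None):
    """Report whether the optional prerequisites appear to be available."""
    env = os.environ if env is None else env
    return {
        # gh is only needed for the automated PR submission workflow.
        "gh_cli": shutil.which("gh") is not None,
        # Assumed variable name; only needed for the GPT-4o-mini judge.
        "openai_key": bool(env.get("OPENAI_API_KEY")),
    }

status = preflight()
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```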

Use Cases

Research Benchmarking: Contribute to the ENGRAM and CORTEX papers by providing anonymized memory retrieval metrics reported with 95% confidence intervals.
System Optimization: Identify bottlenecks in your agent's retrieval logic by using ablation studies to see how specific features affect metrics like Hit Rate and Precision.
Cross-Site Comparability: Use the standardized test set to ensure your agent's memory performance is measured using the same criteria as other OpenClaw users, allowing for objective comparison.
Longitudinal Tracking: Run benchmarks over time to ensure that memory system updates, consolidation, and pruning lead to actual performance gains rather than degradation.
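The ablation and longitudinal use cases both come down to comparing per-query scores from two runs on the same test set. A minimal sketch of that comparison follows, assuming a paired design and a percentile bootstrap; the function name and the sample scores are hypothetical, and the skill's own statistical procedure may differ.

```python
import random
from statistics import mean

def paired_delta(full_scores, ablated_scores, n_resamples=2000, seed=0):
    """Mean per-query difference (full minus ablated) with a 95% bootstrap CI."""
    if len(full_scores) != len(ablated_scores):
        raise ValueError("both runs must score the same queries")
    deltas = [f - a for f, a in zip(full_scores, ablated_scores)]
    rng = random.Random(seed)
    boot = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_resamples)
    )
    lo = boot[int(0.025 * n_resamples)]
    hi = boot[int(0.975 * n_resamples) - 1]
    return mean(deltas), lo, hi

# Hypothetical per-query nDCG with a feature enabled vs. disabled.
with_feature = [0.82, 0.64, 0.91, 0.55, 0.73]
without_feature = [0.80, 0.51, 0.88, 0.49, 0.70]
delta, lo, hi = paired_delta(with_feature, without_feature)
print(f"mean gain {delta:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the feature (or the newer version, in the longitudinal case) plausibly changes the metric; an interval that straddles zero suggests the difference may be noise.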

Example Prompts

"Run a full memory performance assessment using the GPT-4o-mini judge and include an ablation analysis."
"Collect my current memory statistics, review the summary, and generate a pull request for the ENGRAM research repository."
"Evaluate my agent's retrieval quality against the standard test set and output the results for my internal review."

Tips & Limitations

For the most reliable results, use at least 30 queries spanning the full range of difficulty and every query type (semantic, episodic, procedural, strategic). The OpenAI judge incurs a small cost (approx. $0.01 per run); the local judge is free but less precise. The collect.py script keeps your actual memory content, hostnames, and user queries out of the exported report: only aggregate statistics, histograms, and anonymized performance metrics are shared in the PR submission. Verify the agent-memory-ultimate configuration before beginning a benchmark run.
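The query-set guidance above can be checked mechanically before a run. The sketch below assumes each query is a dict with a "type" field, which is a hypothetical format and not necessarily the one the skill uses.

```python
from collections import Counter

REQUIRED_TYPES = ("semantic", "episodic", "procedural", "strategic")
MIN_QUERIES = 30

def check_query_coverage(queries):
    """Return a list of problems; an empty list means the set looks adequate."""
    counts = Counter(q["type"] for q in queries)
    problems = []
    if len(queries) < MIN_QUERIES:
        problems.append(
            f"only {len(queries)} queries; aim for at least {MIN_QUERIES}"
        )
    for query_type in REQUIRED_TYPES:
        if counts[query_type] == 0:
            problems.append(f"no queries of type '{query_type}'")
    return problems

# Hypothetical query set: 8 queries of each required type.
sample = [{"type": t} for t in REQUIRED_TYPES for _ in range(8)]
for problem in check_query_coverage(sample):
    print("warning:", problem)
```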

Metadata

Author: @globalcaos

Stars: 2387

Updated: 2026-03-09

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-globalcaos-memory-bench-pioneer": {
      "enabled": true,
      "auto_update": true
    }
  }
}
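If clawhub.json already lists other plugins, pasting the snippet by hand risks overwriting them. The sketch below merges the entry instead. It assumes only the structure shown above (a top-level "plugins" object keyed by plugin id); the helper name and the demo file location are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def enable_plugin(config_path, plugin_id, auto_update=True):
    """Add or update one plugin entry, leaving other settings untouched."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    plugins = config.setdefault("plugins", {})
    plugins[plugin_id] = {"enabled": True, "auto_update": auto_update}
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config

# Demo against a throwaway file; point config_path at your real clawhub.json.
demo_path = Path(tempfile.mkdtemp()) / "clawhub.json"
result = enable_plugin(demo_path, "official-globalcaos-memory-bench-pioneer")
print(json.dumps(result, indent=2))
```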

Tags (AI)

#memory #evaluation #benchmarking #llm #research

Safety Score: 4/5

Flags: file-write, file-read, data-collection, external-api, code-execution

Related Skills

jarvis-voice

Turn your AI into JARVIS. Voice, wit, and personality — the complete package. Humor cranked to maximum.

globalcaos 2387

shell-security-ultimate

Classify every shell command as SAFE, WARN, or CRIT before your agent runs it.

globalcaos 2387

memory-pioneer

Benchmark your agent's memory. Contribute anonymized scores to open research. Citizen science for AI memory.

globalcaos 2387

subagent-overseer

Monitor sub-agent health and progress via a pull-based bash daemon. Use when spawning sub-agents that need progress tracking, staleness detection, and automatic status reporting. Replaces manual heartbeat polling with a deterministic status file the agent reads every 3 minutes. Zero AI tokens for monitoring — pure OS-level process checks and filesystem diffs.

globalcaos 2387

model-router

Automatic LLM model selection for sub-agent tasks. Classifies tasks by complexity and type, then routes to the optimal model (cost vs capability). Use when spawning sub-agents, choosing models for cron jobs, or deciding which model to use for any task. Eliminates manual model specification by providing a decision tree and optional cheap-model classifier for ambiguous cases.

globalcaos 2387