autoresearch

Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology.

Triggers

Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on.

Description

Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern. Mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects.

How It Works

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  1. BASELINE │────▶│  2. MUTATE   │────▶│  3. EVALUATE │────▶│  4. DECIDE   │
│  Score the   │     │  Change one  │     │  Run scoring │     │  Better?     │
│  current     │     │  thing       │     │  function    │     │  Keep : Revert│
│  version     │     │              │     │              │     │              │
└─────────────┘     └─────────────┘     └─────────────┘     └──────┬───────┘
                                                                    │
                                                              Loop back to 2

Instructions

Step 1: Identify the Mutable File

The mutable file is the thing you're optimizing. It can be:

A SKILL.md prompt/instructions
A trading strategy config (thresholds, parameters)
A content template (YouTube script format, ad copy structure)
Any text file where changes produce measurable differences

Create or identify this file. Example:

my-skill/
├── SKILL.md          ← this is your mutable file
├── eval/
│   ├── test_cases.json
│   └── score.py

Step 2: Create an Evaluation Function

Your eval function must:

Take the current mutable file as input
Run it against test cases
Return a numeric score (higher = better)

The eval can be anything:

LLM-as-judge: Send output to an LLM, ask it to score 1-100
Backtest: Run a strategy against historical data, measure Sharpe/returns
A/B metrics: CTR, engagement, conversion rate
Binary pass/fail: Count how many test cases pass out of N

Template eval function (customize for your domain):

# eval/score.py
import json
import sys

def evaluate(mutable_file_path: str, test_cases_path: str) -> float:
    """
    Score the current version of the mutable file.
    Returns a float — higher is better.
    """
    with open(mutable_file_path) as f:
        current_version = f.read()
    
    with open(test_cases_path) as f:
        test_cases = json.load(f)
    
    scores = []
    for case in test_cases:
        # YOUR SCORING LOGIC HERE
        # Example: run the prompt, compare output to expected
        score = run_and_score(current_version, case)
        scores.append(score)
    
    return sum(scores) / len(scores)

autoresearch

Install via CLI (Recommended)

autoresearch

Triggers

Description

How It Works

Instructions

Step 1: Identify the Mutable File

Step 2: Create an Evaluation Function

Metadata