openclaw-rl-training
OpenClaw-RL framework for training personalized AI agents via reinforcement learning from natural conversation feedback
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/adisinghstudent/openclaw-rl-trainingOpenClaw-RL Training
Skill by ara.so — Daily 2026 Skills collection.
OpenClaw-RL is a fully asynchronous reinforcement learning framework that converts live multi-turn conversations into training signals for personalized AI agents. It wraps a self-hosted model as an OpenAI-compatible API via OpenClaw, intercepts conversations, and continuously optimizes the policy in the background without interrupting usage. It also supports scalable RL for terminal, GUI, SWE, and tool-call agents.
Architecture Overview
Four independent async loops that never block each other:
- Agent Serving — OpenClaw-compatible API serving rollouts
- Rollout Collection — Captures multi-turn conversations as training trajectories
- PRM/Judge Evaluation — Scores turns using next-state feedback (majority voting optional)
- Policy Training — GRPO/OPD/Combine training via slime or Tinker
Installation
git clone https://github.com/Gen-Verse/OpenClaw-RL
cd OpenClaw-RL
# Install core dependencies
pip install -r requirements.txt
# Install slime (training backend)
cd slime && pip install -e . && cd ..
# Optional: install SGLang for fast inference
pip install sglang
Project Structure
OpenClaw-RL/
├── openclaw-rl/ # Binary RL (GRPO) method
├── openclaw-opd/ # On-Policy Distillation method
├── openclaw-combine/ # Combined Binary RL + OPD
├── openclaw-test/ # Evaluation utilities
├── terminal-rl/ # Track 2: Terminal agent RL
├── gui-rl/ # Track 2: GUI agent RL
├── swe-rl/ # Track 2: SWE agent RL
├── toolcall-rl/ # Track 2: Tool-call agent RL
├── slime/ # Core training framework
└── openclaw/ # Runtime / API server
Three Learning Paradigms
1. Binary RL (GRPO)
A Process Reward Model scores each turn from next-state feedback. Uses GRPO advantage estimation with PPO-style clipped surrogate loss.
2. On-Policy Distillation (OPD)
When next state reveals useful hindsight, a judge extracts a textual hint to augment the prompt, creating an enhanced teacher. Token-level log-probability gap becomes a directional advantage signal.
3. Combination Method (Recommended)
Merges Binary RL scalar supervision with OPD token-level directional signal. Strongest and most robust optimization.
Quick Start — Personal Agent (Track 1)
Binary RL Launch Script
# openclaw-rl/run_qwen3_7b_openclaw_rl.sh
export MODEL_PATH=/path/to/qwen3-7b
export DATA_PATH=/path/to/conversation/data
export CKPT_SAVE_DIR=/path/to/checkpoints
bash openclaw-rl/run_qwen3_7b_openclaw_rl.sh
OPD Launch Script
export MODEL_PATH=/path/to/qwen3-7b
export JUDGE_MODEL_PATH=/path/to/judge-model
export DATA_PATH=/path/to/conversation/data
bash openclaw-opd/run_qwen3_7b_openclaw_opd.sh
Combination Method (One Line)
Metadata
Not sure this is the right skill?
Describe what you want to build — we'll match you to the best skill from 16,000+ options.
Find the right skillPaste this into your clawhub.json to enable this plugin.
{
"plugins": {
"official-adisinghstudent-openclaw-rl-training": {
"enabled": true,
"auto_update": true
}
}
}Related Skills
Oh My Openagent Omo
Skill by adisinghstudent
Planning With Files Manus Workflow
Skill by adisinghstudent
mirofish-offline-simulation
Fully local multi-agent swarm intelligence simulation engine using Neo4j + Ollama for public opinion, market sentiment, and social dynamics prediction.
ghostling-libghostty-terminal
Build minimal terminal emulators using the libghostty-vt C API with Raylib for windowing and rendering
Obra Superpowers Agentic Workflow
Skill by adisinghstudent