rag-eval
Evaluate your RAG pipeline quality using Ragas metrics (faithfulness, answer relevancy, context precision).
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/jonathanjing/rag-eval
RAG Eval: Quality Testing for Your RAG Pipeline
Test and monitor your RAG pipeline's output quality.
🛠️ Installation
1. Ask OpenClaw (Recommended)
Tell OpenClaw: "Install the rag-eval skill." The agent will handle the installation and configuration automatically.
2. Manual Installation (CLI)
If you prefer the terminal, run:
clawhub install rag-eval
⚠️ Prerequisites
- Your OpenClaw must have a RAG system (vector DB + retrieval pipeline). This skill evaluates the output quality of that pipeline; it does not provide RAG functionality itself.
- At least one LLM API key is required, because Ragas uses an LLM as judge internally. Set one of:
  - OPENAI_API_KEY (default, uses GPT-4o)
  - ANTHROPIC_API_KEY (uses Claude Haiku)
  - RAGAS_LLM=ollama/llama3 (for local/offline evaluation)
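Before running an evaluation, it can help to confirm that one of these variables is actually set. The following is a minimal sketch, not part of the skill: `detect_judge` and `JUDGE_VARS` are hypothetical names, and the lookup order is an assumption, since the README lists the options without stating their precedence.

```python
import os
from typing import Mapping, Optional

# ASSUMPTION: lookup order is illustrative only; the README does not
# document which variable wins when several are set.
JUDGE_VARS = ("RAGAS_LLM", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def detect_judge(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the first judge variable that is set and non-empty."""
    for name in JUDGE_VARS:
        if env.get(name, "").strip():
            return name
    return None


if __name__ == "__main__":
    found = detect_judge()
    if found:
        print(f"Judge LLM configured via {found}")
    else:
        print("No judge LLM configured; set one of: " + ", ".join(JUDGE_VARS))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.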
Setup (first run only)
bash scripts/setup.sh
This installs ragas, datasets, and other dependencies.
Single Response Evaluation
When the user asks to evaluate an answer, collect:
- question: the original user question
- answer: the LLM output to evaluate
- contexts: the list of text chunks used to generate the answer (retrieved docs)
⚠️ SECURITY: Never interpolate user content directly into shell commands. Write the input to a temp JSON file first, then pipe it to the evaluator:
# Step 1: Write input to a temp file (agent should use the write/edit tool, NOT echo)
# Write this JSON to /tmp/rag-eval-input.json using the file write tool:
# {"question": "...", "answer": "...", "contexts": ["chunk1", "chunk2"]}
# Step 2: Pipe the file to the evaluator
python3 scripts/run_eval.py < /tmp/rag-eval-input.json
# Step 3: Clean up
rm -f /tmp/rag-eval-input.json
Alternatively, use --input-file:
python3 scripts/run_eval.py --input-file /tmp/rag-eval-input.json
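If the evaluation is driven from Python rather than from an agent's file-write tool, the same precaution can be sketched as below. The helper names `write_eval_input` and `run_eval` are hypothetical, and the sketch assumes that `scripts/run_eval.py` prints the result JSON to stdout, which the README implies but does not state. User content only ever reaches the JSON file; the command is passed as an argument list, so no shell parses it.

```python
import json
import os
import subprocess
import tempfile
from typing import List


def write_eval_input(question: str, answer: str, contexts: List[str]) -> str:
    """Serialize one sample to a private temp JSON file and return its path."""
    payload = {"question": question, "answer": answer, "contexts": contexts}
    # mkstemp creates the file with owner-only permissions and a unique name.
    fd, path = tempfile.mkstemp(prefix="rag-eval-input-", suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False)
    return path


def run_eval(path: str) -> dict:
    """Run the evaluator on the file without a shell, then remove the file.

    ASSUMPTION: the script writes its result JSON to stdout.
    """
    try:
        proc = subprocess.run(
            ["python3", "scripts/run_eval.py", "--input-file", path],
            capture_output=True,
            text=True,
            check=True,
        )
        return json.loads(proc.stdout)
    finally:
        os.remove(path)
```

A call such as `run_eval(write_eval_input(q, a, chunks))` keeps quotes, backticks, and `$(...)` in the user's text inert, because they are only ever treated as JSON string data.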
Output JSON:
{
"faithfulness": 0.92,
"answer_relevancy": 0.87,
"context_precision": 0.79,
"overall_score": 0.86,
"verdict": "PASS",
"flags": []
}
Post the results to the user with a human-readable summary:
🧪 Eval Results
• Faithfulness: 0.92 ✅ (no hallucination detected)
• Answer Relevancy: 0.87 ✅
• Context Precision: 0.79 ⚠️ (some irrelevant context retrieved)
• Overall: 0.86 ✅ PASS
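A summary of this shape could be produced from the output JSON roughly as follows. This is an illustrative sketch: `format_summary` and `mark` are hypothetical helpers, and applying the 0.85 and 0.70 thresholds to each individual metric is an assumption that merely matches the sample above. The parenthetical notes in the sample are omitted, since the README does not say how they are derived.

```python
LABELS = {
    "faithfulness": "Faithfulness",
    "answer_relevancy": "Answer Relevancy",
    "context_precision": "Context Precision",
}


def mark(score: float) -> str:
    """Map a score to a status icon (ASSUMPTION: same cut-offs per metric)."""
    if score >= 0.85:
        return "✅"
    if score >= 0.70:
        return "⚠️"
    return "❌"


def format_summary(result: dict) -> str:
    """Render the evaluator's output JSON as a short bulleted summary."""
    lines = ["🧪 Eval Results"]
    for key, label in LABELS.items():
        score = result[key]
        lines.append(f"• {label}: {score:.2f} {mark(score)}")
    overall = result["overall_score"]
    lines.append(f"• Overall: {overall:.2f} {mark(overall)} {result['verdict']}")
    return "\n".join(lines)
```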
Save to memory/eval-results/YYYY-MM-DD.jsonl.
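One way to persist results in that layout is to append one JSON object per line to a file named after the current date. The sketch below assumes append semantics, which the .jsonl extension suggests but the README does not spell out; `append_result` is a hypothetical helper name.

```python
import json
from datetime import date
from pathlib import Path
from typing import Optional


def append_result(
    result: dict,
    base_dir: str = "memory/eval-results",
    day: Optional[date] = None,
) -> Path:
    """Append one result as a JSON line to <base_dir>/YYYY-MM-DD.jsonl."""
    day = day or date.today()
    out_dir = Path(base_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{day.isoformat()}.jsonl"
    # Append mode keeps earlier evaluations from the same day intact.
    with out_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(result, ensure_ascii=False) + "\n")
    return out_path
```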
Batch Evaluation
For a JSONL dataset file (each line: {"question":..., "answer":..., "contexts":[...]}):
python3 scripts/batch_eval.py --input references/sample_dataset.jsonl --output memory/eval-results/batch-YYYY-MM-DD.json
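Because a batch run calls the judge LLM for every row, it may be worth checking the dataset's shape before starting. The following is a hedged sketch of such a pre-check; `validate_jsonl` is a hypothetical helper and is not part of `batch_eval.py`, whose own validation behavior is not documented here.

```python
import json
from typing import Iterable, List

REQUIRED = ("question", "answer", "contexts")


def validate_jsonl(lines: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means every row looks well-formed."""
    problems = []
    for n, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"line {n}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {n}: expected a JSON object")
            continue
        missing = [key for key in REQUIRED if key not in row]
        if missing:
            problems.append(f"line {n}: missing {', '.join(missing)}")
        elif not (
            isinstance(row["contexts"], list)
            and all(isinstance(c, str) for c in row["contexts"])
        ):
            problems.append(f"line {n}: contexts must be a list of strings")
    return problems
```

Since an open file iterates line by line, `validate_jsonl(open(path, encoding="utf-8"))` works directly on a dataset file.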
Score Interpretation
| Score | Verdict | Meaning |
|---|---|---|
| 0.85+ | ✅ PASS | Production-ready quality |
| 0.70-0.84 | ⚠️ REVIEW | Needs improvement |
| < 0.70 | ❌ FAIL | Significant quality issues |
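The table can be read as a simple threshold function. The sketch below is an interpretation, not the skill's implementation: treating the bands as continuous (so a value such as 0.845 falls into REVIEW) is an assumption, and so is computing the overall score as an unweighted mean of the three metrics, although that does reproduce the 0.86 in the sample output above.

```python
from statistics import mean


def verdict(score: float) -> str:
    """Map an overall score to a verdict using the table's cut-offs."""
    if score >= 0.85:
        return "PASS"
    if score >= 0.70:
        return "REVIEW"
    return "FAIL"


def overall(faithfulness: float, answer_relevancy: float, context_precision: float) -> float:
    """ASSUMPTION: unweighted mean of the three metrics, rounded to 2 places."""
    return round(mean([faithfulness, answer_relevancy, context_precision]), 2)


if __name__ == "__main__":
    score = overall(0.92, 0.87, 0.79)
    print(score, verdict(score))
```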
Faithfulness Deep-Dive
Paste this into your clawhub.json to enable this plugin.
{
"plugins": {
"official-jonathanjing-rag-eval": {
"enabled": true,
"auto_update": true
}
}
}