Official Verified

rag-eval

Evaluate your RAG pipeline quality using Ragas metrics (faithfulness, answer relevancy, context precision).

skill-install — Terminal

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/jonathanjing/rag-eval

Download Source Code (.zip)

RAG Eval — Quality Testing for Your RAG Pipeline

Test and monitor your RAG pipeline's output quality.

🛠️ Installation

1. Ask OpenClaw (Recommended)

Tell OpenClaw: "Install the rag-eval skill." The agent will handle the installation and configuration automatically.

2. Manual Installation (CLI)

If you prefer the terminal, run:

clawhub install rag-eval

⚠️ Prerequisites

Your OpenClaw must have a RAG system (vector DB + retrieval pipeline). This skill evaluates the output quality of that pipeline — it does not provide RAG functionality itself.
At least one LLM API key is required — Ragas uses an LLM as judge internally. Set one of:
- OPENAI_API_KEY (default, uses GPT-4o)
- ANTHROPIC_API_KEY (uses Claude Haiku)
- RAGAS_LLM=ollama/llama3 (for local/offline evaluation)

Setup (first run only)

bash scripts/setup.sh

This installs ragas, datasets, and other dependencies.

Single Response Evaluation

When user asks to evaluate an answer, collect:

question — the original user question
answer — the LLM output to evaluate
contexts — list of text chunks used to generate the answer (retrieved docs)

⚠️ SECURITY: Never interpolate user content directly into shell commands. Write the input to a temp JSON file first, then pipe it to the evaluator:

# Step 1: Write input to a temp file (agent should use the write/edit tool, NOT echo)
# Write this JSON to /tmp/rag-eval-input.json using the file write tool:
# {"question": "...", "answer": "...", "contexts": ["chunk1", "chunk2"]}

# Step 2: Pipe the file to the evaluator
python3 scripts/run_eval.py < /tmp/rag-eval-input.json

# Step 3: Clean up
rm -f /tmp/rag-eval-input.json

Alternatively, use --input-file:

python3 scripts/run_eval.py --input-file /tmp/rag-eval-input.json

Output JSON:

{
  "faithfulness": 0.92,
  "answer_relevancy": 0.87,
  "context_precision": 0.79,
  "overall_score": 0.86,
  "verdict": "PASS",
  "flags": []
}

Post results to user with human-readable summary:

🧪 Eval Results
• Faithfulness: 0.92 ✅ (no hallucination detected)
• Answer Relevancy: 0.87 ✅
• Context Precision: 0.79 ⚠️ (some irrelevant context retrieved)
• Overall: 0.86 — PASS

Save to memory/eval-results/YYYY-MM-DD.jsonl.

Batch Evaluation

For a JSONL dataset file (each line: {"question":..., "answer":..., "contexts":[...]}):

python3 scripts/batch_eval.py --input references/sample_dataset.jsonl --output memory/eval-results/batch-YYYY-MM-DD.json

Score Interpretation

Score	Verdict	Meaning
0.85+	✅ PASS	Production-ready quality
0.70-0.84	⚠️ REVIEW	Needs improvement
< 0.70	❌ FAIL	Significant quality issues

Faithfulness Deep-Dive

Read Full Documentation on GitHub

Metadata

Author@jonathanjing

Stars1947

Updated2026-03-04

View Author Profile

AI Skill Finder

Not sure this is the right skill?

Describe what you want to build — we'll match you to the best skill from 16,000+ options.

Find the right skill

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-jonathanjing-rag-eval": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Safety NoteClawKit audits metadata but not runtime behavior. Use with caution.

Related Skills

glass2claw

Ray-Ban glasses → voice command → WhatsApp → OpenClaw auto-routes your photo into the right database. Hands-free life logging.

jonathanjing 1947

openclaw-dashboard

Real-time operations dashboard for OpenClaw. Monitors sessions, costs, cron jobs, and gateway health. Use when installing the dashboard, starting the server, adding features, updating `api-server.js` routes, or changing `agent-dashboard.html`. Includes language toggle (EN/中文), watchdog 24h uptime bar, and cost analysis.

jonathanjing 1947

skill-trust-auditor

Audit a ClawHub skill for security risks BEFORE installation.

jonathanjing 1947

gateway-watchdog

Monitor OpenClaw gateway health with a watchdog state machine, Discord alerts, cooldown dedupe, and isolated fallback deployment on macOS. Use when users want gateway failure detection, auto-recovery policy, and low-noise Discord incident notifications.

jonathanjing 1947

openclaw-tally

Tokens tell you how much you paid. Tasks tell you what you got. Tally tracks every OpenClaw task from start to finish — cost, complexity, and efficiency score.

jonathanjing 1947