llm-evaluator
LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical traces. Uses GPT-5-nano for cost-efficient judging. Use when evaluating AI quality, building evals, or monitoring output accuracy.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/aiwithabidi/llm-evaluator
What This Skill Does
The llm-evaluator skill is an LLM-as-a-Judge framework for the OpenClaw agent ecosystem. It uses Langfuse as its observability and evaluation layer and the cost-efficient GPT-5-nano model as the judge. Each interaction trace is scored on four dimensions: relevance, accuracy, hallucination, and helpfulness. Scores are standardized to a 0-1 range, so raw agent logs become quality metrics you can track over time. The skill is designed to act as an automated quality assurance gate that keeps standards high without excessive manual review.
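As a rough illustration of what standardized 0-1 scores across the four dimensions can look like in code, here is a minimal sketch. It assumes the judge model replies with a JSON object keyed by dimension name; that reply format, the `DIMENSIONS` tuple, and the `parse_judge_scores` helper are hypothetical and are not taken from the skill's source.

```python
import json

# The four dimensions named in the skill description.
DIMENSIONS = ("relevance", "accuracy", "hallucination", "helpfulness")


def parse_judge_scores(raw_reply: str) -> dict:
    """Parse a judge reply (assumed to be a JSON object) into 0-1 scores.

    Hypothetical helper: the skill's real reply format is not documented here.
    """
    data = json.loads(raw_reply)
    scores = {}
    for name in DIMENSIONS:
        if name not in data:
            raise ValueError(f"judge reply is missing the {name!r} score")
        value = float(data[name])
        # Clamp into the 0-1 range so an out-of-range reply cannot skew averages.
        scores[name] = min(1.0, max(0.0, value))
    return scores


if __name__ == "__main__":
    reply = '{"relevance": 0.9, "accuracy": 0.8, "hallucination": 0.1, "helpfulness": 1.2}'
    print(parse_judge_scores(reply))
```

Clamping rather than rejecting out-of-range values is one possible design choice; a stricter pipeline might raise an error instead so that malformed judge replies are noticed early.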
Installation
To integrate this skill into your environment, use the OpenClaw command-line interface. Run the following command in your terminal:
clawhub install openclaw/skills/skills/aiwithabidi/llm-evaluator
Ensure that your environment variables for Langfuse and the required LLM API keys are correctly configured in your project settings to allow the evaluator to communicate with the trace management dashboard.
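A quick pre-flight check can catch missing configuration before the evaluator runs. The sketch below assumes the standard Langfuse SDK credential variables (`LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`) and uses `OPENAI_API_KEY` as an assumed name for the judge model's key; the variables this skill actually reads are not listed on this page, so adjust the names to match your setup.

```python
import os

# Langfuse SDK credentials, plus an ASSUMED variable name for the judge model's API key.
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "OPENAI_API_KEY")


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing configuration:", ", ".join(missing))
    else:
        print("All required variables are set.")
```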
Use Cases
- Continuous Monitoring: Regularly audit production logs to ensure your agent remains helpful and accurate as you update system prompts or switch base models.
- Regression Testing: During development, run the evaluator against a set of 'gold standard' test cases to ensure new changes do not introduce hallucinations or relevance drops.
- Automated QA: Integrate this into your CI/CD pipeline to flag low-score outputs for manual human review before they reach end-users.
- Historical Analysis: Use the backfill functionality to gain insight into your agent's historical performance after updating your evaluation criteria.
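The Automated QA use case above comes down to comparing scores against thresholds. A minimal sketch follows, assuming scores are already available as a mapping from trace ID to per-dimension values; `flag_for_review` and the threshold values are hypothetical. Hallucination is left out of the example thresholds because this page does not say whether a higher hallucination score is better or worse.

```python
def flag_for_review(scored_traces, minimums):
    """Return IDs of traces where any listed dimension falls below its minimum.

    scored_traces: {trace_id: {dimension: score in 0-1}}
    minimums:      {dimension: lowest acceptable score}
    A dimension missing from a trace is treated as 0.0, so the trace is flagged.
    """
    flagged = []
    for trace_id, scores in scored_traces.items():
        if any(scores.get(dim, 0.0) < floor for dim, floor in minimums.items()):
            flagged.append(trace_id)
    return flagged


if __name__ == "__main__":
    traces = {
        "trace-a": {"accuracy": 0.9, "relevance": 0.8},
        "trace-b": {"accuracy": 0.4, "relevance": 0.9},
    }
    # Illustrative thresholds only; tune them to your own quality bar.
    print(flag_for_review(traces, {"accuracy": 0.7, "relevance": 0.5}))
```

In a CI/CD job, a non-empty result could fail the build or open a review ticket, depending on how strict the gate should be.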
Example Prompts
- "Evaluate the last 50 traces in our production log for hallucinations and give me a summary of the average accuracy score."
- "Run the relevance evaluator on trace ID 8892 and output the detailed reasoning for the score."
- "Backfill the scoring for the past 24 hours of activity to see if our recent model fine-tuning improved helpfulness scores."
Tips & Limitations
- Cost Efficiency: While GPT-5-nano is optimized for cost, running large-scale backfills on thousands of traces will still incur API costs. Use the --limit flag during batch operations to manage expenses.
- Scope: The quality of the evaluation depends on the clarity of your initial system prompts. If the agent's instructions were ambiguous, the evaluator may struggle to determine 'accuracy'.
- Integration: Ensure your Langfuse project is properly initialized, as the skill relies on valid trace IDs stored within that ecosystem to function correctly.
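To make the cost tip concrete, the sketch below mirrors the idea behind capping a batch: select at most a fixed number of traces and estimate the spend up front. It does not reproduce the skill's CLI. `plan_backfill` is a hypothetical helper, and the per-trace cost is a figure you supply from your own billing data.

```python
def plan_backfill(trace_ids, limit, est_cost_per_trace):
    """Pick at most `limit` traces and estimate the spend for judging them.

    est_cost_per_trace is a caller-supplied estimate, not a published price.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    selected = list(trace_ids)[:limit]
    return selected, len(selected) * est_cost_per_trace


if __name__ == "__main__":
    ids = ["t1", "t2", "t3", "t4"]
    selected, cost = plan_backfill(ids, limit=2, est_cost_per_trace=0.5)
    print(selected, cost)
```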
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-aiwithabidi-llm-evaluator": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags (AI)
Flags: external-api, file-read
Related Skills
freshsales
Freshsales CRM integration — manage contacts, leads, deals, accounts, tasks, and sales sequences via the Freshsales API. Track deal pipelines, automate lead assignments, log activities, and generate sales reports. Built for AI agents — Python stdlib only, no dependencies. Use for sales CRM, contact management, deal tracking, pipeline reporting, and sales automation.
gemini-video-analyzer
Native video analysis using Google Gemini API. Upload and analyze video files — describe scenes, extract text/UI, answer questions about content, transcribe speech, identify objects and actions. Use when: (1) User sends a video file and wants it analyzed, (2) Video summarization or description needed, (3) Extracting text, UI elements, or information from screen recordings, (4) Answering questions about video content, (5) Comparing multiple videos, (6) Analyzing tutorials, demos, or walkthroughs.
agent-memory
Full AI agent memory stack — Mem0 unified memory engine with vector search (Qdrant) and knowledge graph (Neo4j), plus SQLite for structured data. Complete setup script and tools. Give your OpenClaw agent a real brain with semantic recall, entity relationships, and structured storage.
neon
Neon serverless Postgres — manage projects, branches, databases, roles, endpoints, and compute via the Neon API. Create database branches for development, manage connection endpoints, scale compute, and monitor usage. Built for AI agents — Python stdlib only, zero dependencies. Use for serverless Postgres, database branching, database management, development workflows, and cloud database automation.
onepassword
1Password Connect — vaults, items, secrets management for server-side applications.