LLM Evaluator
LLM-as-a-Judge evaluation system with Langfuse integration
Why use this skill?
Automate your AI evaluation with the LLM-as-a-Judge system for OpenClaw. Score relevance, accuracy, and hallucinations via Langfuse.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/aiwithabidi/agxntsix-llm-evaluator
What This Skill Does
The LLM Evaluator is a professional-grade LLM-as-a-Judge system that integrates with Langfuse. It automates quality assurance for AI agent outputs, so you do not have to inspect every interaction by hand. Using the cost-efficient GPT-5-nano model as the judge, it scores each output on four core dimensions: relevance, accuracy, hallucination, and helpfulness. It parses complex trace logs to quantify performance, helping developers identify model drift, logical inconsistencies, or problematic hallucinations in real time. Whether you are running a single test case or backfilling historical data to optimize your agent's performance, this skill serves as a consistent arbiter of your AI's output quality.
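To make the four-dimension scoring concrete, here is a minimal Python sketch of how a judge reply could be validated before it is recorded. It is not the skill's actual implementation: the JSON reply shape, the 0-to-1 score range, and the function name parse_judge_reply are all assumptions made for illustration.

```python
import json

# The four dimensions named in the skill description.
DIMENSIONS = ("relevance", "accuracy", "hallucination", "helpfulness")


def parse_judge_reply(reply_text: str) -> dict[str, float]:
    """Parse a judge reply into validated scores.

    Assumption: the judge answers with a JSON object holding one
    number in [0, 1] per dimension. The real skill may use another format.
    """
    raw = json.loads(reply_text)
    scores = {}
    for name in DIMENSIONS:
        if name not in raw:
            raise ValueError(f"judge reply is missing '{name}'")
        value = float(raw[name])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
        scores[name] = value
    return scores


# Hypothetical reply text, used only to exercise the parser.
example_reply = (
    '{"relevance": 0.9, "accuracy": 0.8, '
    '"hallucination": 0.1, "helpfulness": 0.85}'
)
print(parse_judge_reply(example_reply))
```

Rejecting incomplete or out-of-range replies early keeps malformed judge output from being stored as if it were a valid score.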
Installation
To integrate this skill into your OpenClaw environment, ensure your system meets the Python 3.10+ requirement and has the necessary dependencies (langfuse and requests) installed. Run the following command in your terminal:
clawhub install openclaw/skills/skills/aiwithabidi/agxntsix-llm-evaluator
After installation, you must configure your environment variables: OPENROUTER_API_KEY for the judge model, along with LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST. Proper environment configuration is essential for the evaluator to bridge your OpenClaw agent data with your Langfuse dashboard.
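A quick pre-flight check can confirm that the four variables above are set before any evaluation runs. The sketch below uses only the Python standard library; the helper name missing_env_vars is hypothetical and not part of the skill.

```python
import os

# Variable names taken from the installation notes above.
REQUIRED_VARS = (
    "OPENROUTER_API_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
)


def missing_env_vars(env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary as env makes the check easy to exercise in tests without touching the real environment.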
Use Cases
- Continuous Integration (CI): Run automated evaluations on test suites during development to catch performance regressions.
- Production Monitoring: Backfill historical traces during off-peak hours to audit agent behavior over time.
- Fine-tuning Preparation: Aggregate accuracy scores to identify weak points in agent knowledge, creating a dataset for future fine-tuning.
- Quality Assurance: Ensure consistent response standards across customer-facing AI agents by setting threshold scores for deployment.
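The CI and quality-assurance use cases both come down to comparing aggregated scores against minimum values. The following sketch shows one way such a gate might look. The threshold values, the per-trace dictionary format, and the function name failing_metrics are illustrative assumptions, not the skill's own output format. Hallucination is left out because the listing does not say whether a higher or lower score is better.

```python
from statistics import mean

# Hypothetical minimum averages; tune these for your project.
THRESHOLDS = {"relevance": 0.8, "accuracy": 0.8, "helpfulness": 0.7}


def failing_metrics(
    results: list[dict[str, float]], thresholds: dict[str, float]
) -> dict[str, float]:
    """Return each metric whose average score is below its threshold."""
    failures = {}
    for metric, minimum in thresholds.items():
        values = [r[metric] for r in results if metric in r]
        if not values:
            continue  # no data for this metric, so nothing to judge
        average = mean(values)
        if average < minimum:
            failures[metric] = average
    return failures


# Made-up scores for two traces, used only to demonstrate the gate.
sample = [
    {"relevance": 0.9, "accuracy": 0.7, "helpfulness": 0.8},
    {"relevance": 0.85, "accuracy": 0.75, "helpfulness": 0.9},
]
failures = failing_metrics(sample, THRESHOLDS)
print(failures or "all metrics passed")
```

In a CI job, a non-empty result could be turned into a non-zero exit code so that a regression blocks the deployment.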
Example Prompts
- "OpenClaw, run the LLM evaluator on the last 20 traces in my project and report the average helpfulness score."
- "Please backfill the evaluation scores for trace ID 'trace_12345' using the hallucination and accuracy benchmarks."
- "Evaluate the latest agent response against the relevance metric and alert me if the score falls below 0.8."
Tips & Limitations
- Cost Efficiency: While GPT-5-nano is optimized for cost, frequent evaluation of high-volume logs will consume API credits; use the --limit flag for targeted analysis.
- Metric Tuning: Not all metrics apply to every task. Use specific flags to skip unnecessary checks, such as disabling hallucination checks for creative writing tasks.
- Data Privacy: Because this skill interacts with your Langfuse instance, ensure that PII (Personally Identifiable Information) is redacted in your traces before sending them to the judge model.
- Network Dependency: Reliable connectivity to the Langfuse host is required for successful scoring. If the API is unreachable, the evaluation will queue until the connection is restored.
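For the data privacy tip, a simple redaction pass over trace text might look like the sketch below. The two regular expressions and the placeholder tokens are assumptions chosen for illustration. They catch only common email and phone formats and are no substitute for a dedicated PII detection tool.

```python
import re

# Deliberately simple patterns; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.com or +1 (555) 010-7788 for access."))
```

Running a pass like this before traces reach the judge model limits what leaves your own systems.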
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-aiwithabidi-agxntsix-llm-evaluator": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags: AI
Flags: network-access, external-api