Official | Verified | developer tools | Safety 4/5

LLM Evaluator

LLM-as-a-Judge evaluation system with Langfuse integration

Why use this skill?

Automate your AI evaluation with the LLM-as-a-Judge system for OpenClaw. Score relevance, accuracy, and hallucinations via Langfuse.

What This Skill Does

The LLM Evaluator is an LLM-as-a-Judge system that integrates with Langfuse. It automates quality assurance for AI agent outputs, so you do not have to inspect every interaction by hand. Using the cost-efficient GPT-5-nano model as the judge, it scores each output on four dimensions: relevance, accuracy, hallucination, and helpfulness. It parses trace logs and turns them into scores, which helps developers spot model drift, logical inconsistencies, and hallucinations in real time. You can run it on a single test case or backfill scores for historical traces to track and improve your agent's performance.
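As a rough illustration of the four-dimension scoring described above, the sketch below validates a judge reply. It assumes the judge answers with a JSON object holding one score between 0 and 1 per dimension; that reply format, the score range, and the name parse_judge_reply are assumptions made for this example, not the skill's documented interface.

```python
import json

# The four dimensions named in the skill description.
DIMENSIONS = ("relevance", "accuracy", "hallucination", "helpfulness")


def parse_judge_reply(reply: str) -> dict[str, float]:
    """Parse a judge reply into per-dimension scores.

    Assumption: the reply is a JSON object with one 0-1 score per dimension.
    Raises ValueError if a dimension is missing or a score is out of range.
    """
    data = json.loads(reply)
    scores = {}
    for name in DIMENSIONS:
        if name not in data:
            raise ValueError(f"missing dimension: {name}")
        value = float(data[name])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of range: {value}")
        scores[name] = value
    return scores
```

Validating the reply before storing it keeps malformed judge output from silently skewing averages later on.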

Installation

This skill requires Python 3.10+ and the langfuse and requests packages. To install it into your OpenClaw environment, run the following command in your terminal:

clawhub install openclaw/skills/skills/aiwithabidi/agxntsix-llm-evaluator

After installation, set the following environment variables: OPENROUTER_API_KEY for the judge model, plus LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST for Langfuse. Without all four, the evaluator cannot connect your OpenClaw agent data to your Langfuse dashboard.
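A quick way to confirm the configuration before running an evaluation is to check that all four variables are set. The helper below is a minimal sketch; the variable names come from the text above, while the function name missing_env_vars is hypothetical and not part of the skill.

```python
import os

# Environment variables listed in the installation notes.
REQUIRED_VARS = (
    "OPENROUTER_API_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; pass a plain dict to test the check.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_env_vars() with no arguments inspects the current process environment; an empty list means the configuration is complete.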

Use Cases

Continuous Integration (CI): Run automated evaluations on test suites during development to catch performance regressions.
Production Monitoring: Backfill historical traces during off-peak hours to audit agent behavior over time.
Fine-tuning Preparation: Aggregate accuracy scores to identify weak points in agent knowledge, creating a dataset for future fine-tuning.
Quality Assurance: Ensure consistent response standards across customer-facing AI agents by setting threshold scores for deployment.
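The CI and Quality Assurance use cases above both come down to comparing averaged scores against minimum thresholds. The following sketch shows one way to do that, assuming per-trace scores are available as dictionaries and that a higher score is better for every metric; the function names and data shapes are illustrative, not the skill's API.

```python
from statistics import mean


def average_scores(runs):
    """Average each metric over a list of per-trace score dicts."""
    collected = {}
    for run in runs:
        for metric, value in run.items():
            collected.setdefault(metric, []).append(value)
    return {metric: mean(values) for metric, values in collected.items()}


def failed_thresholds(averages, thresholds):
    """Return {metric: average} for each metric whose average is below its minimum.

    A threshold for a metric that was never scored raises KeyError,
    so a missing metric cannot pass the gate silently.
    """
    return {
        metric: averages[metric]
        for metric, minimum in thresholds.items()
        if averages[metric] < minimum
    }
```

In a CI job, a non-empty result from failed_thresholds would be the signal to fail the build or block a deployment.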

Example Prompts

"OpenClaw, run the LLM evaluator on the last 20 traces in my project and report the average helpfulness score."
"Please backfill the evaluation scores for trace ID 'trace_12345' using the hallucination and accuracy benchmarks."
"Evaluate the latest agent response against the relevance metric and alert me if the score falls below 0.8."

Tips & Limitations

Cost Efficiency: While GPT-5-nano is optimized for cost, frequent evaluation of high-volume logs will consume API credits; use the --limit flag for targeted analysis.
Metric Tuning: Not all metrics apply to every task. Use specific flags to skip unnecessary checks, such as disabling hallucination checks for creative writing tasks.
Data Privacy: Because this skill sends trace content from your Langfuse instance to an external judge model, redact PII (Personally Identifiable Information) from your traces before they are evaluated.
Network Dependency: Reliable connectivity to the Langfuse host is required for successful scoring. If the API is unreachable, the evaluation will queue until the connection is restored.
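For the Data Privacy tip above, a simple pre-processing step can mask obvious identifiers before trace text leaves your environment. This is a minimal sketch that covers only email addresses and phone-like numbers with deliberately simple patterns; the redact function is hypothetical, and real PII redaction usually needs a broader, audited approach.

```python
import re

# Deliberately simple patterns; they will not catch every format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

Applying redact to trace inputs and outputs before they reach the judge model keeps the placeholders visible in scores and logs, which makes it easier to see what was masked.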

Metadata

Author: @aiwithabidi

Stars: 1601

Updated: 2026-02-27

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-aiwithabidi-agxntsix-llm-evaluator": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags(AI)

#llm-evaluation #ai-observability #quality-assurance #langfuse #automated-testing

Safety Score: 4/5

Flags: network-access, external-api
