Official, Verified, developer tools, Safety 4/5

llm-evaluator

LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace scoring, batch backfill, and test mode. Integrates with Langfuse dashboard for observability. Triggers: evaluate trace, score quality, check accuracy, backfill scores, test evaluator, LLM judge.

What This Skill Does

The llm-evaluator is an LLM-as-a-Judge system that automates quality assurance of AI agent outputs within the OpenClaw ecosystem. Using GPT-5-nano as the judge model via Langfuse, it scores agent traces on four metrics: relevance, accuracy, hallucination, and helpfulness. It supports single trace scoring, batch backfill, and a test mode. All evaluation results are stored in Langfuse and appear in the Langfuse dashboard, which gives you the observability to track quality whether you are running a production deployment or iterating on agent prompts.
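The judging step described above can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: the prompt wording, the 0.0 to 1.0 scale, the JSON reply format, and the function names build_judge_prompt and parse_judge_reply are all assumptions. Only the four metric names come from the skill description. A canned reply stands in for the judge model's response.

```python
import json

# The four metrics named in the skill description.
METRICS = ("relevance", "accuracy", "hallucination", "helpfulness")


def build_judge_prompt(user_input: str, agent_output: str) -> str:
    """Assemble a prompt asking the judge for one score per metric.

    The wording and the 0.0-1.0 scale are assumptions for illustration.
    """
    metric_list = ", ".join(METRICS)
    return (
        "You are grading an AI agent's answer.\n"
        f"Rate each of these metrics from 0.0 to 1.0: {metric_list}.\n"
        "Reply with a JSON object whose keys are the metric names.\n\n"
        f"User input:\n{user_input}\n\n"
        f"Agent output:\n{agent_output}\n"
    )


def parse_judge_reply(reply: str) -> dict[str, float]:
    """Check that every metric is present and clamp scores to [0, 1]."""
    data = json.loads(reply)
    scores = {}
    for metric in METRICS:
        if metric not in data:
            raise ValueError(f"judge reply is missing metric: {metric}")
        scores[metric] = min(1.0, max(0.0, float(data[metric])))
    return scores


prompt = build_judge_prompt("What is the refund window?", "Refunds are accepted within 30 days.")

# Canned reply standing in for a real judge-model response (hypothetical values).
example_reply = '{"relevance": 0.9, "accuracy": 0.8, "hallucination": 0.1, "helpfulness": 1.2}'
scores = parse_judge_reply(example_reply)
print(scores)
```

Validating the reply before storing it keeps a malformed or out-of-range judge response from being recorded as a score.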

Installation

To install the llm-evaluator, use the OpenClaw command-line interface (the clawhub command). Make sure your OpenClaw credentials are configured, then run:

clawhub install openclaw/skills/skills/aiwithabidi/llm-evaluator-pro

Once installed, run the test suite included in the script directory to confirm that the skill can connect to your Langfuse project.

Use Cases

Common use cases for developers and teams building AI agents include:

  • Post-Deployment Auditing: Automatically score traces from the last 24 hours to ensure that user queries are being handled correctly.
  • A/B Testing: Evaluate different prompt versions against the same query set to determine which yields higher accuracy.
  • Hallucination Mitigation: Automatically flag or score outputs that contain non-factual information so that bad responses can be identified and, where scoring runs before delivery, intercepted before they reach the end user.
  • Continuous Integration: Use the batch backfill feature to monitor the performance of your agent over time as you update your base model or system instructions.
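The A/B testing use case above comes down to comparing average scores per prompt version over the same query set. The sketch below assumes the accuracy scores have already been collected. The function name compare_versions, the version labels, and the score values are made up for illustration and are not part of the skill.

```python
from statistics import mean


def compare_versions(scores_by_version: dict[str, list[float]]) -> tuple[str, dict[str, float]]:
    """Return the version with the highest mean score, plus all means."""
    averages = {
        version: mean(scores)
        for version, scores in scores_by_version.items()
        if scores  # skip versions that have no scored traces yet
    }
    if not averages:
        raise ValueError("no scored traces to compare")
    best = max(averages, key=averages.get)
    return best, averages


# Hypothetical accuracy scores for the same three queries under two prompts.
accuracy_scores = {
    "prompt_v1": [0.5, 0.75, 1.0],
    "prompt_v2": [0.75, 1.0, 1.0],
}
best, averages = compare_versions(accuracy_scores)
print(best, averages)
```

With only a handful of queries the difference between means can be noise, so a larger query set gives a more trustworthy comparison.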

Example Prompts

  1. "OpenClaw, please run the llm-evaluator to score trace ID 98765 for accuracy and hallucination."
  2. "I need to perform a quality check on our system; backfill scores for the last 50 unscored traces using the evaluator tool."
  3. "Evaluate the helpfulness of the last interaction with the customer support agent and save the metrics to Langfuse."
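The second prompt asks for a backfill of the last 50 unscored traces. The selection logic behind such a request might look like the sketch below. The Trace dataclass, its field names, and select_for_backfill are hypothetical stand-ins, since the page does not show how the skill reads traces from Langfuse.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Trace:
    """Hypothetical stand-in for a trace record; field names are assumptions."""
    trace_id: str
    timestamp: datetime
    scores: dict[str, float] = field(default_factory=dict)


def select_for_backfill(
    traces: list[Trace],
    limit: int = 50,
    max_age: Optional[timedelta] = None,
    now: Optional[datetime] = None,
) -> list[Trace]:
    """Pick the most recent unscored traces, optionally within a time window."""
    now = now or datetime.now(timezone.utc)
    candidates = [
        t for t in traces
        if not t.scores and (max_age is None or now - t.timestamp <= max_age)
    ]
    candidates.sort(key=lambda t: t.timestamp, reverse=True)  # newest first
    return candidates[:limit]


now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
traces = [
    Trace("a", now - timedelta(hours=1)),
    Trace("b", now - timedelta(hours=2), scores={"accuracy": 0.9}),  # already scored
    Trace("c", now - timedelta(hours=30)),  # outside a 24-hour window
    Trace("d", now - timedelta(hours=3)),
]
selected = select_for_backfill(traces, limit=50, max_age=timedelta(hours=24), now=now)
print([t.trace_id for t in selected])
```

Passing max_age=timedelta(hours=24) covers the post-deployment audit case of scoring only the last 24 hours.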

Tips & Limitations

  • Cost Efficiency: GPT-5-nano keeps per-trace judging costs low, which suits high-volume scoring. Still, monitor token consumption in Langfuse when backfilling large datasets.
  • Granularity: While you can score all metrics at once, scoring individual metrics (e.g., just 'relevance') can be faster when you are iterating on specific performance improvements.
  • Trace Context: Ensure your traces contain sufficient context for the judge model to perform an accurate assessment; if the input prompt or retrieved context is missing, the judge may be unable to verify factual accuracy.
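One way to act on the last tip is to check a trace for the fields the judge needs before spending a judge call on it. This is a sketch under stated assumptions: the trace is represented as a plain dict, and the keys input, output, and retrieved_context are hypothetical names rather than the skill's actual schema.

```python
def missing_context(trace: dict, required: tuple[str, ...] = ("input", "output")) -> list[str]:
    """Return the required fields that are absent or blank in the trace."""
    return [
        key for key in required
        if not str(trace.get(key) or "").strip()
    ]


# Hypothetical trace whose retrieved context was never recorded.
trace = {
    "input": "What is the refund window?",
    "output": "Refunds are accepted within 30 days.",
    "retrieved_context": "",
}

gaps = missing_context(trace, required=("input", "output", "retrieved_context"))
if gaps:
    print(f"skipping accuracy check, missing: {gaps}")
```

A trace with gaps can still be scored on metrics that do not depend on the missing field, which ties in with the tip about scoring individual metrics.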

Metadata

Stars: 4473
Views: 7
Updated: 2026-05-01

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-aiwithabidi-llm-evaluator-pro": {
      "enabled": true,
      "auto_update": true
    }
  }
}
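If clawhub.json already lists other plugins, the entry above has to be merged in rather than pasted over the file. A minimal sketch of that merge follows. The helper enable_plugin and the existing entry some-other-plugin are hypothetical; only the plugin ID and its two settings come from the snippet above.

```python
import json

PLUGIN_ID = "official-aiwithabidi-llm-evaluator-pro"


def enable_plugin(config: dict, plugin_id: str = PLUGIN_ID) -> dict:
    """Return a copy of the config with the plugin entry added or replaced."""
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))  # copy so the input is untouched
    plugins[plugin_id] = {"enabled": True, "auto_update": True}
    merged["plugins"] = plugins
    return merged


# Hypothetical existing configuration with one unrelated plugin.
existing = json.loads('{"plugins": {"some-other-plugin": {"enabled": true}}}')
updated = enable_plugin(existing)
print(json.dumps(updated, indent=2))
```

To apply this to a real file, load it with json.load, pass the result through the helper, and write it back with json.dump.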

Tags (AI)

#evaluation #observability #quality-assurance #llm-judge #debugging
Safety Score: 4/5

Flags: external-api, code-execution

Related Skills

freshsales

Freshsales CRM integration — manage contacts, leads, deals, accounts, tasks, and sales sequences via the Freshsales API. Track deal pipelines, automate lead assignments, log activities, and generate sales reports. Built for AI agents — Python stdlib only, no dependencies. Use for sales CRM, contact management, deal tracking, pipeline reporting, and sales automation.

aiwithabidi 4473

gemini-video-analyzer

Native video analysis using Google Gemini API. Upload and analyze video files — describe scenes, extract text/UI, answer questions about content, transcribe speech, identify objects and actions. Use when: (1) User sends a video file and wants it analyzed, (2) Video summarization or description needed, (3) Extracting text, UI elements, or information from screen recordings, (4) Answering questions about video content, (5) Comparing multiple videos, (6) Analyzing tutorials, demos, or walkthroughs.

aiwithabidi 4473

agent-memory

Full AI agent memory stack — Mem0 unified memory engine with vector search (Qdrant) and knowledge graph (Neo4j), plus SQLite for structured data. Complete setup script and tools. Give your OpenClaw agent a real brain with semantic recall, entity relationships, and structured storage.

aiwithabidi 4473

neon

Neon serverless Postgres — manage projects, branches, databases, roles, endpoints, and compute via the Neon API. Create database branches for development, manage connection endpoints, scale compute, and monitor usage. Built for AI agents — Python stdlib only, zero dependencies. Use for serverless Postgres, database branching, database management, development workflows, and cloud database automation.

aiwithabidi 4473

onepassword

1Password Connect — vaults, items, secrets management for server-side applications.

aiwithabidi 4473