Official Verified developer tools Safety 4/5

llm-judge

Use when comparing two or more code implementations against a spec or requirements doc. Triggers on "which repo is better", "compare these implementations", "evaluate both solutions", "rank these codebases", or "judge which approach wins". Also covers choosing between competing PRs or vendor submissions solving the same problem. Does NOT review a single codebase for quality — use code review skills instead. Does NOT evaluate strategy docs — use strategy-review. Requires a spec file and 2+ repo paths.

What This Skill Does

The llm-judge skill provides an automated framework for comparing code implementations across repositories using an LLM-as-judge methodology. The process runs in two phases: Fact Gathering and Judging. In the first phase, dedicated agents analyze each codebase against the provided specification and record key architectural and functional details. In the second phase, specialized judge agents score those findings on five core dimensions: functionality, security, test quality, overengineering, and dead code. Weighted rubrics combine the per-dimension scores into a final score for each implementation, giving developers a consistent, data-driven basis for choosing between competing code patterns or library implementations.
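
The weighted aggregation step can be pictured with a minimal Python sketch. This is not the skill's actual implementation: the weights, the 0-10 score scale, and the function names below are all assumptions chosen for illustration. Only the five dimension names come from the description above.

```python
# Hypothetical weights for illustration only; the skill's real rubric
# weights are not documented on this page.
WEIGHTS = {
    "functionality": 0.30,
    "security": 0.25,
    "test_quality": 0.20,
    "overengineering": 0.15,
    "dead_code": 0.10,
}


def weighted_score(scores, weights=WEIGHTS):
    """Combine per-dimension scores (assumed 0-10) into a weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


def rank(repos):
    """Return (repo name, rounded score) pairs, best score first."""
    scored = [(name, round(weighted_score(s), 2)) for name, s in repos.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up judge outputs for two hypothetical repos.
    results = rank({
        "auth-v1": {"functionality": 8, "security": 6, "test_quality": 7,
                    "overengineering": 9, "dead_code": 10},
        "auth-v2": {"functionality": 9, "security": 8, "test_quality": 6,
                    "overengineering": 7, "dead_code": 8},
    })
    for name, score in results:
        print(name, score)
```

Dividing by the total weight keeps the result on the same scale as the inputs even if the weights do not sum to exactly 1.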

Installation

You can install this skill directly via the OpenClaw CLI using the following command:

clawhub install openclaw/skills/skills/anderskev/llm-judge

Ensure that your environment has sufficient permissions to read repository files and write to the .beagle directory.
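
A quick way to check these permissions before running the skill is a small preflight script. The sketch below uses only the Python standard library. It assumes the .beagle directory lives in the current working directory, which this page does not confirm, and the `preflight` helper is a hypothetical name, not part of the skill.

```python
import os
from pathlib import Path


def preflight(repo_paths, workdir="."):
    """Return a list of problems found; an empty list means the checks passed.

    Assumption: .beagle is created under `workdir`. Adjust if your setup
    places it elsewhere.
    """
    problems = []
    for repo in repo_paths:
        if not os.path.isdir(repo):
            problems.append(f"{repo}: not a directory")
        elif not os.access(repo, os.R_OK | os.X_OK):
            problems.append(f"{repo}: not readable")

    beagle = Path(workdir) / ".beagle"
    # If .beagle already exists it must be writable; otherwise the parent
    # directory must allow creating it.
    target = beagle if beagle.exists() else beagle.parent
    if not os.access(target, os.W_OK | os.X_OK):
        problems.append(f"{beagle}: not writable")
    return problems


if __name__ == "__main__":
    for problem in preflight(["./projects/auth-v1", "./projects/auth-v2"]):
        print("preflight:", problem)
```

Note that `os.access` reports what the current user is allowed to do at the time of the call, so treat the result as a hint rather than a guarantee.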

Use Cases

  • Library Selection: Compare multiple open-source implementations to determine which is most secure or best engineered for your needs.
  • Code Review Standardization: Use the judge as a baseline for peer reviews to ensure consistent evaluation metrics across a team.
  • Refactoring Strategy: Evaluate legacy code against a proposed modern replacement to quantify improvements in test coverage and complexity reduction.
  • Benchmarking: Score how different team branches handle specific functional requirements against a shared technical specification.

Example Prompts

  1. "/beagle:llm-judge compare ./projects/auth-v1 and ./projects/auth-v2 using the spec at ./specs/auth-reqs.md"
  2. "/beagle:llm-judge evaluate repository-a, repository-b, and repository-c against the requirements in ./documentation/api-standard.md"
  3. "/beagle:llm-judge rank implementations for the new data pipeline using /specs/pipeline-design.md"
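
If you generate prompts like these from a script, the skill's stated requirement (a spec file and two or more repo paths) can be enforced before anything is sent. The following Python sketch is illustrative only: `build_prompt` is a hypothetical helper, and the wording it produces is modeled on the first example above, not on a documented command grammar.

```python
def build_prompt(spec, repos):
    """Compose an llm-judge prompt, rejecting inputs the skill cannot use."""
    if not spec:
        raise ValueError("a spec path is required")
    # Drop duplicate paths while preserving order.
    unique = list(dict.fromkeys(repos))
    if len(unique) < 2:
        raise ValueError("llm-judge needs at least two distinct repo paths")
    if len(unique) == 2:
        targets = f"{unique[0]} and {unique[1]}"
    else:
        targets = ", ".join(unique[:-1]) + f", and {unique[-1]}"
    return f"/beagle:llm-judge compare {targets} using the spec at {spec}"


if __name__ == "__main__":
    print(build_prompt("./specs/auth-reqs.md",
                       ["./projects/auth-v1", "./projects/auth-v2"]))
```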

Tips & Limitations

  • Precision: The quality of the evaluation is highly dependent on the clarity and detail of your specification document. Be as explicit as possible.
  • Compute Costs: Because this skill spawns multiple agents in parallel, ensure you have sufficient token limits configured for your LLM provider.
  • Human-in-the-loop: Always treat the generated report as a supportive tool for decision-making rather than the final authority. Review the 'Verdict' section carefully to understand the rationale behind specific score deductions.

Metadata

Author: @anderskev
Stars: 4473
Views: 0
Updated: 2026-05-01

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-anderskev-llm-judge": {
      "enabled": true,
      "auto_update": true
    }
  }
}
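
If clawhub.json already contains other entries, merging the block in by script avoids overwriting them. The Python sketch below assumes the file is plain JSON with a top-level "plugins" object, as in the snippet above. The file location and the `enable_plugin` helper are assumptions for illustration, not part of any ClawHub tooling.

```python
import json
from pathlib import Path

# Plugin key taken from the configuration snippet on this page.
PLUGIN_KEY = "official-anderskev-llm-judge"


def enable_plugin(config_path, key=PLUGIN_KEY):
    """Merge the plugin entry into an existing config, keeping other keys."""
    path = Path(config_path)
    if path.exists():
        config = json.loads(path.read_text(encoding="utf-8"))
    else:
        config = {}
    plugins = config.setdefault("plugins", {})
    entry = plugins.setdefault(key, {})
    entry.update({"enabled": True, "auto_update": True})
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    # Assumed location: clawhub.json in the current directory.
    enable_plugin("clawhub.json")
```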

Tags (AI)

#code-analysis #benchmarking #evaluation #software-engineering #automation
Safety Score: 4/5

Flags: file-read, file-write