Official Verified developer tools Safety 4/5

llm-judge

Use when comparing two or more code implementations against a spec or requirements doc. Triggers on "which repo is better", "compare these implementations", "evaluate both solutions", "rank these codebases", or "judge which approach wins". Also covers choosing between competing PRs or vendor submissions solving the same problem. Does NOT review a single codebase for quality — use code review skills instead. Does NOT evaluate strategy docs — use strategy-review. Requires a spec file and 2+ repo paths.

What This Skill Does

The llm-judge skill compares two or more code implementations, each in its own repository, against a shared specification. It uses an LLM-as-judge methodology and runs in two phases: Fact Gathering and Judging. In the first phase, dedicated agents analyze each codebase against the provided specification and record its key architectural and functional details. In the second phase, judge agents evaluate those findings across five dimensions: functionality, security, test quality, overengineering, and dead code. Weighted rubrics then aggregate the per-dimension results into a final score, giving developers a consistent, documented basis for choosing between competing code patterns or library implementations.
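The weighted aggregation step can be pictured with a minimal Python sketch. Only the five dimension names come from the description above; the weights, the 1-to-5 scale, and the function names `aggregate_score` and `rank` are illustrative assumptions, not values or APIs taken from the skill.

```python
# Hypothetical weights: the skill's actual rubric weights are not stated here.
WEIGHTS = {
    "functionality": 0.30,
    "security": 0.25,
    "test_quality": 0.20,
    "overengineering": 0.15,
    "dead_code": 0.10,
}


def aggregate_score(scores: dict[str, float]) -> float:
    """Return the weighted average of per-dimension scores (assumed 1-5 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted / total_weight


def rank(repos: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Sort repositories by aggregate score, best first."""
    totals = {repo: aggregate_score(s) for repo, s in repos.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Dividing by the sum of the weights keeps the result on the same scale as the inputs even if the weights are later changed and no longer sum to 1.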

Installation

You can install this skill directly via the OpenClaw CLI using the following command:

clawhub install openclaw/skills/skills/anderskev/llm-judge

Ensure that your environment has sufficient permissions to read repository files and write to the .beagle directory.
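A quick preflight check along these lines can catch permission problems before a run. This is a sketch under assumptions: it treats .beagle as a directory relative to the current working directory, which this page does not confirm, and `check_access` is a hypothetical helper, not part of the skill or the CLI.

```python
import os
from pathlib import Path


def check_access(repo_paths: list[str], output_dir: str = ".beagle") -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for repo in repo_paths:
        path = Path(repo)
        if not path.is_dir():
            problems.append(f"{repo}: not a directory")
        elif not os.access(path, os.R_OK | os.X_OK):
            problems.append(f"{repo}: not readable")

    out = Path(output_dir)
    # If the output directory does not exist yet, its parent must be
    # writable so that the directory can be created.
    target = out if out.exists() else out.resolve().parent
    if not os.access(target, os.W_OK):
        problems.append(f"{output_dir}: not writable")
    return problems
```

For example, `check_access(["./projects/auth-v1", "./projects/auth-v2"])` returns an empty list when both repositories are readable and the output location is writable.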

Use Cases

Library Selection: Compare multiple open-source implementations to determine which is most secure or best engineered for your needs.
Code Review Standardization: Use the judge as a baseline for peer reviews to ensure consistent evaluation metrics across a team.
Refactoring Strategy: Evaluate legacy code against a proposed modern replacement to quantify improvements in test coverage and complexity reduction.
Benchmarking: Score how different team branches handle specific functional requirements against a shared technical specification.

Example Prompts

"/beagle:llm-judge compare ./projects/auth-v1 and ./projects/auth-v2 using the spec at ./specs/auth-reqs.md"
"/beagle:llm-judge evaluate repository-a, repository-b, and repository-c against the requirements in ./documentation/api-standard.md"
"/beagle:llm-judge rank implementations for the new data pipeline using /specs/pipeline-design.md"

Tips & Limitations

Precision: The quality of the evaluation depends heavily on how clear and detailed your specification document is. State requirements as explicitly as possible.
Compute Costs: Because this skill spawns multiple agents in parallel, make sure the token limits configured for your LLM provider are high enough to cover a full run.
Human-in-the-loop: Treat the generated report as an aid to decision-making, not the final authority. Read the 'Verdict' section carefully to understand the rationale behind each score deduction.

Metadata

Author: @anderskev

Stars: 4473

Updated: 2026-05-01

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-anderskev-llm-judge": {
      "enabled": true,
      "auto_update": true
    }
  }
}
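If clawhub.json already lists other plugins, pasting the snippet over the file would drop them. The sketch below merges the entry instead. It assumes the file is plain JSON with a top-level "plugins" object, as the snippet above suggests; `enable_plugin` is a hypothetical helper, not a clawhub command.

```python
import json
from pathlib import Path


def enable_plugin(config_path: str, plugin_id: str) -> dict:
    """Add or update one plugin entry in a JSON config, keeping other entries."""
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from scratch.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    plugins = config.setdefault("plugins", {})
    plugins[plugin_id] = {"enabled": True, "auto_update": True}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `enable_plugin("clawhub.json", "official-anderskev-llm-judge")` writes the same entry shown above while leaving any other keys in the file untouched.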

Tags (AI)

#code-analysis #benchmarking #evaluation #software-engineering #automation

Safety Score: 4/5

Flags: file-read, file-write

Related Skills

tutorial-docs

Tutorial patterns for documentation - learning-oriented guides that teach through guided doing

anderskev 4473

fetch-pr-feedback

Fetch review comments from a PR and evaluate with receive-feedback skill

anderskev 4473

swift-testing-code-review

Reviews Swift Testing code for proper use of

anderskev 4473

rust-testing-code-review

Reviews Rust test code for unit test patterns, integration test structure, async testing, mocking approaches, and property-based testing. Covers Rust 2024 edition changes including async fn in traits for mocks,

anderskev 4473

explanation-docs

Explanation documentation patterns for understanding-oriented content - conceptual guides that explain why things work the way they do

anderskev 4473