llm-judge
Use when comparing two or more code implementations against a spec or requirements doc. Triggers on "which repo is better", "compare these implementations", "evaluate both solutions", "rank these codebases", or "judge which approach wins". Also covers choosing between competing PRs or vendor submissions solving the same problem. Does NOT review a single codebase for quality — use code review skills instead. Does NOT evaluate strategy docs — use strategy-review. Requires a spec file and 2+ repo paths.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/anderskev/llm-judge
What This Skill Does
The llm-judge skill provides an automated framework for comparing two or more code implementations, each in its own repository, against a shared specification. It uses an LLM-as-judge methodology so that every implementation is scored against the same rubrics. The process has two phases: Fact Gathering and Judging. In the first phase, dedicated agents analyze each codebase against the provided specification and record key architectural and functional details. In the second phase, specialized judge agents evaluate these findings across five dimensions: functionality, security, test quality, overengineering, and dead code. Weighted rubrics combine the per-dimension results into a final score for each implementation, giving developers a documented basis for choosing between competing implementations.
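The weighted aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the weight values, the score scale, and the names `WEIGHTS`, `weighted_score`, and `rank` are all assumptions made for the example; only the five dimension names come from the description.

```python
# Hypothetical weights (assumption): the skill's real rubric weights are not documented here.
WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "test_quality": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def weighted_score(scores: dict[str, float], weights: dict[str, int] = WEIGHTS) -> float:
    """Combine per-dimension scores into one weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


def rank(repos: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (repo name, weighted score) pairs, best first."""
    scored = [(name, weighted_score(scores)) for name, scores in repos.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative scores only; real values would come from the judge agents.
    results = rank({
        "auth-v1": {"functionality": 4, "security": 2, "test_quality": 3,
                    "overengineering": 4, "dead_code": 5},
        "auth-v2": {"functionality": 4, "security": 4, "test_quality": 4,
                    "overengineering": 3, "dead_code": 4},
    })
    for name, score in results:
        print(f"{name}: {score:.2f}")
```

Dividing by the sum of the weights keeps the final score on the same scale as the per-dimension scores, so the weights can be adjusted without rescaling anything else.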
Installation
You can install this skill directly via the OpenClaw CLI using the following command:
clawhub install openclaw/skills/skills/anderskev/llm-judge
Ensure that your environment has sufficient permissions to read repository files and write to the .beagle directory.
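Before running a comparison, a quick preflight check along these lines can catch permission problems early. It is only a sketch: the function name `preflight` is hypothetical, and it assumes the .beagle directory lives under the directory you run the skill from, which this page does not confirm.

```python
import os
from pathlib import Path


def preflight(repo_paths: list[str], spec_path: str, work_dir: str = ".") -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []

    # The skill requires a spec file and at least two repo paths.
    if len(repo_paths) < 2:
        problems.append("need at least two repo paths")

    spec = Path(spec_path)
    if not spec.is_file() or not os.access(spec, os.R_OK):
        problems.append(f"spec file not readable: {spec}")

    for repo in repo_paths:
        path = Path(repo)
        # Listing a directory needs both read and execute permission.
        if not path.is_dir() or not os.access(path, os.R_OK | os.X_OK):
            problems.append(f"repo directory not readable: {path}")

    # Assumption: .beagle sits under work_dir. If it does not exist yet,
    # check that its parent directory is writable so it can be created.
    beagle = Path(work_dir) / ".beagle"
    target = beagle if beagle.exists() else beagle.parent
    if not os.access(target, os.W_OK | os.X_OK):
        problems.append(f"cannot write to: {beagle}")

    return problems
```

For example, `preflight(["./projects/auth-v1", "./projects/auth-v2"], "./specs/auth-reqs.md")` returns an empty list when both repositories and the spec are readable and the output directory is writable.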
Use Cases
- Library Selection: Compare multiple open-source implementations to determine which is most secure or best engineered for your needs.
- Code Review Standardization: Use the judge as a baseline for peer reviews to ensure consistent evaluation metrics across a team.
- Refactoring Strategy: Evaluate legacy code against a proposed modern replacement to quantify improvements in test coverage and complexity reduction.
- Benchmarking: Score how different team branches handle specific functional requirements against a shared technical specification.
Example Prompts
- "/beagle:llm-judge compare ./projects/auth-v1 and ./projects/auth-v2 using the spec at ./specs/auth-reqs.md"
- "/beagle:llm-judge evaluate repository-a, repository-b, and repository-c against the requirements in ./documentation/api-standard.md"
- "/beagle:llm-judge rank implementations for the new data pipeline using /specs/pipeline-design.md"
Tips & Limitations
- Precision: The quality of the evaluation is highly dependent on the clarity and detail of your specification document. Be as explicit as possible.
- Compute Costs: Because this skill spawns multiple agents in parallel, ensure you have sufficient token limits configured for your LLM provider.
- Human-in-the-loop: Always treat the generated report as a supportive tool for decision-making rather than the final authority. Review the 'Verdict' section carefully to understand the rationale behind specific score deductions.
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-anderskev-llm-judge": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags(AI)
Flags: file-read, file-write
Related Skills
tutorial-docs
Tutorial patterns for documentation - learning-oriented guides that teach through guided doing
fetch-pr-feedback
Fetch review comments from a PR and evaluate with receive-feedback skill
swift-testing-code-review
Reviews Swift Testing code for proper use of
rust-testing-code-review
Reviews Rust test code for unit test patterns, integration test structure, async testing, mocking approaches, and property-based testing. Covers Rust 2024 edition changes including async fn in traits for mocks,
explanation-docs
Explanation documentation patterns for understanding-oriented content - conceptual guides that explain why things work the way they do