Official Verified | developer tools | Safety 4/5

agentic-eval

Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:

  • Implementing self-critique and reflection loops
  • Building evaluator-optimizer pipelines for quality-critical generation
  • Creating test-driven code refinement workflows
  • Designing rubric-based or LLM-as-judge evaluation systems
  • Adding iterative improvement to agent outputs (code, reports, analysis)
  • Measuring and improving agent response quality

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/boleyn/agentic-eval

What This Skill Does

The agentic-eval skill provides a framework for implementing iterative self-improvement loops within AI agents. Instead of single-shot generation, the agent works in cycles of 'Generate, Evaluate, Critique, and Refine': it produces a draft, assesses the draft against stated criteria, and revises it based on the critique. In these multi-step workflows the agent acts as both the creator and the critic of its own work, with the aim of improving the quality, accuracy, and reliability of complex outputs such as technical documentation, software code, and structured data analysis. The skill covers several design patterns, including the basic reflection loop, the decoupled evaluator-optimizer architecture, and test-driven refinement for code.
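The basic reflection loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the skill's actual API: `reflection_loop`, `stub_generate`, and `stub_critique` are hypothetical names, and the two stubs stand in for real model calls so the example runs on its own.

```python
from typing import Callable, Optional, Tuple


def reflection_loop(
    task: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    critique: Callable[[str, str], Optional[str]],
    max_iterations: int = 3,
) -> Tuple[str, int]:
    """Generate a draft, critique it, and refine until the critic is satisfied.

    `generate(task, previous_draft, feedback)` and `critique(task, draft)` are
    hypothetical callables; `critique` returns None when it has no complaints.
    Returns the final draft and the number of critique rounds that were run.
    """
    draft = generate(task, None, None)
    for round_number in range(1, max_iterations + 1):
        feedback = critique(task, draft)
        if feedback is None:
            return draft, round_number
        draft = generate(task, draft, feedback)
    return draft, max_iterations


# Stubs standing in for model calls (assumptions for illustration only).
def stub_generate(task, previous, feedback):
    if previous is None:
        return "def add(a, b): return a + b"
    return previous + "  # handles: " + feedback


def stub_critique(task, draft):
    return None if "handles" in draft else "non-numeric input"


final_draft, rounds = reflection_loop("write add()", stub_generate, stub_critique)
print(rounds, final_draft)
```

Passing the generator and critic in as callables keeps the loop itself independent of any particular model or provider, and the `max_iterations` cap bounds the number of calls even when the critic is never satisfied.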

Installation

To integrate this skill into your OpenClaw environment, run the following command in your terminal:

clawhub install openclaw/skills/skills/boleyn/agentic-eval

Ensure you have the latest version of the OpenClaw CLI configured before attempting installation.

Use Cases

  • Automated Code Review: Generate complex functions and automatically refine them based on failed unit tests or linting errors.
  • Quality-Critical Content Creation: Draft technical reports or whitepapers, using a secondary critique loop to ensure strict adherence to style guides and factual accuracy.
  • Constraint-Satisfying Tasks: Solve logic puzzles or data formatting tasks where every output must conform to a predefined JSON schema or to specific rubric constraints.
  • Iterative Research Assistance: Perform multi-step synthesis of information where the agent evaluates its own findings against provided sources to minimize hallucinations.
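The first use case, refining generated code against failing tests, might look roughly like the sketch below. Everything here is an assumption for illustration: `run_tests`, `refine_until_green`, and `stub_codegen` are hypothetical names, the stub replaces a real model call, and `exec` should only ever be applied to generated code inside a sandbox.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected return value)


def run_tests(source: str, func_name: str, cases: List[TestCase]) -> List[str]:
    """Load `func_name` from `source` and return a list of failure messages."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # run only inside a sandbox for untrusted code
        func = namespace[func_name]
    except Exception as exc:
        return [f"could not load {func_name}: {exc!r}"]
    failures = []
    for args, expected in cases:
        try:
            actual = func(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
            continue
        if actual != expected:
            failures.append(f"{func_name}{args} returned {actual!r}, expected {expected!r}")
    return failures


def refine_until_green(
    generate_code: Callable[[List[str]], str],
    func_name: str,
    cases: List[TestCase],
    max_iterations: int = 3,
) -> Tuple[str, int, List[str]]:
    """Regenerate code from the previous failures until the tests pass."""
    source, failures = "", []
    for attempt in range(1, max_iterations + 1):
        source = generate_code(failures)
        failures = run_tests(source, func_name, cases)
        if not failures:
            return source, attempt, []
    return source, max_iterations, failures


# Stub generator: returns a buggy version first, a fixed one after feedback.
def stub_codegen(failures):
    if not failures:
        return "def safe_div(a, b):\n    return a / b\n"
    return "def safe_div(a, b):\n    return a / b if b else None\n"


CASES = [((6, 3), 2.0), ((1, 0), None)]
source, attempts, remaining = refine_until_green(stub_codegen, "safe_div", CASES)
print(attempts, remaining)
```

The failure messages double as the critique: because they come from executing the code, they are deterministic and do not depend on the model's opinion of its own output.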

Example Prompts

  1. "Use agentic-eval to write a Python script that scrapes the provided website, then run a self-critique loop to ensure the code handles edge-case errors like 404s and timeouts."
  2. "Draft a summary of the quarterly financial report using the reflection pattern. Evaluate the summary for clarity and tone, iterating until it meets a professional executive standard."
  3. "Refine the current system architecture document. Use the evaluator-optimizer pattern to check the draft against our internal documentation rubrics and improve sections that score below 0.8."
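The third prompt, improving only the sections that score below 0.8, corresponds to an evaluator-optimizer loop with a rubric threshold. The sketch below shows one way to structure it. `improve_sections` and both stubs are hypothetical, and the stub evaluator's "contains a rationale" rule is a toy stand-in for a real rubric or LLM judge.

```python
from typing import Callable, Dict, Tuple


def improve_sections(
    sections: Dict[str, str],
    evaluate: Callable[[str, str], float],
    optimize: Callable[[str, str, float], str],
    threshold: float = 0.8,
    max_iterations: int = 3,
) -> Tuple[Dict[str, str], Dict[str, float]]:
    """Rewrite only the sections whose rubric score is below `threshold`."""
    sections = dict(sections)  # work on a copy; leave the caller's draft intact
    scores = {name: evaluate(name, text) for name, text in sections.items()}
    for _ in range(max_iterations):
        weak = [name for name, score in scores.items() if score < threshold]
        if not weak:
            break
        for name in weak:
            sections[name] = optimize(name, sections[name], scores[name])
            scores[name] = evaluate(name, sections[name])
    return sections, scores


# Toy stand-ins for the evaluator and optimizer (assumptions, not a real rubric).
def stub_evaluate(name, text):
    return 0.9 if "Rationale:" in text else 0.5


def stub_optimize(name, text, score):
    return text + " Rationale: (added during revision)"


draft = {
    "overview": "Rationale: queues decouple services.",
    "storage": "Uses Postgres.",
}
revised, scores = improve_sections(draft, stub_evaluate, stub_optimize)
print(scores)
```

Keeping the evaluator and optimizer as separate callables is what makes the architecture "decoupled": either side can be swapped for a different model or a deterministic checker without touching the loop.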

Tips & Limitations

  • Structured Output: Always prioritize JSON when using evaluators. It is significantly more reliable for downstream programmatic processing than free-form text.
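  • Cost Considerations: Each iteration adds at least one generation call and one evaluation call, so token usage grows with the number of iterations, and faster than linearly if every round carries the full history of earlier drafts and critiques. Always set a 'max_iterations' limit to prevent runaway costs.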
  • Prompt Engineering: The effectiveness of this skill depends heavily on the quality of your criteria. Ambiguous criteria lead to poor feedback loops; be as specific as possible in your rubrics.
  • Deterministic Evaluation: Where possible, combine LLM-as-judge with deterministic checks (such as static code analyzers or unit tests) to reduce the bias that comes from a model grading its own output.
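Two of these tips, structured evaluator output and deterministic checks, can be combined in a single verdict function. The following is a sketch under assumptions: the `{"score": ..., "issues": [...]}` reply shape and the function names are hypothetical, and a syntax check via Python's standard `ast` module stands in for whatever static analysis a real pipeline would use.

```python
import ast
import json


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply; a malformed reply counts as a failing score."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
        issues = [str(item) for item in data.get("issues", [])]
    except (ValueError, KeyError, TypeError, AttributeError):
        return {"score": 0.0, "issues": ["evaluator reply was not in the expected JSON shape"]}
    return {"score": max(0.0, min(1.0, score)), "issues": issues}


def syntax_check(code: str) -> list:
    """Deterministic check: does the candidate parse as Python at all?"""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]
    return []


def combined_verdict(code: str, judge_reply: str, threshold: float = 0.8) -> dict:
    """Pass only if the deterministic check is clean AND the judge score is high enough."""
    verdict = parse_verdict(judge_reply)
    hard_failures = syntax_check(code)
    return {
        "passed": not hard_failures and verdict["score"] >= threshold,
        "score": verdict["score"],
        "issues": hard_failures + verdict["issues"],
    }


good = combined_verdict("x = 1", '{"score": 0.9, "issues": []}')
broken = combined_verdict("def f(:\n    pass", '{"score": 0.95, "issues": []}')
print(good["passed"], broken["passed"])
```

In this arrangement the deterministic check acts as a veto: a generous judge score cannot rescue code that does not even parse, and an unparseable judge reply is treated as a failure instead of being silently ignored.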

Metadata

Author: @boleyn
Stars: 4190
Views: 0
Updated: 2026-04-18

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-boleyn-agentic-eval": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags (AI)

#self-reflection #quality-assurance #iterative-refinement #ai-agent #prompt-engineering
Safety Score: 4/5

Flags: code-execution, external-api