Official Verified developer tools Safety 4/5

agentic-eval

Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:
- Implementing self-critique and reflection loops
- Building evaluator-optimizer pipelines for quality-critical generation
- Creating test-driven code refinement workflows
- Designing rubric-based or LLM-as-judge evaluation systems
- Adding iterative improvement to agent outputs (code, reports, analysis)
- Measuring and improving agent response quality

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/boleyn/agentic-eval

What This Skill Does

The agentic-eval skill provides a framework for implementing iterative self-improvement loops within AI agents. Instead of single-shot generation, the agent cycles through 'Generate, Evaluate, Critique, and Refine,' acting as both the creator and the critic of its own work. The goal is higher quality, accuracy, and reliability for complex outputs such as technical documentation, software code, and structured data analysis. The skill covers established design patterns, including the basic reflection loop, the decoupled evaluator-optimizer architecture, and test-driven refinement for code.
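
The 'Generate, Evaluate, Critique, and Refine' cycle can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: `reflection_loop` and the three stand-in callables are hypothetical names, and in a real agent each callable would wrap a model call.

```python
from typing import Callable, Tuple


def reflection_loop(
    task: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    refine: Callable[[str, str], str],
    max_iterations: int = 3,
) -> Tuple[str, int]:
    """Return the final draft and the number of evaluation rounds used."""
    draft = generate(task)
    rounds = 0
    for rounds in range(1, max_iterations + 1):
        passed, critique = evaluate(draft)
        if passed:
            break
        # Feed the critique back into the next draft.
        draft = refine(draft, critique)
    return draft, rounds


# Stand-in callables (assumptions for illustration only).
def generate(task: str) -> str:
    return f"{task}: first draft. TODO cover edge cases."


def evaluate(draft: str) -> Tuple[bool, str]:
    if "TODO" in draft:
        return False, "Replace the TODO with the missing content."
    return True, ""


def refine(draft: str, critique: str) -> str:
    return draft.replace("TODO cover edge cases.", "Edge cases are covered.")


final_draft, rounds = reflection_loop("Report", generate, evaluate, refine)
```

Note that if the budget runs out, the last refined draft is returned without a final evaluation; a stricter variant could evaluate once more before returning.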

Installation

To integrate this skill into your OpenClaw environment, run the following command in your terminal:

clawhub install openclaw/skills/skills/boleyn/agentic-eval

Make sure the latest version of the OpenClaw CLI is configured before installing.

Use Cases

Automated Code Review: Generate complex functions and automatically refine them based on failed unit tests or linting errors.
Quality-Critical Content Creation: Draft technical reports or whitepapers, using a secondary critique loop to ensure strict adherence to style guides and factual accuracy.
Constraint-Satisfying Tasks: Solve logic puzzles or data formatting tasks where every output must strictly conform to a predefined JSON schema or to specific rubric constraints.
Iterative Research Assistance: Perform multi-step synthesis of information where the agent evaluates its own findings against provided sources to minimize hallucinations.
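
As an illustration of the first use case, the following sketch refines a function until its unit tests pass. All names here (`run_tests`, `refine_until_green`, `propose`, the two `safe_divide` versions) are hypothetical; `propose` stands in for a model call that would receive the failure messages and return a revised implementation.

```python
from typing import Callable, List, Tuple

Case = Tuple[tuple, object]  # (positional args, expected return value)


def run_tests(func: Callable, cases: List[Case]) -> List[str]:
    """Run each case and return human-readable failure messages."""
    failures = []
    for args, expected in cases:
        try:
            actual = func(*args)
        except Exception as exc:
            failures.append(f"{func.__name__}{args} raised {type(exc).__name__}")
            continue
        if actual != expected:
            failures.append(
                f"{func.__name__}{args} returned {actual!r}, expected {expected!r}"
            )
    return failures


def refine_until_green(propose, cases: List[Case], max_iterations: int = 3):
    """Ask for candidates until one passes every case or the budget is spent."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    failures: List[str] = []
    for attempt in range(1, max_iterations + 1):
        candidate = propose(failures)  # failures from the previous attempt
        failures = run_tests(candidate, cases)
        if not failures:
            break
    return candidate, attempt, failures


# Stand-ins for two successive model outputs (assumptions for illustration).
def safe_divide_v1(a, b):
    return a / b


def safe_divide_v2(a, b):
    return a / b if b != 0 else None


def propose(failures: List[str]) -> Callable:
    return safe_divide_v2 if failures else safe_divide_v1


CASES: List[Case] = [((6, 3), 2.0), ((1, 0), None)]
best, attempts, remaining = refine_until_green(propose, CASES)
```

Because the tests are deterministic, the loop's stopping condition does not depend on the model's own opinion of its code.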

Example Prompts

"Use agentic-eval to write a Python script that scrapes the provided website, then run a self-critique loop to ensure the code handles edge-case errors like 404s and timeouts."
"Draft a summary of the quarterly financial report using the reflection pattern. Evaluate the summary for clarity and tone, iterating until it meets a professional executive standard."
"Refine the current system architecture document. Use the evaluator-optimizer pattern to check the draft against our internal documentation rubrics and improve sections that score below 0.8."

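The third prompt, which improves only the sections scoring below 0.8, could be sketched as follows. The rubric terms, `score_section`, and `improve_section` are toy stand-ins chosen so the example runs without a model; a real evaluator would be an LLM judge or a documentation linter, and a real optimizer would rewrite the section rather than append a note.

```python
from typing import Dict, Tuple

# Hypothetical rubric: topics every section is expected to mention.
RUBRIC_TERMS = ("purpose", "interfaces", "failure modes")


def score_section(text: str) -> float:
    """Toy evaluator: fraction of rubric terms mentioned in the section."""
    lowered = text.lower()
    return sum(term in lowered for term in RUBRIC_TERMS) / len(RUBRIC_TERMS)


def improve_section(text: str) -> str:
    """Toy optimizer: append a note naming the rubric terms that were missing."""
    lowered = text.lower()
    missing = [term for term in RUBRIC_TERMS if term not in lowered]
    return f"{text} [Added coverage of: {', '.join(missing)}]"


def optimize(
    sections: Dict[str, str], threshold: float = 0.8, max_iterations: int = 3
) -> Tuple[Dict[str, str], Dict[str, float]]:
    """Rewrite only the sections that score below the threshold."""
    revised = dict(sections)  # leave the caller's draft untouched
    for _ in range(max_iterations):
        scores = {name: score_section(text) for name, text in revised.items()}
        weak = [name for name, score in scores.items() if score < threshold]
        if not weak:
            break
        for name in weak:
            revised[name] = improve_section(revised[name])
    final_scores = {name: score_section(text) for name, text in revised.items()}
    return revised, final_scores


draft = {
    "Overview": "Purpose, interfaces, and failure modes are described.",
    "Storage": "Uses a database.",
}
revised, final_scores = optimize(draft)
```

Keeping the evaluator and optimizer as separate functions is what makes the pattern "decoupled": either side can be swapped out without touching the loop.
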
Tips & Limitations

Structured Output: Always prioritize JSON when using evaluators. It is significantly more reliable for downstream programmatic processing than free-form text.
Cost Considerations: Be aware that iterative loops increase token usage with every iteration: at least linearly, and faster when each pass re-sends the accumulated drafts and critiques. Always set a 'max_iterations' limit to prevent runaway costs.
Prompt Engineering: The effectiveness of this skill depends heavily on the quality of your criteria. Ambiguous criteria lead to poor feedback loops; be as specific as possible in your rubrics.
Deterministic Evaluation: Where possible, combine LLM-as-judge with deterministic checks (like static code analyzers or unit tests) to reduce bias in the self-evaluation process.
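
Two of these tips, structured JSON output and deterministic checks, can be combined in one small sketch. The verdict fields (`passed`, `score`, `feedback`) are an assumed format, not one defined by the skill, and `parse_verdict`, `syntax_check`, and `combine` are hypothetical helper names. Only the standard-library `json` and `ast` modules are used.

```python
import ast
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    score: float
    feedback: str


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON reply; treat malformed output as a failure."""
    try:
        data = json.loads(raw)
        return Verdict(
            passed=data["passed"] is True,
            score=float(data["score"]),
            feedback=str(data["feedback"]),
        )
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        return Verdict(False, 0.0, f"Unparseable evaluator output: {exc}")


def syntax_check(code: str) -> Verdict:
    """Deterministic check: does the candidate parse as Python at all?"""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return Verdict(False, 0.0, f"SyntaxError: {exc.msg}")
    return Verdict(True, 1.0, "")


def combine(deterministic: Verdict, judge: Verdict) -> Verdict:
    """A deterministic failure overrides whatever the judge said."""
    return judge if deterministic.passed else deterministic


judge_reply = '{"passed": true, "score": 0.9, "feedback": "Clear and complete."}'
good = combine(syntax_check("x = 1"), parse_verdict(judge_reply))
bad = combine(syntax_check("def f(:"), parse_verdict(judge_reply))
```

Letting the deterministic check take precedence means a favorable judge score can never mask code that does not even parse.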

Metadata

Author: @boleyn

Stars: 4190

Updated: 2026-04-18

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-boleyn-agentic-eval": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags (AI)

#self-reflection #quality-assurance #iterative-refinement #ai-agent #prompt-engineering

Safety Score: 4/5

Flags: code-execution, external-api

Related Skills

xiaolongxia-assistant

OpenClaw plugin development assistant that outputs runnable plugin skeletons, installation commands, and debugging steps.

boleyn 4190

Ocms Ai Prompt Generator

Skill by boleyn

boleyn 4190

ai-prompt-engineering-safety-review

Comprehensive AI prompt engineering safety review and improvement prompt. Analyzes prompts for safety, bias, security vulnerabilities, and effectiveness while providing detailed improvement recommendations with extensive frameworks, testing methodologies, and educational content.

boleyn 4190

ab-test-setup

When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking.

boleyn 4190