agentic-eval
Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:
- Implementing self-critique and reflection loops
- Building evaluator-optimizer pipelines for quality-critical generation
- Creating test-driven code refinement workflows
- Designing rubric-based or LLM-as-judge evaluation systems
- Adding iterative improvement to agent outputs (code, reports, analysis)
- Measuring and improving agent response quality
What This Skill Does
The agentic-eval skill provides a framework for implementing iterative self-improvement loops within AI agents. It replaces single-shot generation with cycles of 'Generate, Evaluate, Critique, and Refine.' With this skill, developers can build multi-step workflows in which an agent acts as both the creator and the critic of its own work, which can improve the quality, accuracy, and reliability of complex outputs such as technical documentation, software code, and structured data analysis. The skill covers several design patterns, including the basic reflection loop, the decoupled evaluator-optimizer architecture, and test-driven refinement for code.
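The 'Generate, Evaluate, Critique, and Refine' cycle can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: `reflection_loop`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the two toy functions stand in for real model calls so the example runs offline.

```python
from typing import Callable, Tuple


def reflection_loop(
    task: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], Tuple[bool, str]],
    max_iterations: int = 3,
) -> Tuple[str, int]:
    """Generate a draft, critique it, and refine until it passes or the budget runs out."""
    feedback = ""
    draft = ""
    for iteration in range(1, max_iterations + 1):
        draft = generate(task, feedback)          # the first pass sees empty feedback
        passed, feedback = evaluate(task, draft)  # the critique feeds the next round
        if passed:
            return draft, iteration
    return draft, max_iterations                  # best effort after the last round


# Hypothetical stand-ins for model calls.
def toy_generate(task: str, feedback: str) -> str:
    draft = f"Summary of {task}"
    if "cite" in feedback:
        draft += " [source: report.pdf]"
    return draft


def toy_evaluate(task: str, draft: str) -> Tuple[bool, str]:
    if "[source:" in draft:
        return True, ""
    return False, "Please cite a source."


final_draft, rounds = reflection_loop("Q3 revenue", toy_generate, toy_evaluate)
print(rounds, final_draft)
```

Passing the generator and evaluator in as callables keeps the loop itself independent of any particular model or API, and the `max_iterations` cap bounds the number of calls.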
Installation
To integrate this skill into your OpenClaw environment, execute the following command in your terminal:
clawhub install openclaw/skills/skills/boleyn/agentic-eval
Ensure you have the latest version of the OpenClaw CLI configured before attempting installation.
Use Cases
- Automated Code Review: Generate complex functions and automatically refine them based on failed unit tests or linting errors.
- Quality-Critical Content Creation: Draft technical reports or whitepapers, using a secondary critique loop to ensure strict adherence to style guides and factual accuracy.
- Constraint-Satisfying Tasks: Solve logic puzzles or data formatting tasks where every output must strictly meet predefined JSON schema or specific rubric constraints.
- Iterative Research Assistance: Perform multi-step synthesis of information where the agent evaluates its own findings against provided sources to minimize hallucinations.
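The code-review use case above amounts to a test-driven refinement loop: run the candidate against tests, then hand the failure messages back to the generator. The sketch below assumes a hypothetical `generate` callable in place of a model call; `run_tests` and `refine_until_green` are illustrative names, not part of the skill's API.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected result)


def run_tests(source: str, func_name: str, tests: List[TestCase]) -> List[str]:
    """Execute candidate source and return one message per failing test."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOTE: only run trusted or sandboxed code this way
    except Exception as exc:
        return [f"code failed to load: {exc!r}"]
    func = namespace.get(func_name)
    if not callable(func):
        return [f"function {func_name!r} is not defined"]
    failures = []
    for args, expected in tests:
        try:
            actual = func(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
            continue
        if actual != expected:
            failures.append(f"{func_name}{args} returned {actual!r}, expected {expected!r}")
    return failures


def refine_until_green(
    generate: Callable[[List[str]], str],
    func_name: str,
    tests: List[TestCase],
    max_iterations: int = 3,
) -> Tuple[str, List[str]]:
    """Regenerate code from the latest failures until all tests pass or the budget is spent."""
    failures: List[str] = []
    source = ""
    for _ in range(max_iterations):
        source = generate(failures)
        failures = run_tests(source, func_name, tests)
        if not failures:
            break
    return source, failures


# Hypothetical generator: returns a fixed version once it sees a ZeroDivisionError report.
def toy_generate(failures: List[str]) -> str:
    if any("ZeroDivisionError" in f for f in failures):
        return "def safe_div(a, b):\n    return a / b if b else None\n"
    return "def safe_div(a, b):\n    return a / b\n"


tests = [((6, 3), 2.0), ((1, 0), None)]
source, failures = refine_until_green(toy_generate, "safe_div", tests)
print(failures)
```

Because the tests are deterministic, the pass/fail signal does not depend on a model's judgment; only the repair step does.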
Example Prompts
- "Use agentic-eval to write a Python script that scrapes the provided website, then run a self-critique loop to ensure the code handles edge-case errors like 404s and timeouts."
- "Draft a summary of the quarterly financial report using the reflection pattern. Evaluate the summary for clarity and tone, iterating until it meets a professional executive standard."
- "Refine the current system architecture document. Use the evaluator-optimizer pattern to check the draft against our internal documentation rubrics and improve sections that score below 0.8."
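The last prompt describes an evaluator-optimizer pass in which only sections scoring below 0.8 are rewritten. One way this might look, assuming hypothetical `score` and `rewrite` callables (here replaced by toy functions based on word count, purely for illustration):

```python
from typing import Callable, Dict

THRESHOLD = 0.8  # the cut-off used in the example prompt


def optimize_sections(
    sections: Dict[str, str],
    score: Callable[[str, str], float],
    rewrite: Callable[[str, str, float], str],
    max_iterations: int = 3,
) -> Dict[str, str]:
    """Re-score all sections each round and rewrite only those below THRESHOLD."""
    result = dict(sections)  # work on a copy so the input is left untouched
    for _ in range(max_iterations):
        scores = {name: score(name, text) for name, text in result.items()}
        weak = [name for name, s in scores.items() if s < THRESHOLD]
        if not weak:
            break
        for name in weak:
            result[name] = rewrite(name, result[name], scores[name])
    return result


# Toy rubric (an assumption): longer sections score higher, capped at 1.0.
def toy_score(name: str, text: str) -> float:
    return min(1.0, len(text.split()) / 10)


# Toy optimizer (an assumption): appends detail instead of calling a model.
def toy_rewrite(name: str, text: str, current_score: float) -> str:
    return text + " It lists owners, interfaces, and failure modes."


sections = {
    "Overview": "The gateway routes requests to services and enforces quotas per tenant.",
    "Storage": "Uses Postgres.",
}
improved = optimize_sections(sections, toy_score, toy_rewrite)
print(improved["Storage"])
```

Keeping the evaluator and the optimizer as separate callables is what makes the pattern "decoupled": either side can be swapped (a different rubric, a different model) without touching the loop.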
Tips & Limitations
- Structured Output: Always prioritize JSON when using evaluators. It is significantly more reliable for downstream programmatic processing than free-form text.
- Cost Considerations: Iterative loops increase token usage with every iteration. Each round adds at least one generation call and one evaluation call, and the cost grows faster still if earlier drafts and critiques are carried forward in the prompt. Always set a 'max_iterations' limit to prevent runaway costs.
- Prompt Engineering: The efficacy of this skill depends heavily on the quality of your criteria. Ambiguous criteria lead to poor feedback loops; be as specific as possible in your rubrics.
- Deterministic Evaluation: Where possible, combine LLM-as-judge with deterministic checks (like static code analyzers or unit tests) to minimize bias in the self-evaluation process.
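Two of the tips above, structured output and deterministic checks, can be combined in a single acceptance gate. The sketch below is an assumption-laden illustration: the verdict fields `score` and `feedback` are hypothetical, the 0.8 default threshold is borrowed from the example prompt, and Python's built-in `ast.parse` stands in for a static analyzer.

```python
import ast
import json
from typing import Optional


def parse_verdict(raw: str) -> Optional[dict]:
    """Return the judge's verdict as a dict, or None if it is not the expected JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 0.0 <= score <= 1.0:
        return None
    return {"score": float(score), "feedback": str(data.get("feedback", ""))}


def syntax_ok(code: str) -> bool:
    """Deterministic check: does the candidate parse as Python?"""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def accept(code: str, raw_verdict: str, threshold: float = 0.8) -> bool:
    """Accept only if the deterministic check passes AND the judge's score clears the threshold."""
    verdict = parse_verdict(raw_verdict)
    return syntax_ok(code) and verdict is not None and verdict["score"] >= threshold


print(accept("x = 1", '{"score": 0.9, "feedback": "ok"}'))
print(accept("x = ", '{"score": 0.9, "feedback": "ok"}'))
print(accept("x = 1", "Looks good to me!"))
```

Treating an unparseable verdict as a rejection, rather than guessing at the judge's intent, keeps free-form replies from slipping through the gate.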
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-boleyn-agentic-eval": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags (AI)
Flags: code-execution, external-api
Related Skills
xiaolongxia-assistant
OpenClaw plugin development assistant that outputs runnable plugin skeletons, installation commands, and debugging steps.
Ocms Ai Prompt Generator
Skill by boleyn
ai-prompt-engineering-safety-review
Comprehensive AI prompt engineering safety review and improvement prompt. Analyzes prompts for safety, bias, security vulnerabilities, and effectiveness while providing detailed improvement recommendations with extensive frameworks, testing methodologies, and educational content.
ab-test-setup
When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking.