subagent-testing

Test skills via RED/GREEN/REFACTOR TDD with fresh subagents

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/athola/nm-abstract-subagent-testing

Night Market Skill — ported from claude-night-market/abstract. For the full experience with agents, hooks, and commands, install the Claude Code plugin.

Subagent Testing - TDD for Skills

Test skills with fresh subagent instances to prevent priming bias and validate effectiveness.

Table of Contents

Overview
Why Fresh Instances Matter
Testing Methodology
Quick Start
Detailed Testing Guide
Success Criteria

Overview

Fresh instances prevent priming: each test runs in a new Claude conversation, so the measured effect comes from the skill itself and not from conversation history.

Why Fresh Instances Matter

The Priming Problem

Running tests in the same conversation creates bias:

Prior context influences responses
Skill effects get mixed with conversation history
The skill's true impact cannot be isolated

Fresh Instance Benefits

Isolation: Each test starts clean
Reproducibility: Consistent baseline state
Measurement: Clear before/after comparison
Validation: Proves skill effectiveness, not priming
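
A minimal sketch of the isolation idea, contrasting one shared conversation with fresh instances. Everything here is hypothetical: `fake_model`, `Conversation`, and `ask_fresh` are stand-ins for illustration, not part of the skill or of any Claude API. The stand-in model only reports how much prior context it can see.

```python
from dataclasses import dataclass, field


def fake_model(messages: list[str]) -> str:
    """Hypothetical stand-in for a model call: reports visible prior context."""
    return f"saw {len(messages) - 1} prior message(s)"


@dataclass
class Conversation:
    """Hypothetical conversation holder; history accumulates across prompts."""
    history: list[str] = field(default_factory=list)

    def ask(self, prompt: str) -> str:
        self.history.append(prompt)
        reply = fake_model(self.history)
        self.history.append(reply)
        return reply


def ask_fresh(prompt: str) -> str:
    """Run one prompt in a brand-new conversation so no history leaks in."""
    return Conversation().ask(prompt)


# Shared conversation: the second test sees the first test's exchange.
shared = Conversation()
shared_replies = [shared.ask(p) for p in ["test A", "test B"]]

# Fresh instances: every test starts from an empty history.
fresh_replies = [ask_fresh(p) for p in ["test A", "test B"]]
```

In the shared run the second prompt arrives with earlier messages attached, which is the priming effect described above. In the fresh runs each prompt arrives alone.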

Testing Methodology

Three-phase TDD-style approach:

Phase 1: Baseline Testing (RED)

Test without the skill to establish baseline behavior.

Phase 2: With-Skill Testing (GREEN)

Test with the skill loaded to measure improvement over the baseline.

Phase 3: Rationalization Testing (REFACTOR)

Test the skill's anti-rationalization guardrails.
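
The three phases can be sketched as a small test plan. This is one possible layout, not the skill's own format: `Phase`, `SkillTestCase`, and `build_plan` are hypothetical names, and "pressure prompts" reflects an assumed reading of rationalization testing (prompts that try to talk the agent out of following the skill), which the source does not define.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    RED = "baseline, skill not loaded"
    GREEN = "skill loaded, same prompts as baseline"
    REFACTOR = "skill loaded, rationalization pressure"


@dataclass(frozen=True)
class SkillTestCase:
    """One test, intended to run in its own fresh subagent (hypothetical type)."""
    phase: Phase
    prompt: str
    skill_loaded: bool


def build_plan(prompts: list[str], pressure_prompts: list[str]) -> list[SkillTestCase]:
    """Build RED and GREEN cases from identical prompts, then REFACTOR cases."""
    plan = [SkillTestCase(Phase.RED, p, skill_loaded=False) for p in prompts]
    plan += [SkillTestCase(Phase.GREEN, p, skill_loaded=True) for p in prompts]
    plan += [SkillTestCase(Phase.REFACTOR, p, skill_loaded=True) for p in pressure_prompts]
    return plan


plan = build_plan(
    prompts=["Review this function", "Plan a migration"],
    pressure_prompts=["We are short on time, skip the checklist"],
)
```

Keeping RED and GREEN prompts identical means that loading the skill is the only deliberate difference between the two phases.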

Quick Start

# 1. Create baseline tests (without skill)
# Use 5 diverse scenarios
# Document full responses

# 2. Create with-skill tests (fresh instances)
# Load skill explicitly
# Use identical prompts
# Compare to baseline

# 3. Create rationalization tests
# Test anti-rationalization patterns
# Verify guardrails work
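
Steps 1 and 2 above could be driven by a loop like the following. This is a sketch under assumptions: the `Runner` signature and `stub_runner` are hypothetical, and a real runner would need to launch a fresh subagent for every call. The stub only echoes its input so that the example runs on its own.

```python
from typing import Callable, Optional

# Hypothetical runner signature: (prompt, skill_text or None) -> full response.
# Each call is assumed to start a fresh subagent with no shared history.
Runner = Callable[[str, Optional[str]], str]


def run_comparison(prompts: list[str], skill_text: str, run: Runner) -> list[dict]:
    """Run each prompt twice, without and with the skill, and keep full responses."""
    records = []
    for prompt in prompts:
        records.append({
            "prompt": prompt,
            "baseline": run(prompt, None),          # step 1: no skill loaded
            "with_skill": run(prompt, skill_text),  # step 2: identical prompt, skill loaded
        })
    return records


def stub_runner(prompt: str, skill_text: Optional[str]) -> str:
    """Stand-in for a real fresh-subagent call; replace with your own launcher."""
    prefix = "[skill]" if skill_text else "[no skill]"
    return f"{prefix} response to: {prompt}"


records = run_comparison(["Write a test plan"], "skill body here", stub_runner)
```

Storing both full responses next to the prompt keeps the before/after comparison in one record.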

Detailed Testing Guide

For complete testing patterns, examples, and templates:

Testing Patterns - Full TDD methodology
Test Examples - Baseline, with-skill, rationalization tests
Analysis Templates - Scoring and comparison frameworks

Success Criteria

Baseline: Document 5+ diverse baseline scenarios
Improvement: ≥50% improvement over baseline in skill-related metrics
Consistency: Results reproducible across fresh instances
Rationalization Defense: Guardrails prevent ≥80% of rationalization attempts
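
The criteria above can be checked mechanically once scores exist. The sketch below makes several assumptions that the source does not state: improvement is computed as relative change against the baseline score, consistency is judged by the spread of repeated with-skill scores against a chosen `tolerance`, and all names (`Results`, `evaluate`) and the 0 to 100 scoring scale are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Results:
    """Hypothetical summary of one skill's test run."""
    baseline_scenarios: int          # number of documented baseline scenarios
    baseline_score: float            # mean score without the skill
    with_skill_score: float          # mean score with the skill loaded
    repeat_scores: list[float]       # with-skill scores from repeated fresh instances
    rationalization_attempts: int    # pressure prompts tried
    rationalization_blocked: int     # attempts the guardrails resisted


def evaluate(r: Results, tolerance: float) -> dict[str, bool]:
    """Return pass/fail per criterion. Formulas are assumptions, not the skill's."""
    if r.baseline_score <= 0 or r.rationalization_attempts <= 0 or not r.repeat_scores:
        raise ValueError("need a positive baseline score, attempts, and repeat scores")
    improvement = (r.with_skill_score - r.baseline_score) / r.baseline_score
    spread = max(r.repeat_scores) - min(r.repeat_scores)
    defense = r.rationalization_blocked / r.rationalization_attempts
    return {
        "baseline": r.baseline_scenarios >= 5,
        "improvement": improvement >= 0.5,
        "consistency": spread <= tolerance,
        "rationalization_defense": defense >= 0.8,
    }


report = evaluate(
    Results(
        baseline_scenarios=5,
        baseline_score=40.0,
        with_skill_score=70.0,
        repeat_scores=[70.0, 72.0, 68.0],
        rationalization_attempts=10,
        rationalization_blocked=9,
    ),
    tolerance=5.0,
)
```

The Analysis Templates listed under Detailed Testing Guide may define scoring differently, so treat these formulas as placeholders.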

