Prompt Design & Tuning Best Practices

The goal of this Skill is not to casually “chat about prompts,” but to turn prompt tuning into an executable, reviewable, and cost-controlled engineering workflow.

The Agent handles most of the execution work.
Humans are responsible only for validating direction, approving high-cost loops, and signing off on the final launch candidate.

When to Use

Use this Skill when the user needs to:

design a target prompt from scratch or optimize an existing one
design a separate evaluation / judge prompt
compare the performance of multiple models on an evaluation set
work with an existing API curl, SDK integration, or request protocol
run controlled prompt iterations under a limited budget
turn prompt tuning into a reusable workflow instead of a one-off chat exercise

Working Modes

1. Design-Only Mode

Use this mode when:

there is no runnable environment yet
no evaluation resources are available yet
real model calls cannot be executed yet

In this mode, the Agent should produce:

task definition
target prompt draft
judge prompt draft
evaluation plan
script skeletons
manual execution guidance
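
As an illustration of what a script skeleton might look like in this mode, here is a minimal sketch. The names `call_model` and `run_batch`, the `{input}` placeholder, and the case format are all hypothetical and not prescribed by this Skill. The model call is deliberately left as a stub, because no runnable environment exists yet.

```python
from typing import Callable, Dict, List


def call_model(prompt: str) -> str:
    """Placeholder for the real model call (hypothetical name).

    Left unimplemented on purpose: in Design-Only Mode there is no
    runnable environment, so a human wires this up later.
    """
    raise NotImplementedError("Connect this to the actual API or SDK.")


def run_batch(
    target_prompt: str,
    cases: List[Dict[str, str]],
    model_fn: Callable[[str], str] = call_model,
) -> List[Dict[str, str]]:
    """Fill the target prompt for each case and collect the raw outputs."""
    results = []
    for case in cases:
        # Assumed convention: the prompt template marks the input slot as {input}.
        filled = target_prompt.replace("{input}", case["input"])
        results.append({"id": case["id"], "output": model_fn(filled)})
    return results
```

Passing the model function as a parameter lets a reviewer dry-run the skeleton with a fake callable before any real calls are made.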

2. Execution Mode

Use this mode when:

a runnable environment already exists
the model invocation method has been provided
the evaluation set, resource limits, and candidate models have been provided

In this mode, the Agent should continue with:

batch generation
automatic evaluation
result analysis
prompt iteration
final candidate recommendation
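
The choice between the two modes can be sketched as a simple precondition check. This is only one possible reading: the `TuningInputs` fields and the `choose_mode` function are illustrative names, and falling back to Design-Only Mode whenever any Execution Mode input is missing is an assumption rather than a rule stated by this Skill.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TuningInputs:
    """Inputs the Agent has been given (all field names are illustrative)."""
    runnable_env: bool = False
    invocation_method: Optional[str] = None  # e.g. a curl command or SDK snippet
    eval_set_path: Optional[str] = None
    budget_limit: Optional[float] = None  # resource limit, in any agreed unit
    candidate_models: List[str] = field(default_factory=list)


def choose_mode(inputs: TuningInputs) -> str:
    """Return "execution" only when every Execution Mode precondition holds."""
    ready = (
        inputs.runnable_env
        and inputs.invocation_method is not None
        and inputs.eval_set_path is not None
        and inputs.budget_limit is not None
        and bool(inputs.candidate_models)
    )
    # Assumption: anything short of a complete setup falls back to design-only.
    return "execution" if ready else "design_only"
```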

Core Principles

The following rules are non-negotiable by default:

The target prompt and the judge prompt must be separated.
Do not silently modify both in the same comparison round and then mix their gains together.
Before large-scale evaluation, the task definition (task spec) must be frozen.
Every round of prompt optimization must have a clear optimization hypothesis.
Do not make ad hoc edits of the “this sentence feels off, let’s tweak it” kind.
An experiment log must be maintained, including at least:
- version number
- summary of changes in the current round
- optimization hypothesis
- evaluation results
- cost information
- conclusion
Any high-cost evaluation loop must be approved by a human beforehand.
The final launch candidate must be reviewed by a human.
A high machine-evaluation score does not automatically mean it is ready for launch.
If the input information is incomplete, low-risk assumptions may be made, but they must be stated explicitly.
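
Several of these principles can be enforced mechanically. The sketch below shows one way to do so, under stated assumptions: `RoundRecord`, `check_round`, and `needs_approval` are hypothetical names, the two prompt fields are additions used only for the separation check, and the cost threshold is a value the human would set, since this Skill does not define what counts as "high-cost".

```python
from dataclasses import dataclass


@dataclass
class RoundRecord:
    """One experiment-log entry; the first six fields mirror the minimum list above."""
    version: str
    change_summary: str
    hypothesis: str
    eval_results: dict
    cost: float
    conclusion: str
    # Extra fields (assumptions) used only for the separation check below.
    target_prompt: str = ""
    judge_prompt: str = ""


def check_round(prev: RoundRecord, curr: RoundRecord) -> None:
    """Reject rounds that break the hypothesis or separation rules."""
    if not curr.hypothesis.strip():
        raise ValueError("Each round needs an explicit optimization hypothesis.")
    target_changed = prev.target_prompt != curr.target_prompt
    judge_changed = prev.judge_prompt != curr.judge_prompt
    if target_changed and judge_changed:
        # Changing both at once makes it impossible to attribute the gain.
        raise ValueError("Target and judge prompts both changed in one round.")


def needs_approval(estimated_cost: float, threshold: float) -> bool:
    """True when a loop is expensive enough to require human sign-off."""
    return estimated_cost > threshold
```

A round would be logged only after `check_round` passes, and any loop for which `needs_approval` returns `True` would wait for a human decision before running.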

Recommended Inputs to Collect

The Agent should gather or infer the following whenever possible:
