Evaluation and Selection in Self-Improvement
How to judge which generated prompts are better and select winners.
Why Evaluation Is the Hard Part
Generating prompt variants is easy; evaluating which variant is actually better is the hard part of self-improvement. Without rigorous evaluation, you cannot tell whether a rewritten prompt is genuinely improved or just different. The quality of the evaluation therefore bounds the quality of the entire improvement loop.
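One way to separate "improved" from "just different" is to score both prompts on the same evaluation examples and check whether the gain survives resampling. The sketch below is illustrative only: it assumes you already have per-example scores for each prompt, and the function names, the 95% interval, and the accept-if-lower-bound-above-zero rule are assumptions chosen for the example, not a prescribed method.

```python
import random
from statistics import mean


def paired_bootstrap_gain(baseline_scores, candidate_scores, n_resamples=2000, seed=0):
    """Return (mean_gain, ci_low, ci_high) for candidate minus baseline.

    Both lists must hold scores for the same examples in the same order.
    """
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError('score lists must cover the same examples')
    # Per-example differences keep the comparison paired.
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(diffs)
    # Resample the differences with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    ci_low = resampled_means[int(0.025 * n_resamples)]
    ci_high = resampled_means[int(0.975 * n_resamples) - 1]
    return mean(diffs), ci_low, ci_high


def is_genuine_improvement(baseline_scores, candidate_scores):
    """Accept the candidate only if the whole interval lies above zero."""
    _, ci_low, _ = paired_bootstrap_gain(baseline_scores, candidate_scores)
    return ci_low > 0
```

A candidate whose interval straddles zero is treated as "just different" and the current prompt is kept. Larger evaluation sets narrow the interval, so small gains become detectable.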
Scoring Criteria Design
Before running a self-improvement loop, define what 'better' means for your task. The scoring criteria should be objective, measurable, and directly tied to the task's success conditions.
# Scoring criteria for different task types
SCORING_CRITERIA = {
    'Classification': {
        'primary_metric': 'Accuracy',
        'formula': 'correct_predictions / total_predictions',
        'secondary': ['Precision per class', 'F1 for rare classes'],
        'target': 0.90,
    },
    'Summarization': {
        'primary_metric': 'LLM judge quality score',
        'formula': 'mean of judge scores on a 1-5 scale',
        'secondary': ['ROUGE-L', 'Coverage of key facts', 'Hallucination rate'],
        'target': 4.0,  # on the 1-5 judge scale, not normalized
    },
    'Code generation': {
        'primary_metric': 'Test pass rate',
        'formula': 'tests_passing / total_tests',
        'secondary': ['Syntax validity rate', 'Edge case coverage'],
        'target': 0.85,
    },
    'Information extraction': {
        'primary_metric': 'F1 (harmonic mean of precision and recall)',
        'formula': '2 * precision * recall / (precision + recall)',
        'secondary': ['Field-level accuracy', 'Format compliance rate'],
        'target': 0.88,
    },
}

# Print each task's primary metric and its target
for task, criteria in SCORING_CRITERIA.items():
    print(f'{task}: {criteria["primary_metric"]} (target: {criteria["target"]})')
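Once each variant has a score on the primary metric, selecting a winner can be a small, explicit rule. The following is a minimal sketch under stated assumptions: the function `select_winner`, the `min_gain` margin, and the variant names and scores are all hypothetical and exist only to illustrate the idea.

```python
def select_winner(candidate_scores, baseline_score, target, min_gain=0.01):
    """Pick the highest-scoring candidate that beats the baseline by min_gain.

    Returns (name, score, meets_target). If no candidate clears the bar,
    name is None and the baseline score is reported, meaning the current
    prompt should be kept.
    """
    best_name, best_score = None, baseline_score
    for name, score in candidate_scores.items():
        # Require a minimum margin so tiny fluctuations do not count as wins.
        if score >= baseline_score + min_gain and score > best_score:
            best_name, best_score = name, score
    return best_name, best_score, best_score >= target


# Hypothetical accuracy scores for three prompt variants (illustrative only)
scores = {'variant_a': 0.87, 'variant_b': 0.91, 'variant_c': 0.84}
winner, score, meets_target = select_winner(scores, baseline_score=0.86, target=0.90)
print(f'winner: {winner}, score: {score}, meets target: {meets_target}')
```

Keeping the baseline as the default winner means the loop never replaces a working prompt with one that has not demonstrably beaten it.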