AI Prompt Engineering · Lesson

Evaluation and Selection in Self-Improvement

How to judge which generated prompts are better and select winners.

Why Evaluation Is the Hard Part

Generating prompt variants is easy. Deciding which variant is actually better is the hard part of self-improvement. Without rigorous evaluation, you cannot tell whether a rewritten prompt is genuinely better or merely different. The quality of the evaluation therefore limits the quality of the entire improvement loop.
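
One way to separate "better" from "just different" is to score both prompts on the same examples and check how often the gap survives resampling. The sketch below is a minimal paired bootstrap, not a prescribed method from this lesson; the function name, the resample count, and the example scores are all hypothetical.

```python
import random


def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of bootstrap resamples in which variant B outscores variant A.

    scores_a and scores_b are per-example scores (for example 1 = correct,
    0 = wrong) for the two prompts on the SAME evaluation examples.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both variants must be scored on the same examples")
    if not scores_a:
        raise ValueError("need at least one scored example")

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping A/B pairs aligned.
        indices = [rng.randrange(n) for _ in range(n)]
        total_diff = sum(scores_b[i] - scores_a[i] for i in indices)
        if total_diff > 0:
            wins += 1
    return wins / n_resamples


# Hypothetical per-example correctness for an original and a rewritten prompt.
original = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
rewritten = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
print(paired_bootstrap_win_rate(original, rewritten))
```

A win rate close to 1.0 suggests the rewrite is consistently ahead on this evaluation set, while a value near 0.5 suggests the two prompts are hard to tell apart with the data available. With only a handful of examples the estimate is coarse, so a larger evaluation set gives a more trustworthy signal.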

Scoring Criteria Design

Before running a self-improvement loop, define what 'better' means for your task. The scoring criteria should be objective, measurable, and directly tied to the task's success conditions.

# Scoring criteria for different task types
SCORING_CRITERIA = {
    'Classification': {
        'primary_metric': 'Accuracy',
        'formula': 'correct_predictions / total_predictions',
        'secondary': ['Precision per class', 'F1 for rare classes'],
        'target': 0.90
    },
    'Summarization': {
        'primary_metric': 'LLM judge quality score',
        'formula': 'mean of judge scores on a 1-5 scale',
        'secondary': ['ROUGE-L', 'Coverage of key facts', 'Hallucination rate'],
        'target': 4.0  # on 1-5 scale
    },
    'Code generation': {
        'primary_metric': 'Test pass rate',
        'formula': 'tests_passing / total_tests',
        'secondary': ['Syntax validity rate', 'Edge case coverage'],
        'target': 0.85
    },
    'Information extraction': {
        'primary_metric': 'F1 (harmonic mean of precision and recall)',
        'formula': '2 * precision * recall / (precision + recall)',
        'secondary': ['Field-level accuracy', 'Format compliance rate'],
        'target': 0.88
    }
}

for task, criteria in SCORING_CRITERIA.items():
    print(f'{task}: {criteria["primary_metric"]} (target: {criteria["target"]})')
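
The formulas in the table above can be turned into small functions, and the target can then act as a gate when picking a winner. The following is a sketch under assumptions: `accuracy`, `f1_score`, and `select_winner` are illustrative helper names, and the variant scores in the usage lines are made up.

```python
def accuracy(predictions, labels):
    """correct_predictions / total_predictions."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def f1_score(true_positives, false_positives, false_negatives):
    """2 * precision * recall / (precision + recall); 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def select_winner(metric_by_variant, target):
    """Return (best variant name, its score, whether it meets the target)."""
    if not metric_by_variant:
        raise ValueError("no variants to compare")
    best = max(metric_by_variant, key=metric_by_variant.get)
    best_score = metric_by_variant[best]
    return best, best_score, best_score >= target


# Hypothetical primary-metric values for two prompt variants on an extraction task.
variant_scores = {
    'prompt_v1': f1_score(true_positives=70, false_positives=20, false_negatives=30),
    'prompt_v2': f1_score(true_positives=90, false_positives=10, false_negatives=10),
}
print(select_winner(variant_scores, target=0.88))
```

Returning the target check alongside the winner keeps two questions separate: which variant is best among the candidates, and whether the best one is good enough to stop the loop.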
