AI Prompt Engineering · Lesson

Regression Testing Across Model Updates

Running test suites when upgrading from GPT-4 to GPT-4o or Claude 3 to 3.5.

Why Model Updates Break Prompts

LLM providers regularly update their models, either by releasing new model families or by publishing new dated snapshots of an existing one: GPT-4 → GPT-4o → gpt-4o-2024-11-20, Claude 3 → Claude 3.5 → Claude 3.7. Each update changes the model's behavior, often improving most tasks but occasionally regressing on specific prompts.

Without a test suite, regressions stay invisible until users report them. With a test suite, you can detect regressions within minutes of running it against the updated model.

The Model Update Problem

Model updates can cause three types of changes:

Improvements: previously failing test cases now pass (good)
Neutral: behavior is unchanged (this covers most tests)
Regressions: previously passing test cases now fail (must investigate)
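
The three categories above can be computed mechanically once you have pass/fail results for the same test cases on both models. The following is a minimal sketch, not a prescribed implementation: the function name `classify_changes`, the test case IDs, and the result dictionaries are all hypothetical and exist only for illustration.

```python
from typing import Dict, List


def classify_changes(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Bucket each test case by how its pass/fail status changed."""
    buckets: Dict[str, List[str]] = {
        "improvements": [],
        "neutral": [],
        "regressions": [],
    }
    # Only compare cases that were run against both models.
    for case_id in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[case_id], candidate[case_id]
        if before == after:
            buckets["neutral"].append(case_id)
        elif after:
            buckets["improvements"].append(case_id)  # fail -> pass
        else:
            buckets["regressions"].append(case_id)   # pass -> fail
    return buckets


# Hypothetical results: True means the test case passed.
old_results = {"summarize_invoice": True, "extract_dates": False, "refuse_pii": True}
new_results = {"summarize_invoice": True, "extract_dates": True, "refuse_pii": False}

changes = classify_changes(old_results, new_results)
print(changes["regressions"])
```

Note that the overall pass rate in this example is identical for both models (2 of 3), which is why comparing per-case results is more informative than comparing a single aggregate number.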

Even a 1% regression rate is significant: if you have 200 test cases and 2 start failing after a model update, those 2 may be your most critical use cases.
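
To make the arithmetic concrete, here is a small sketch that computes the regression rate and separately checks whether any regressed case is one you have marked as critical. The case names, the `critical` set, and the 2% threshold are assumptions chosen for illustration; they are not part of any standard.

```python
from typing import Iterable, List, Set, Tuple


def regression_report(
    total_cases: int, regressed: Iterable[str], critical: Set[str]
) -> Tuple[float, List[str]]:
    """Return the regression rate and the regressed cases marked critical."""
    regressed = list(regressed)
    rate = len(regressed) / total_cases
    critical_hits = sorted(set(regressed) & critical)
    return rate, critical_hits


# Hypothetical numbers matching the text: 200 cases, 2 regressions.
rate, critical_hits = regression_report(
    total_cases=200,
    regressed=["checkout_refund", "tone_casual"],
    critical={"checkout_refund", "login_reset"},
)

MAX_REGRESSION_RATE = 0.02  # assumed threshold for this sketch
# Block the upgrade if the rate is too high OR any critical case regressed.
should_block = rate > MAX_REGRESSION_RATE or bool(critical_hits)

print(f"regression rate: {rate:.1%}, critical regressions: {critical_hits}")
```

Here the rate alone (1%) would pass the assumed threshold, but the upgrade is still blocked because one of the two regressed cases is critical.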

# Pass rate of the test suite on each model version, as a fraction of cases passed.
# pass_rate is None until the suite has been run against that model.
MODEL_HISTORY = [
    {'model': 'gpt-4', 'deployed': '2023-03-14', 'pass_rate': 0.87},
    {'model': 'gpt-4-turbo', 'deployed': '2023-11-06', 'pass_rate': 0.91},
    {'model': 'gpt-4o', 'deployed': '2024-05-13', 'pass_rate': 0.93},
    {'model': 'gpt-4o-2024-11-20', 'deployed': '2024-11-20', 'pass_rate': None},  # to be measured
]

# Goal: measure pass_rate for the new model before deploying to production
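
One way to fill in the missing pass rate is sketched below, under several assumptions. `fake_new_model` is a stand-in for whatever call you make to the new model (no real provider API is used here), each test case is a hypothetical (prompt, check) pair, and the deploy rule "new rate must be at least the last measured rate" is just one possible policy.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A test case is a prompt plus a function that checks the model output.
TestCase = Tuple[str, Callable[[str], bool]]


def measure_pass_rate(generate: Callable[[str], str], cases: List[TestCase]) -> float:
    """Run every case through `generate` and return the fraction that pass."""
    if not cases:
        raise ValueError("cannot measure a pass rate with no test cases")
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)


def latest_baseline(history: List[Dict]) -> Optional[float]:
    """Return the most recent pass rate that has actually been measured."""
    measured = [e["pass_rate"] for e in history if e["pass_rate"] is not None]
    return measured[-1] if measured else None


def fake_new_model(prompt: str) -> str:
    # Stand-in for a real model call; it simply upper-cases the prompt.
    return prompt.upper()


cases: List[TestCase] = [
    ("hello", lambda out: out == "HELLO"),
    ("return json", lambda out: out.startswith("{")),
    ("abc", lambda out: len(out) == 3),
    ("ok", lambda out: "OK" in out),
]

history = [
    {"model": "gpt-4o", "deployed": "2024-05-13", "pass_rate": 0.93},
    {"model": "gpt-4o-2024-11-20", "deployed": "2024-11-20", "pass_rate": None},
]

new_rate = measure_pass_rate(fake_new_model, cases)
baseline = latest_baseline(history)
# Assumed policy: only deploy if the new model does not score below the baseline.
safe_to_deploy = baseline is not None and new_rate >= baseline

print(f"new: {new_rate:.2f}, baseline: {baseline}, deploy: {safe_to_deploy}")
```

In practice you would replace `fake_new_model` with a function that calls the new model, and you would likely run each case more than once, since model outputs are not always deterministic.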
