Regression Testing Across Model Updates
Running test suites when upgrading from GPT-4 to GPT-4o or Claude 3 to 3.5.
Why Model Updates Break Prompts
LLM providers regularly update their models: GPT-4 → GPT-4o → GPT-4o-2024-11-20, Claude 3 → Claude 3.5 → Claude 3.7. Each update changes the model's behavior, often improving on most tasks but occasionally regressing on specific prompts.
Without a test suite, regressions stay invisible until users report them. With a test suite, you can detect them within minutes of running the suite against the updated model, before it reaches production.
The Model Update Problem
Model updates can cause three types of changes:
- Improvements: previously failing test cases now pass — good
- Neutral: behavior unchanged — most tests
- Regressions: previously passing test cases now fail — must investigate
Even a 1% regression rate is significant: if you have 200 test cases and 2 start failing after a model update, those 2 may be your most critical use cases.
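The three categories above can be computed by comparing per-case results from the old and new model. The following is a minimal sketch, not a prescribed implementation: the function name `classify_changes`, the case IDs, and the pass/fail data are all hypothetical, and it assumes both runs cover the same set of test cases.

```python
from collections import Counter


def classify_changes(baseline, candidate):
    """Label each test case by comparing pass/fail across two runs.

    Both arguments map a test-case ID to True (pass) or False (fail).
    Assumes every ID in `baseline` also appears in `candidate`.
    """
    labels = {}
    for case_id, passed_before in baseline.items():
        passed_after = candidate[case_id]
        if passed_before and not passed_after:
            labels[case_id] = "regression"    # must investigate
        elif passed_after and not passed_before:
            labels[case_id] = "improvement"
        else:
            labels[case_id] = "neutral"       # same outcome in both runs
    return labels


# Made-up results for 200 test cases (illustration only).
baseline = {f"case_{i:03d}": i >= 10 for i in range(200)}
candidate = dict(baseline)
candidate["case_003"] = True    # failed before, passes now
candidate["case_007"] = True
candidate["case_050"] = False   # passed before, fails now
candidate["case_120"] = False

labels = classify_changes(baseline, candidate)
counts = Counter(labels.values())
regressions = sorted(cid for cid, label in labels.items() if label == "regression")
regression_rate = counts["regression"] / len(baseline)  # 2 / 200 = 0.01

print(dict(counts))
print(regressions, regression_rate)
```

Listing the regressed case IDs, rather than only the rate, makes it easier to see whether the few failures land on critical use cases.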
MODEL_HISTORY = [
    {'model': 'gpt-4', 'deployed': '2023-03-14', 'pass_rate': 0.87},
    {'model': 'gpt-4-turbo', 'deployed': '2023-11-06', 'pass_rate': 0.91},
    {'model': 'gpt-4o', 'deployed': '2024-05-13', 'pass_rate': 0.93},
    {'model': 'gpt-4o-2024-11-20', 'deployed': '2024-11-20', 'pass_rate': None},  # to be measured
]
# Goal: measure pass_rate for the new model before deploying to production
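Measuring the pass rate for a candidate model and comparing it with the last measured one might look roughly like this. It is a sketch under stated assumptions: `fake_run_prompt` is a hypothetical stand-in for a real API call, the test cases and their checks are invented for illustration, and `measure_pass_rate` and `meets_baseline` are names introduced here rather than part of any library.

```python
def measure_pass_rate(test_cases, run_prompt, model):
    """Run every test case against `model` and return the fraction that pass."""
    passed = 0
    for case in test_cases:
        output = run_prompt(model, case["prompt"])
        if case["check"](output):
            passed += 1
    return passed / len(test_cases)


def meets_baseline(history, candidate_rate, tolerance=0.0):
    """Compare the candidate's pass rate with the most recent measured one."""
    measured = [h["pass_rate"] for h in history if h["pass_rate"] is not None]
    return candidate_rate >= measured[-1] - tolerance


def fake_run_prompt(model, prompt):
    # Hypothetical stub: returns canned text so the sketch runs offline.
    # A real version would call the provider's API with `model`.
    canned = {
        "Reply with the word OK.": "OK",
        "What is 2 + 2? Answer with a number only.": "4",
        "Name the capital of France.": "The capital is Paris.",
        "Return only a JSON object.": "Sure! Here is the JSON: {}",
    }
    return canned[prompt]


# Invented test cases; each pairs a prompt with an assertion on the output.
test_cases = [
    {"prompt": "Reply with the word OK.", "check": lambda out: out.strip() == "OK"},
    {"prompt": "What is 2 + 2? Answer with a number only.", "check": lambda out: out.strip() == "4"},
    {"prompt": "Name the capital of France.", "check": lambda out: "Paris" in out},
    {"prompt": "Return only a JSON object.", "check": lambda out: out.strip().startswith("{")},
]

# Same shape as MODEL_HISTORY, shortened to the last two entries.
history = [
    {"model": "gpt-4o", "deployed": "2024-05-13", "pass_rate": 0.93},
    {"model": "gpt-4o-2024-11-20", "deployed": "2024-11-20", "pass_rate": None},
]

rate = measure_pass_rate(test_cases, fake_run_prompt, "gpt-4o-2024-11-20")
ok_to_deploy = meets_baseline(history, rate)
print(rate, ok_to_deploy)
```

An aggregate pass rate can hide individual regressions when improvements elsewhere offset them, so a gate like this works best alongside a per-case comparison.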