Evaluating Tuned Models vs Base
Run an A/B comparison against the base model on the same eval set. Fine-tuning sometimes hurts more than it helps.
Don't Trust Training Loss
Low training loss does not guarantee better results in production. The model may overfit the training data, lose general capability, or regress on tasks it was not tuned for.
Always evaluate the tuned model on a held-out eval set.
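A minimal sketch of such a comparison follows. It assumes each model is wrapped in a plain function that maps a prompt string to an output string, and that exact match is a reasonable metric for the task. The names `exact_match_accuracy` and `compare_to_base` are hypothetical helpers made up for this illustration, not part of any library.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]          # (prompt, expected output)
Predictor = Callable[[str], str]   # assumed wrapper around a model call


def exact_match_accuracy(predict: Predictor, examples: Sequence[Example]) -> float:
    """Fraction of held-out examples where the output matches exactly."""
    if not examples:
        raise ValueError("eval set is empty")
    correct = sum(
        1 for prompt, expected in examples
        if predict(prompt).strip() == expected.strip()
    )
    return correct / len(examples)


def compare_to_base(
    base_predict: Predictor,
    tuned_predict: Predictor,
    examples: Sequence[Example],
) -> Dict[str, float]:
    """Score both models on the SAME held-out examples and report the gap."""
    base = exact_match_accuracy(base_predict, examples)
    tuned = exact_match_accuracy(tuned_predict, examples)
    return {"base": base, "tuned": tuned, "delta": tuned - base}
```

A negative `delta` means the tuned model is worse than the base model on this eval set. Running the same comparison on a second eval set of general tasks, outside the fine-tuning domain, is one way to spot lost general capability.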
Three Must-Have Data Splits
- Train — used in fine-tuning
- Validation — used to pick best checkpoint
- Test — never seen during training; ONLY for final eval
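One way to produce the three splits is a single seeded shuffle followed by fixed cuts, sketched below. The function name `split_examples` and the 80/10/10 default ratios are assumptions chosen for illustration; the right proportions depend on how much data you have.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_examples(
    examples: Sequence[T],
    val_frac: float = 0.1,
    test_frac: float = 0.1,
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("val_frac and test_frac must be >= 0 and sum to < 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

The fixed seed matters: if the split changes between runs, examples from an earlier test set can end up in a later training set, and the final score stops being a held-out measurement. If your data contains near-duplicates or several examples from the same source, grouping by source before splitting is a safer variant than this per-example shuffle.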