Building a Golden Test Set
Curate 50-200 high-quality (input, expected output) pairs that cover the long tail of real usage.
What Is a Gold Set?
A "golden test set" (or "gold set") is a curated collection of (input, expected output) pairs that defines what good behavior looks like.
Every change to the system (a new prompt, model, or tool) is run against this set, so regressions surface before they reach users.
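As a minimal sketch of what "measured against this set" can look like in practice, the snippet below stores gold cases as (input, expected output) pairs and reports the fraction a system gets right. The names `GoldCase`, `pass_rate`, and `toy_system` are hypothetical, and exact string match is an assumed scoring rule chosen for simplicity; real agents often need a looser comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldCase:
    """One curated (input, expected output) pair."""
    input: str
    expected: str


def pass_rate(system: Callable[[str], str], gold_set: list[GoldCase]) -> float:
    """Return the fraction of gold cases where the system output matches exactly.

    Exact match (after stripping whitespace) is an assumption for this sketch.
    """
    if not gold_set:
        raise ValueError("gold set is empty")
    passed = sum(
        1 for case in gold_set
        if system(case.input).strip() == case.expected.strip()
    )
    return passed / len(gold_set)


# Illustrative data only; a real gold set would hold 50-200 curated cases.
GOLD_SET = [
    GoldCase(input="2 + 2", expected="4"),
    GoldCase(input="capital of France", expected="Paris"),
]


def toy_system(prompt: str) -> str:
    """Stand-in for the system under test; it only knows one answer."""
    answers = {"2 + 2": "4"}
    return answers.get(prompt, "unknown")


if __name__ == "__main__":
    print(f"pass rate: {pass_rate(toy_system, GOLD_SET):.2f}")
```

Running the same function before and after a change gives two comparable numbers, which is the core of using the gold set as a regression check.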
Properties of a Good Gold Set
- Diverse — covers common AND edge cases
- Curated by humans — not auto-generated
- Versioned — frozen in your repo
- Documented — each case has a rationale
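Some of these properties can be checked mechanically. The sketch below assumes each case is a dictionary with `id`, `input`, `expected`, `rationale`, and `tags` fields; that schema and the helper names are hypothetical, not a standard. It flags cases with no rationale, counts tags as a rough view of coverage, and computes a content hash so an unintended change to a frozen set is easy to spot. Whether the cases were curated by humans is not something code can verify.

```python
import hashlib
import json
from collections import Counter

# Assumed schema for this sketch; adapt the field names to your own format.
REQUIRED_FIELDS = ("id", "input", "expected", "rationale", "tags")


def validate_case(case: dict) -> list[str]:
    """Return a list of problems for one case (an empty list means it is valid)."""
    return [
        f"missing or empty field: {field}"
        for field in REQUIRED_FIELDS
        if not case.get(field)
    ]


def tag_counts(cases: list[dict]) -> Counter:
    """Count how many cases carry each tag, as a rough diversity check."""
    return Counter(tag for case in cases for tag in case.get("tags", []))


def fingerprint(cases: list[dict]) -> str:
    """Hash the set's content, independent of case order, to detect edits."""
    ordered = sorted(cases, key=lambda c: str(c.get("id", "")))
    canonical = json.dumps(ordered, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative cases only.
CASES = [
    {"id": "c1", "input": "2 + 2", "expected": "4",
     "rationale": "Most common request type.", "tags": ["common"]},
    {"id": "c2", "input": "", "expected": "Please enter a question.",
     "rationale": "", "tags": ["edge"]},
]

if __name__ == "__main__":
    for case in CASES:
        for problem in validate_case(case):
            print(f"{case['id']}: {problem}")
    print(dict(tag_counts(CASES)))
    print(fingerprint(CASES))
```

Committing the fingerprint next to the data file is one way to make "frozen in your repo" enforceable: a test can fail whenever the hash changes without a matching version bump.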