Stratified Sampling & Calibration
Sample by segment; calibrate with labeled validation sets.
Why Aggregate Accuracy Lies
You ship a Claude extraction pipeline and report 97% accuracy. Leadership approves full automation. Three weeks later, every refund invoice with a foreign-currency line is wrong.
The headline number was real, but it was an average. Aggregate accuracy can hide catastrophic failure on a specific document type or a specific field, because the common cases drown out the rare ones.
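A small worked example makes the arithmetic concrete. The sketch below uses made-up counts (1,000 documents, with hypothetical segment names `standard_invoice` and `refund_foreign_currency`) chosen only to reproduce the 97% scenario above. It is illustrative, not real evaluation data.

```python
from collections import defaultdict

# Hypothetical evaluation results: one (segment, is_correct) pair per document.
# The counts are invented to match the 97% headline figure in the story above.
results = (
    [("standard_invoice", True)] * 970
    + [("standard_invoice", False)] * 10
    + [("refund_foreign_currency", False)] * 20
)


def accuracy_by_segment(rows):
    """Return {segment: accuracy} for an iterable of (segment, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for segment, ok in rows:
        totals[segment] += 1
        correct[segment] += int(ok)
    return {seg: correct[seg] / totals[seg] for seg in totals}


# Aggregate accuracy: every document weighted equally, regardless of segment.
overall = sum(ok for _, ok in results) / len(results)
per_segment = accuracy_by_segment(results)

print(f"overall: {overall:.3f}")
for seg, acc in per_segment.items():
    print(f"{seg}: {acc:.3f}")
```

Under these invented counts the overall figure is 0.970, while the `refund_foreign_currency` segment scores 0.000. The 20 rare documents are only 2% of the population, so their total failure barely moves the average.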
This lesson is about the discipline that keeps you honest before you automate: stratified sampling to measure where you actually fail, and calibration on labeled validation sets so your confidence scores mean something.
The Core Rule
Memorize the exam-level principle from the oversight domain:
Aggregate accuracy can hide poor performance on a specific document type or field. Use stratified random sampling plus field-level confidence calibrated on labeled validation sets before automating.
Two moves, in order:
- Stratify, then sample: partition the population into segments and sample randomly within each, so that rare-but-critical slices are actually measured.
- Calibrate confidence: check the model's per-field confidence against ground-truth labels, so that a stated 0.9 is right about 90% of the time.
Skip either step and you are guessing, not governing.
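The two moves can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: the helper names `stratified_sample` and `calibration_table`, the document fields, and all counts are hypothetical, and equal-width confidence bins are just one common way to check calibration.

```python
import random
from collections import defaultdict


def stratified_sample(docs, key, per_segment, seed=0):
    """Draw up to `per_segment` random documents from each segment."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    groups = defaultdict(list)
    for doc in docs:
        groups[key(doc)].append(doc)
    return {
        segment: rng.sample(members, min(per_segment, len(members)))
        for segment, members in groups.items()
    }


def calibration_table(predictions, n_bins=5):
    """Bin (confidence, is_correct) pairs into equal-width confidence bins.

    Returns rows of (bin_low, bin_high, count, mean_confidence, accuracy).
    """
    bins = defaultdict(list)
    for confidence, ok in predictions:
        idx = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((confidence, ok))
    table = []
    for idx in sorted(bins):
        rows = bins[idx]
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(ok for _, ok in rows) / len(rows)
        table.append((idx / n_bins, (idx + 1) / n_bins, len(rows), mean_conf, accuracy))
    return table


# Hypothetical population: 980 common documents and 20 rare ones.
docs = (
    [{"id": i, "segment": "standard_invoice"} for i in range(980)]
    + [{"id": 980 + i, "segment": "refund_foreign_currency"} for i in range(20)]
)
sample = stratified_sample(docs, key=lambda d: d["segment"], per_segment=15)
for segment, picked in sample.items():
    print(segment, len(picked))

# Hypothetical labeled validation results for one field: (model confidence, was it correct?).
field_predictions = (
    [(0.95, True)] * 6 + [(0.95, False)] * 4
    + [(0.55, True)] * 5 + [(0.55, False)] * 5
)
table = calibration_table(field_predictions)
for low, high, count, mean_conf, accuracy in table:
    print(f"[{low:.1f}, {high:.1f}) n={count} stated={mean_conf:.2f} observed={accuracy:.2f}")
```

In this invented data each segment contributes 15 documents to the sample, so the rare segment is reviewed as heavily as the common one. In the top confidence bin the model states 0.95 on average but is right only 0.60 of the time, which is the kind of gap that should block automation for that field until it is closed.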