Claude Architect · Lesson

Stratified Sampling & Calibration

Sample by segment; calibrate with labeled validation sets.

Why Aggregate Accuracy Lies

You ship a Claude extraction pipeline and report 97% accuracy. Leadership approves full automation. Three weeks later, every refund invoice with a foreign-currency line is wrong.

The headline number was real — but it was an average. Aggregate accuracy can hide catastrophic failure on a specific document type or a specific field, because the common cases drown out the rare ones.

This lesson is about the discipline that keeps you honest before you automate: stratified sampling to measure where you actually fail, and calibration on labeled validation sets so your confidence scores mean something.

The Core Rule

Memorize the exam-level principle from the oversight domain:

Aggregate accuracy can hide poor performance on a specific document type or field. Use stratified random sampling plus field-level confidence calibrated on labeled validation sets before automating.

Two moves, in order:

Stratify, then sample — partition the population into segments, sample within each, so rare-but-critical slices are actually measured.
Calibrate confidence — make the model's per-field confidence correspond to real-world correctness, proven against ground-truth labels.

Skip either step and you are guessing, not governing.

All lessons in this course

← Back to Claude Architect