Aggregate Metrics Hide Failures
97% overall can mask one failing document type.
The Headline Number Lies
Your extraction pipeline reports 97% accuracy. The dashboard is green, the stakeholders are happy, and someone proposes turning off human review entirely.
Stop. A single aggregate number is one of the most dangerous artifacts in a production Claude system. That 97% is an average over a mixed population. Averages are excellent at smoothing away exactly the failures that hurt you most.
In this lesson you'll learn why aggregate-only accuracy is a documented anti-pattern, and what to measure instead before you automate human oversight away.
Anatomy of a Misleading Average
Imagine your pipeline processes three document types in equal volume. The blended score is 97%. Looks uniform, right?
But a blended number counts every document equally, so it reflects volume, not risk. With equal volumes, one weak type is diluted by two strong ones; with skewed volumes, a small, high-stakes document type can be drowned out entirely. The aggregate tells you nothing about where the 3% of errors land, and in practice errors are rarely spread evenly.
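To make the volume weighting concrete, here is a minimal sketch with made-up numbers. The document types, volumes, and accuracies are illustrative assumptions, not measurements. It shows how a low-volume type can fail badly while the blended figure barely moves.

```python
# Hypothetical volumes and per-type accuracies, chosen only for illustration.
segments = {
    "invoices":  {"n": 9000, "accuracy": 0.99},
    "receipts":  {"n": 900,  "accuracy": 0.98},
    "contracts": {"n": 100,  "accuracy": 0.60},  # small, high-stakes type
}

total = sum(s["n"] for s in segments.values())

# The aggregate is the volume-weighted mean of the per-type accuracies.
aggregate = sum(s["n"] * s["accuracy"] for s in segments.values()) / total

# Each type's weight in the aggregate is simply its share of the volume.
weights = {name: s["n"] / total for name, s in segments.items()}

print(f"aggregate = {aggregate:.1%}")
for name, s in segments.items():
    print(f"{name:<10} weight={weights[name]:.1%} accuracy={s['accuracy']:.1%}")
```

In this invented scenario the contracts type carries a weight of only 1%, so even at 60% accuracy it pulls the aggregate down by well under half a percentage point.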
# A 97% aggregate that hides one weak document type
docs = {
    "invoices": {"n": 1000, "correct": 990},   # 99.0%
    "receipts": {"n": 1000, "correct": 985},   # 98.5%
    "contracts": {"n": 1000, "correct": 935},  # 93.5%
}
total = sum(d["n"] for d in docs.values())
hits = sum(d["correct"] for d in docs.values())
print(f"aggregate = {hits/total:.1%}")  # 97.0%, which hides contracts
# Per-type breakdown: the view the aggregate conceals
for name, d in docs.items():
    print(name, f"{d['correct']/d['n']:.1%}")
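One way to act on the per-type breakdown is to gate any reduction in human review on the weakest type rather than on the blend. The sketch below is only an illustration: the `types_below_threshold` helper and the 97% bar are hypothetical choices, and the counts repeat the example above.

```python
def types_below_threshold(docs, threshold):
    """Return {type_name: accuracy} for each document type under the threshold."""
    failing = {}
    for name, d in docs.items():
        accuracy = d["correct"] / d["n"]
        if accuracy < threshold:
            failing[name] = accuracy
    return failing


# Same counts as the example above.
docs = {
    "invoices": {"n": 1000, "correct": 990},
    "receipts": {"n": 1000, "correct": 985},
    "contracts": {"n": 1000, "correct": 935},
}

# Hypothetical bar: every type must individually reach 97%.
failing = types_below_threshold(docs, threshold=0.97)

if failing:
    for name, accuracy in failing.items():
        print(f"keep human review for {name}: {accuracy:.1%}")
else:
    print("every document type clears the bar")
```

With these counts only contracts falls under the bar, even though the aggregate sits exactly at 97%. The right threshold depends on the cost of an error in each document type, which this sketch does not model.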