Claude Architect · Lesson

Aggregate Metrics Hide Failures

97% overall can mask one failing document type.

The Headline Number Lies

Your extraction pipeline reports 97% accuracy. The dashboard is green, the stakeholders are happy, and someone proposes turning off human review entirely.

Stop. A single aggregate number is one of the most dangerous artifacts in a production Claude system. That 97% is an average over a mixed population. Averages are excellent at smoothing away exactly the failures that hurt you most.

In this lesson you'll learn why aggregate-only accuracy is a documented anti-pattern, and what to measure instead before you automate human oversight away.

Anatomy of a Misleading Average

Imagine your pipeline processes three document types in equal volume. The blended score is 97%. Looks uniform, right?

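But a blended number weights each document type by its volume, not by its risk. Even at equal volume, one weak type can sit several points below the others without moving the headline. At low volume, a high-stakes type can be drowned out entirely. The aggregate tells you nothing about where the 3% of errors land, and in practice errors are almost never spread evenly.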

# A 97% aggregate that hides one weak document type
docs = {
    "invoices":   {"n": 1000, "correct": 990},  # 99.0%
    "receipts":   {"n": 1000, "correct": 985},  # 98.5%
    "contracts":  {"n": 1000, "correct": 935},  # 93.5%
}

# Blended accuracy: total correct over total documents
total = sum(d["n"] for d in docs.values())
hits = sum(d["correct"] for d in docs.values())
print(f"aggregate = {hits/total:.1%}")  # 97.0%, which hides contracts at 93.5%

# Per-type accuracy shows where the errors actually land
for name, d in docs.items():
    print(name, f"{d['correct']/d['n']:.1%}")
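The example above uses equal volumes. The drowning-out effect is stronger when the high-stakes type is also rare. The sketch below is a minimal illustration under assumed numbers: the volumes, the counts, the `MIN_ACCURACY` floor, and the helper names `aggregate_accuracy` and `failing_types` are all hypothetical, not values or functions from any real pipeline. It shows one way to check each document type against a floor instead of trusting the blended score.

```python
# Hypothetical volumes and counts, chosen for illustration only.
docs = {
    "invoices":  {"n": 5000, "correct": 4900},  # 98.0%
    "receipts":  {"n": 4800, "correct": 4680},  # 97.5%
    "contracts": {"n": 200,  "correct": 120},   # 60.0%
}

MIN_ACCURACY = 0.95  # assumed per-type floor, not a recommended value


def aggregate_accuracy(docs):
    """Blended accuracy: total correct over total documents."""
    total = sum(d["n"] for d in docs.values())
    hits = sum(d["correct"] for d in docs.values())
    return hits / total


def failing_types(docs, floor):
    """Names of document types whose own accuracy is below the floor."""
    return [name for name, d in docs.items() if d["correct"] / d["n"] < floor]


aggregate = aggregate_accuracy(docs)
failing = failing_types(docs, MIN_ACCURACY)

print(f"aggregate = {aggregate:.1%}")  # aggregate = 97.0%
print("below floor:", failing)         # below floor: ['contracts']
```

Here contracts make up only 2% of the volume, so a 60% accuracy on them still leaves the headline at 97%. A per-type check like this surfaces the problem that the single number hides.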
