AI Engineering Academy · Lesson

Calibrating Judge Models Against Humans

Collect a ground-truth dataset of human preference judgments, measure judge-human agreement with Cohen's Kappa, and prompt-tune the judge to reduce systematic biases.

Why Calibration Is Non-Negotiable

An uncalibrated LLM judge may systematically score outputs higher or lower than humans, prefer a specific writing style, or fail on edge cases your rubric did not anticipate. If you use such a judge to make deployment decisions, you are trusting a biased instrument. Calibration validates the judge against human ground truth and quantifies how much you can trust its scores before relying on them in production.
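As a rough illustration of what "quantifying trust" can look like, the sketch below computes unweighted Cohen's kappa between a judge and the human consensus, plus the mean signed difference as a simple check for systematic over- or under-scoring. The function name `cohens_kappa` and the score lists are hypothetical and made up for illustration; they are not data from this lesson.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores, for illustration only.
human_consensus = [5, 4, 3, 1, 2, 4, 3, 5]
judge_scores = [5, 4, 4, 1, 2, 5, 3, 5]

kappa = cohens_kappa(human_consensus, judge_scores)
# Positive mean bias means the judge scores higher than humans on average.
mean_bias = sum(j - h for j, h in zip(judge_scores, human_consensus)) / len(judge_scores)
print(f"kappa={kappa:.2f}, mean bias={mean_bias:+.2f}")
```

Unweighted kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 disagreement. For ordinal 1-5 scores a weighted variant may be a better fit, but that is a design choice this sketch does not make.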

Building a Calibration Dataset

Create a calibration set of 100-500 example responses that have been scored by humans. Aim for diversity: include high-quality responses (score 5), medium-quality responses (score 3), and clearly bad responses (score 1). Have 3-5 independent human raters score each example so that you can measure inter-rater agreement, then compute a consensus ground-truth score for each example by taking the mean or the mode of its ratings.

# Calibration dataset structure:
# [
#   {
#     'question': 'What causes inflation?',
#     'response': '...',
#     'human_scores': [4, 4, 3, 5, 4],  # 5 raters
#     'human_consensus': 4,              # mean or mode
#     'rater_ids': ['r1', 'r2', 'r3', 'r4', 'r5']
#   },
#   ...
# ]
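A minimal sketch of how the consensus field might be derived from the raw ratings in a record like the one above. The helper name `summarize_human_scores`, the `max_range` threshold, and the "needs review" flag are assumptions added for illustration; the lesson itself only specifies mean or mode.

```python
from statistics import mean, multimode, pstdev


def summarize_human_scores(example, max_range=2):
    """Derive consensus candidates and a disagreement signal for one example."""
    scores = example["human_scores"]
    modes = multimode(scores)  # all most-common values, in first-seen order
    return {
        "mean": mean(scores),
        # Report a mode only when it is unique; ties are left for a human to resolve.
        "mode": modes[0] if len(modes) == 1 else None,
        # Population standard deviation as a rough measure of rater spread.
        "spread": pstdev(scores),
        # Hypothetical rule: flag examples whose raters differ by more than max_range.
        "needs_review": max(scores) - min(scores) > max_range,
    }


example = {
    "question": "What causes inflation?",
    "response": "...",
    "human_scores": [4, 4, 3, 5, 4],
    "rater_ids": ["r1", "r2", "r3", "r4", "r5"],
}

summary = summarize_human_scores(example)
print(summary)
```

Examples with wide rater spread are worth inspecting before they enter the ground truth, since a noisy human label puts a ceiling on how well any judge can appear to agree with it.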

# Typical calibration set composition:
# 30% excellent responses (score 4-5)
# 40% adequate responses (score 3)
# 30% poor responses (score 1-2)
# Include adversarial / edge cases
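One way to check a dataset against a target composition like this is to bucket each example by its consensus score and compare the resulting fractions. The sketch below assumes non-overlapping bands of 4-5, 3, and 1-2, with fractional means assigned to the nearest band; the cutoffs and the names `band` and `composition` are illustrative assumptions.

```python
from collections import Counter


def band(consensus):
    """Map a consensus score to a quality band (cutoffs are an assumption)."""
    if consensus >= 3.5:
        return "excellent"
    if consensus >= 2.5:
        return "adequate"
    return "poor"


def composition(dataset):
    """Return the fraction of examples that fall in each quality band."""
    counts = Counter(band(ex["human_consensus"]) for ex in dataset)
    total = len(dataset)
    return {name: counts[name] / total for name in ("excellent", "adequate", "poor")}


# Hypothetical consensus scores, for illustration only.
dataset = [{"human_consensus": s} for s in [5, 5, 4, 3, 3, 3, 3, 2, 1, 1]]
fractions = composition(dataset)
print(fractions)
```

Adversarial and edge cases are not captured by score bands alone, so they would need their own tag in each record if you want to track their share as well.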
