Calibrating Judge Models Against Humans
Collect a ground-truth dataset of human preference judgments, measure judge-human agreement with Cohen's Kappa, and prompt-tune the judge to reduce systematic biases.
Why Calibration Is Non-Negotiable
An uncalibrated LLM judge may systematically score outputs higher or lower than humans, prefer a specific writing style, or fail on edge cases your rubric did not anticipate. If you use such a judge to make deployment decisions, you are trusting a biased instrument. Calibration validates the judge against human ground truth and quantifies how much you can trust its scores before relying on them in production.
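To make "quantifies how much you can trust its scores" concrete, here is a minimal sketch of unweighted Cohen's kappa between human consensus scores and judge scores. The function name `cohens_kappa` and the two score lists are made up for illustration; they are not data from this course. Kappa corrects raw agreement for the agreement you would expect by chance given how often each rater uses each score.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(rater_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal frequency per label.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and this sketch chooses to report perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores on a 1-5 scale, for illustration only.
human_consensus = [5, 4, 3, 1, 2, 4, 3, 1]
judge_scores = [5, 4, 4, 1, 2, 5, 3, 2]

kappa = cohens_kappa(human_consensus, judge_scores)
print(f"judge-human kappa: {kappa:.3f}")
```

Unweighted kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 disagreement. For ordinal rubric scores a weighted variant may be a better fit; if scikit-learn is available, `sklearn.metrics.cohen_kappa_score` accepts a `weights` argument for that.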
Building a Calibration Dataset
Create a calibration set of 100-500 example responses that have been scored by humans. Aim for diversity: include high-quality responses (score 5), medium-quality responses (score 3), and clearly bad responses (score 1). Have 3-5 independent human raters score each example so that you can measure inter-rater agreement, then compute a consensus ground-truth score by taking the mean or the mode of their scores.
# Calibration dataset structure:
# [
# {
# 'question': 'What causes inflation?',
# 'response': '...',
# 'human_scores': [4, 4, 3, 5, 4], # 5 raters
# 'human_consensus': 4, # mean or mode
# 'rater_ids': ['r1', 'r2', 'r3', 'r4', 'r5']
# },
# ...
# ]
# Typical calibration set composition:
# 30% excellent responses (score 4-5)
# 40% adequate responses (score 3)
# 30% poor responses (score 1-2)
# Include adversarial / edge cases
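The consensus step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed implementation: the helper name `consensus_score`, the rounding of the mean back onto the integer scale, and the lowest-value tie-break for the mode are all choices made here. The record reuses the example fields from the structure above.

```python
from statistics import mean, multimode


def consensus_score(human_scores, method="mean"):
    """Collapse several raters' scores for one example into a single label."""
    if not human_scores:
        raise ValueError("need at least one score")
    if method == "mean":
        # Assumption: round so the consensus stays on the integer 1-5 scale.
        # Note that Python's round() rounds exact halves to the nearest even integer.
        return round(mean(human_scores))
    if method == "mode":
        # multimode returns every most-common value; on a tie this sketch
        # picks the lowest score (an arbitrary, conservative choice).
        return min(multimode(human_scores))
    raise ValueError(f"unknown method: {method!r}")


example = {
    "question": "What causes inflation?",
    "response": "...",
    "human_scores": [4, 4, 3, 5, 4],
    "rater_ids": ["r1", "r2", "r3", "r4", "r5"],
}

mean_consensus = consensus_score(example["human_scores"], method="mean")
mode_consensus = consensus_score(example["human_scores"], method="mode")

# A crude disagreement signal: the gap between the highest and lowest rating.
spread = max(example["human_scores"]) - min(example["human_scores"])

print(mean_consensus, mode_consensus, spread)
```

Examples with a large spread are worth a second look before they enter the calibration set, since a noisy human label puts a ceiling on how well any judge can appear to agree with it.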