Calibration and Bias in LLM Judges
Position bias, verbosity bias, and how to mitigate them in judge prompts.
Why Judge Calibration Matters
An LLM judge that systematically scores one type of response higher than it deserves produces misleading evaluation results. You might ship a worse model because the judge preferred its verbose style — not its actual quality.
Calibration means the judge's scores track true quality. A calibrated judge agrees with human raters at a high, measured rate and does not systematically favor attributes unrelated to quality, such as response length or position.
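Agreement with human raters can be quantified in a few lines. The sketch below is illustrative only: the function names and the sample verdicts are hypothetical, and it assumes each item has one judge label and one human label. It reports raw agreement and Cohen's kappa, which corrects agreement for chance.

```python
from collections import Counter


def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge and the human chose the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError('label lists must be non-empty and the same length')
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement: 1.0 is perfect, 0.0 is chance level."""
    n = len(judge_labels)
    observed = agreement_rate(judge_labels, human_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Probability that both raters pick the same label by chance.
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is formally
        # undefined here, so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical pairwise verdicts for eight comparisons.
judge = ['A', 'A', 'B', 'A', 'B', 'A', 'A', 'B']
human = ['A', 'B', 'B', 'A', 'B', 'B', 'A', 'B']

print(f'Raw agreement: {agreement_rate(judge, human):.2f}')
print(f"Cohen's kappa: {cohens_kappa(judge, human):.2f}")
```

Raw agreement alone can look high when one label dominates, which is why the chance-corrected figure is worth reporting alongside it.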
Position Bias: Deep Dive
Position bias is one of the strongest and most studied LLM judge biases. In pairwise comparison, judges prefer the first option 60-65% of the time, independent of quality. That is like a coin that lands heads 60-65% of the time: the tilt is small on any single comparison, but across hundreds or thousands of comparisons it shifts measured win rates.
One proposed explanation is that LLMs are trained to generate continuations, so seeing 'Response A:' first primes them toward A before they read B.
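Because the bias depends on presentation order, one widely used mitigation is to ask the judge twice with the order swapped and keep only verdicts that survive the swap. The following is a minimal sketch of that idea, not a method prescribed by this lesson. `judge_fn` is a hypothetical callable that takes `(question, first, second)` and returns 'A' or 'B', and the two stub judges exist only to show the behavior.

```python
def debiased_verdict(judge_fn, question, response_1, response_2):
    """
    Query the judge in both presentation orders.
    Returns 'response_1' or 'response_2' when both passes agree,
    and 'tie' when the verdict flips with the order.
    """
    first_pass = judge_fn(question, response_1, response_2)   # response_1 shown as A
    second_pass = judge_fn(question, response_2, response_1)  # response_1 shown as B

    for verdict in (first_pass, second_pass):
        if verdict not in ('A', 'B'):
            raise ValueError(f'unexpected judge verdict: {verdict!r}')

    prefers_1_first = first_pass == 'A'
    prefers_1_second = second_pass == 'B'

    if prefers_1_first and prefers_1_second:
        return 'response_1'
    if not prefers_1_first and not prefers_1_second:
        return 'response_2'
    return 'tie'  # the verdict depended on position, not content


# Hypothetical stub judges, used only for demonstration.
def always_first(question, first, second):
    """A maximally position-biased judge: always picks slot A."""
    return 'A'


def prefers_longer(question, first, second):
    """An order-independent judge: picks the longer response."""
    return 'A' if len(first) >= len(second) else 'B'


print(debiased_verdict(always_first, 'q', 'x', 'y'))
print(debiased_verdict(prefers_longer, 'q', 'short', 'a much longer answer'))
```

The cost is two judge calls per comparison. Treating order-dependent verdicts as ties is one design choice; another is to discard them and report how often they occur.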
import re

import anthropic

client = anthropic.Anthropic(api_key='sk-ant-...')


def measure_position_bias(question, n_pairs=20):
    """
    Measure position bias by comparing IDENTICAL responses.
    If both responses are the same, wins should be 50/50.
    A deviation from 50/50 beyond sampling noise indicates position bias.
    """
    response = 'Machine learning is a subset of AI that learns from data.'
    first_wins = 0
    for _ in range(n_pairs):
        prompt = (
            f'Which response is better?\nQ: {question}\n'
            f'Response A: {response}\n'
            f'Response B: {response}\n'
            'Reply with A or B.'
        )
        r = client.messages.create(
            model='claude-opus-4-5',
            max_tokens=5,
            messages=[{'role': 'user', 'content': prompt}]
        )
        # Count the reply only if it starts with a standalone "A";
        # a plain substring test would also match words such as "Answer".
        if re.match(r'A\b', r.content[0].text.strip()):
            first_wins += 1
    bias = first_wins / n_pairs
    print(f'First-position win rate with IDENTICAL responses: {bias:.0%}')
    print('Expected (no bias): 50%')
    print(f'Measured bias: {(bias - 0.5) * 100:+.0f}%')
    return bias
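With a small number of pairs, a win rate away from 50% can also arise from sampling noise alone. As a rough check, the sketch below computes an exact two-sided binomial p-value for the observed count of first-position wins. The function name is hypothetical, and the sketch assumes the trials are independent and every reply counts as a win for either A or B.

```python
from math import comb


def two_sided_binomial_p(first_wins, n_pairs, p=0.5):
    """
    Exact two-sided binomial test: the total probability, under win
    probability p, of all outcomes no more likely than the observed one.
    """
    probs = [
        comb(n_pairs, k) * p**k * (1 - p) ** (n_pairs - k)
        for k in range(n_pairs + 1)
    ]
    observed = probs[first_wins]
    # Small tolerance so outcomes tied with the observed one are included.
    return sum(q for q in probs if q <= observed * (1 + 1e-9))


# Same 65% first-position win rate at two sample sizes.
print(f'13 of 20:   p = {two_sided_binomial_p(13, 20):.3f}')
print(f'130 of 200: p = {two_sided_binomial_p(130, 200):.6f}')
```

At 13 wins out of 20 the p-value is about 0.26, so a 65% win rate at the default n_pairs=20 is not distinguishable from chance. The same rate over 200 pairs is far less likely under no bias, which argues for raising n_pairs before drawing conclusions.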