LLM-as-a-Judge Pitfalls
Judges are biased toward verbose answers, are unreliable when grading their own outputs, and need careful calibration.
Why LLM Judges?
For open-ended outputs (essays, code reviews, conversational answers), there is no single correct answer. Hiring humans to grade thousands of outputs is expensive.
LLM-as-a-Judge — using a strong model to score outputs — fills the gap.
Basic LLM Judge
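# Assumes the template is filled with str.format(); literal JSON braces are
# doubled so that only {question} and {answer} are treated as placeholders.
judge_prompt = '''
You are an evaluator. On a scale of 1-5, rate the following answer
for accuracy and helpfulness. Return only JSON: {{"score": <int>, "reason": "<str>"}}.
Question: {question}
Answer: {answer}
'''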
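A prompt like the one above still has to be filled in, sent to a model, and its reply parsed. The following is a minimal sketch of that loop, not a definitive implementation. It assumes the template is filled with str.format() (hence the doubled braces around the literal JSON) and that the judge model is reachable through a caller-supplied function. The names JUDGE_PROMPT, parse_judgment, judge, call_model, and fake_model are hypothetical and chosen for illustration; fake_model is a stub standing in for a real model call.

```python
import json
import re
from typing import Callable

# Same wording as the basic judge prompt; literal braces are doubled for str.format().
JUDGE_PROMPT = '''
You are an evaluator. On a scale of 1-5, rate the following answer
for accuracy and helpfulness. Return only JSON: {{"score": <int>, "reason": "<str>"}}.
Question: {question}
Answer: {answer}
'''


def parse_judgment(raw: str) -> dict:
    """Extract the JSON object from the judge's reply and validate the score."""
    # Judges often wrap the JSON in extra prose, so search for the braces.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(match.group(0))  # JSONDecodeError is a ValueError subclass
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score missing, not an int, or outside 1-5: {score!r}")
    return {"score": score, "reason": str(data.get("reason", ""))}


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> dict:
    """Fill the template, query the judge model, and return the parsed verdict."""
    prompt = JUDGE_PROMPT.format(question=question, answer=answer)
    return parse_judgment(call_model(prompt))


def fake_model(prompt: str) -> str:
    """Stub judge (assumption): replace with a real model client."""
    return 'Here is my rating: {"score": 4, "reason": "Accurate but terse."}'


result = judge("What is 2 + 2?", "4", fake_model)
print(result)
```

Validating the score range at parse time means a malformed or out-of-scale reply fails loudly instead of silently skewing aggregate scores.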