LLM-as-a-Judge Pitfalls

LLM judges tend to favor verbose answers, rate their own model's outputs too generously, and need careful calibration before their scores can be trusted.

Why LLM Judges?

For open-ended outputs (essays, code reviews, conversational answers), there is no single correct answer. Hiring humans to grade thousands of outputs is expensive.

LLM-as-a-Judge — using a strong model to score outputs — fills the gap.

Basic LLM Judge

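# Template intended for str.format(question=..., answer=...).
# The literal JSON braces are doubled so format() does not read them as fields.
judge_prompt = '''
You are an evaluator. On a scale of 1-5, rate the following answer
for accuracy and helpfulness. Return JSON: {{"score": int, "reason": str}}.

Question: {question}
Answer: {answer}
'''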
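A prompt alone does not make a judge: the reply still has to be parsed and checked before the score is used. The following is a minimal sketch of that step, not a definitive implementation. `JUDGE_TEMPLATE`, `judge`, `call_model`, and `fake_model` are hypothetical names introduced here for illustration; `call_model` stands in for whatever client actually sends the prompt to a model and returns its raw text.

```python
import json
from typing import Callable

# Hypothetical template mirroring the lesson's prompt. JSON braces are doubled
# so that str.format() leaves them as literal characters.
JUDGE_TEMPLATE = (
    "You are an evaluator. On a scale of 1-5, rate the following answer\n"
    "for accuracy and helpfulness. "
    'Return JSON: {{"score": int, "reason": str}}.\n\n'
    "Question: {question}\n"
    "Answer: {answer}\n"
)


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> dict:
    """Fill the template, query the judge model, and validate its reply.

    `call_model` is an assumed helper: it takes a prompt string and returns
    the model's raw text output.
    """
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    raw = call_model(prompt)

    try:
        result = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {raw!r}") from exc

    if not isinstance(result, dict):
        raise ValueError(f"judge returned JSON that is not an object: {raw!r}")

    score = result.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score is not an integer in 1-5: {score!r}")

    return {"score": score, "reason": str(result.get("reason", ""))}


# Stand-in for a real model call, used only to exercise the parsing logic.
def fake_model(prompt: str) -> str:
    return '{"score": 4, "reason": "Accurate but omits edge cases."}'


verdict = judge("What is 2 + 2?", "4", fake_model)
print(verdict["score"], verdict["reason"])
```

Rejecting malformed or out-of-range replies, rather than silently coercing them, keeps parsing failures visible instead of letting them blend into the score distribution.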
