The LLM-as-Judge Pattern
Understand how to prompt a strong LLM to score or compare outputs on criteria like correctness, helpfulness, and tone, and why this scales better than human evaluation.
Why Automated Evaluation Is Necessary
Manually evaluating LLM outputs is slow and expensive. A team of human raters can evaluate a few hundred outputs per week, but a production system generates thousands per day. LLM-as-judge uses a powerful language model to evaluate the quality of other LLM outputs at scale, enabling automated regression testing and continuous quality monitoring without hiring an army of annotators.
The Core Idea: One LLM Evaluates Another
In LLM-as-judge, you send a system prompt defining evaluation criteria, the original question, the model's answer, and optionally a reference answer to a judge model (typically GPT-4o or Claude). The judge returns a numerical score, a label, or a ranking. This works because state-of-the-art models have sufficient understanding of quality concepts like correctness, helpfulness, and coherence.
JUDGE_SYSTEM_PROMPT = '''
You are an expert evaluator of AI-generated responses.
Given a question and an AI-generated answer, score the answer on:
- Correctness (0-5): Is the information factually accurate?
- Completeness (0-5): Does it fully address the question?
- Clarity (0-5): Is it easy to understand?
Return a JSON object with scores and a brief rationale.
'''