AI Prompt Engineering · Lesson

Comparative Judging: A vs B

Pairwise comparison prompts to rank outputs without absolute scoring.

What Is Pairwise (A vs B) Evaluation?

Pairwise evaluation (also called comparative judging) presents the judge with two responses to the same question and asks which is better. Instead of rating a single response 1-5, the judge makes a relative judgment: A is better, B is better, or they are tied.

This approach sidesteps some biases of absolute scoring, such as a judge applying the 1-5 scale inconsistently from one response to the next, and often produces more reliable rankings than per-response absolute scores.
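To make the ranking step concrete, here is a minimal sketch of turning pairwise verdicts into a ranking by counting wins across every pair. The `rank_by_wins` helper, the half point for a tie, and the toy `longer_wins` judge are illustrative assumptions rather than part of this lesson; any callable that returns 'A', 'B', or 'TIE' could stand in for the judge.

```python
from itertools import combinations

def rank_by_wins(candidates, judge):
    """Rank candidates by round-robin pairwise wins.

    candidates: dict mapping a name to a response string.
    judge: callable(response_a, response_b) -> 'A', 'B', or 'TIE'.
    A win scores 1 point; a tie gives each side 0.5 (an assumed convention).
    """
    scores = {name: 0.0 for name in candidates}
    for name_a, name_b in combinations(candidates, 2):
        verdict = judge(candidates[name_a], candidates[name_b])
        if verdict == 'A':
            scores[name_a] += 1
        elif verdict == 'B':
            scores[name_b] += 1
        elif verdict == 'TIE':
            scores[name_a] += 0.5
            scores[name_b] += 0.5
        else:
            raise ValueError(f'unexpected verdict: {verdict!r}')
    # Highest score first; sorted() is stable, so equal scores keep input order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def longer_wins(response_a, response_b):
    """Toy stand-in for an LLM judge: prefers the longer response."""
    if len(response_a) == len(response_b):
        return 'TIE'
    return 'A' if len(response_a) > len(response_b) else 'B'

outputs = {
    'short': 'ML is AI.',
    'medium': 'Machine learning is a subset of AI.',
    'long': 'Machine learning is a method of data analysis that automates model building.',
}
ranking = rank_by_wins(outputs, longer_wins)
print(ranking)
```

Win counting is the simplest aggregation and needs one judgment per pair, so the number of judge calls grows quadratically with the number of candidates.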

Basic Pairwise Judge Prompt

The simplest pairwise judge asks which response is better and why. The key is forcing a choice: don't let the judge avoid the comparison with a vague 'both are good'.

import anthropic
import json

# Placeholder key; in practice, omit api_key and set the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic(api_key='sk-ant-...')

PAIRWISE_PROMPT = (
    'Given the same question, compare these two responses and decide which is better.\n\n'
    'Question: {question}\n\n'
    'Response A:\n{response_a}\n\n'
    'Response B:\n{response_b}\n\n'
    'Which response is better? You must pick A, B, or TIE (use TIE only if '
    'they are truly equal in all meaningful ways).\n\n'
    'Return only JSON, with no other text: {{"winner": "A" or "B" or "TIE", '
    '"reason": "<one sentence justifying the verdict>"}}'
)

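def pairwise_judge(question, response_a, response_b):
    """Ask the model which response is better and return the parsed JSON verdict."""
    prompt = PAIRWISE_PROMPT.format(
        question=question,
        response_a=response_a,
        response_b=response_b
    )
    r = client.messages.create(
        model='claude-opus-4-5',
        max_tokens=150,  # room for the verdict plus a one-sentence reason
        messages=[{'role': 'user', 'content': prompt}]
    )
    # json.loads raises json.JSONDecodeError if the reply is not pure JSON
    return json.loads(r.content[0].text)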

result = pairwise_judge(
    'What is machine learning?',
    'Machine learning is a subset of AI.',
    'Machine learning is a method of data analysis that automates model building.'
)
print(result)
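Calling `json.loads` directly on the reply fails if the model wraps the JSON in extra text, and it does not check that the winner is one of the allowed values. A minimal sketch of a stricter parser follows; `parse_verdict` and `VALID_WINNERS` are hypothetical names, and the sketch assumes the reply contains exactly one JSON object.

```python
import json

VALID_WINNERS = {'A', 'B', 'TIE'}

def parse_verdict(raw_text):
    """Extract and validate the judge's JSON verdict from raw model text.

    Assumes the reply contains exactly one JSON object, possibly
    surrounded by extra text such as a Markdown code fence.
    """
    start = raw_text.find('{')
    end = raw_text.rfind('}')
    if start == -1 or end < start:
        raise ValueError('no JSON object found in judge reply')
    verdict = json.loads(raw_text[start:end + 1])
    # Normalize case and whitespace before checking against the allowed set.
    winner = str(verdict.get('winner', '')).strip().upper()
    if winner not in VALID_WINNERS:
        raise ValueError(f'invalid winner: {verdict.get("winner")!r}')
    return {'winner': winner, 'reason': verdict.get('reason', '')}

# Simulated reply wrapped in a Markdown code fence, with a lowercase winner.
fence = '`' * 3
fenced_reply = fence + 'json\n{"winner": "b", "reason": "B explains the mechanism."}\n' + fence
print(parse_verdict(fenced_reply))
```

Rejecting anything outside A, B, or TIE keeps a vague non-answer from silently entering the results; the caller can retry or log the failure instead.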
