AI Engineering Academy · Lesson

Pointwise and Pairwise Evaluation

Implement pointwise scoring where the judge rates a single response on a rubric, and pairwise comparison where it picks the better of two responses for A/B testing.

Two Ways to Evaluate LLM Output

There are two fundamental approaches to LLM evaluation: pointwise scoring and pairwise comparison. Pointwise scoring assigns an absolute score to a single response against a rubric. Pairwise comparison asks which of two responses to the same prompt is better. Each has strengths: pointwise yields absolute quality numbers that can be tracked over time, while pairwise is more sensitive to subtle quality differences and suits A/B testing of model versions.
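
To make the pairwise side concrete, here is a minimal sketch of turning per-pair verdicts into a win rate for an A/B test. Everything in it is an assumption for illustration: the `Judge` signature, the helper names, and the toy `longer_is_better` judge that stands in for a real LLM call. It also judges each pair in both orders and keeps a verdict only when the two runs agree, a common precaution against a judge favoring whichever response it sees first.

```python
from collections import Counter
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

# Hypothetical judge signature: (prompt, first_response, second_response) -> verdict,
# where "A" means the first response shown won and "B" means the second did.
Judge = Callable[[str, str, str], Verdict]


def compare_pair(judge: Judge, prompt: str, resp_a: str, resp_b: str) -> Verdict:
    """Judge a pair in both orders; keep the verdict only if both runs agree."""
    first = judge(prompt, resp_a, resp_b)
    swapped = judge(prompt, resp_b, resp_a)
    # Map the swapped run back onto the original A/B labels.
    swapped_back = {"A": "B", "B": "A", "tie": "tie"}[swapped]
    return first if first == swapped_back else "tie"


def win_rate_b(verdicts: list[Verdict]) -> float:
    """Share of comparisons won by B, counting each tie as half a win."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    counts = Counter(verdicts)
    return (counts["B"] + 0.5 * counts["tie"]) / len(verdicts)


def longer_is_better(prompt: str, first: str, second: str) -> Verdict:
    """Toy stand-in for an LLM judge: prefers the longer response."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


verdict = compare_pair(longer_is_better, "q", "short", "a longer answer")
rate = win_rate_b(["A", "B", "B", "tie"])
```

A win rate near 0.5 suggests the judge sees no clear difference between the two versions. Counting ties as half a win is one convention among several; dropping ties from the denominator is another.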

Pointwise Evaluation: Absolute Scoring

In pointwise evaluation, the judge assigns a score on a fixed scale (typically 1-5 or 1-10) for one or more quality dimensions, such as correctness, helpfulness, clarity, and safety. Each score is independent of other responses: a 4/5 is meant to indicate the same quality regardless of what other answers exist. As long as the rubric and the judge model stay fixed, this makes pointwise scores comparable across time, models, and prompt versions.

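from typing import Literal

from pydantic import BaseModel, Field


class PointwiseScore(BaseModel):
    """Judge output for a single response, one field per rubric dimension."""

    correctness: int = Field(ge=1, le=5, description='Factual accuracy 1-5')
    helpfulness: int = Field(ge=1, le=5, description='Does it answer the question 1-5')
    clarity: int = Field(ge=1, le=5, description='Easy to understand 1-5')
    # Safety is binary: only the values 1 and 5 pass validation.
    safety: Literal[1, 5] = Field(description='1=unsafe content, 5=safe')
    # Not used by composite_score below.
    overall: int = Field(ge=1, le=5)
    rationale: str

    @property
    def composite_score(self) -> float:
        """Weighted sum: 0.4 correctness, 0.3 helpfulness, 0.2 clarity, 0.1 safety."""
        # A safe response earns the full 5 safety points; an unsafe one earns 0.
        safety_points = 5 if self.safety == 5 else 0
        return (
            self.correctness * 0.4
            + self.helpfulness * 0.3
            + self.clarity * 0.2
            + safety_points * 0.1
        )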
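
As an illustration of tracking these scores across prompt versions, the sketch below averages composite scores per version. It is stdlib-only so it runs on its own: `composite` is a plain function that mirrors the `composite_score` weighting above, `mean_by_version` is a hypothetical helper, and the version labels and scores are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def composite(correctness: int, helpfulness: int, clarity: int, safe: bool) -> float:
    """Mirror of composite_score: 0.4 / 0.3 / 0.2 weights, plus 0.5 if safe."""
    safety_points = 5 if safe else 0
    return correctness * 0.4 + helpfulness * 0.3 + clarity * 0.2 + safety_points * 0.1


def mean_by_version(records: list[tuple[str, float]]) -> dict[str, float]:
    """Average composite score per version label, rounded to 2 decimals."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for version, score in records:
        grouped[version].append(score)
    return {version: round(mean(scores), 2) for version, scores in grouped.items()}


# Made-up judge scores for two hypothetical prompt versions.
records = [
    ("prompt-v1", composite(3, 4, 4, True)),
    ("prompt-v1", composite(4, 3, 5, True)),
    ("prompt-v2", composite(5, 4, 4, True)),
    ("prompt-v2", composite(4, 5, 2, False)),
]
summary = mean_by_version(records)
```

With these weights the composite ranges from 0.9 to 5.0, and an unsafe response loses only 0.5 points. If unsafe output should never rank well, one option is to treat safety as a hard gate rather than a weighted term.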
