Pointwise and Pairwise Evaluation
Implement pointwise scoring where the judge rates a single response on a rubric, and pairwise comparison where it picks the better of two responses for A/B testing.
Two Ways to Evaluate LLM Output
There are two fundamental approaches to LLM evaluation: pointwise scoring and pairwise comparison. Pointwise assigns an absolute score to a single response on a rubric. Pairwise asks which of two responses is better. Each has strengths: pointwise gives absolute quality numbers useful for tracking over time; pairwise better captures subtle quality differences and is used for A/B testing model versions.
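To make the A/B testing use case concrete, here is a minimal sketch of how pairwise verdicts might be recorded and summarized as a win rate. The `PairwiseVerdict` record, the `win_rate` helper, and the choice to count a tie as half a win are illustrative assumptions, not part of the lesson's own code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class PairwiseVerdict:
    """Hypothetical record of one judge comparison between responses A and B."""
    winner: Literal["A", "B", "tie"]
    rationale: str


def win_rate(verdicts: list[PairwiseVerdict], side: str = "A") -> float:
    """Fraction of comparisons won by `side`; each tie counts as half a win (assumption)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(v.winner for v in verdicts)
    return (counts[side] + 0.5 * counts["tie"]) / len(verdicts)


# Made-up verdicts for illustration only.
verdicts = [
    PairwiseVerdict("A", "more complete answer"),
    PairwiseVerdict("B", "A contained a factual error"),
    PairwiseVerdict("A", "clearer structure"),
    PairwiseVerdict("tie", "equivalent answers"),
]
print(win_rate(verdicts, "A"))  # 0.625
print(win_rate(verdicts, "B"))  # 0.375
```

A win rate near 0.5 suggests the judge sees little difference between the two versions. Note that the number is only meaningful relative to the specific pair being compared, unlike a pointwise score.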
Pointwise Evaluation: Absolute Scoring
In pointwise evaluation, the judge assigns a score on a fixed scale (typically 1-5 or 1-10) for one or more quality dimensions, such as correctness, helpfulness, clarity, and safety. Each score is assigned independently of other responses: a 4/5 is meant to indicate the same quality regardless of what other answers exist. As long as the rubric and the judge stay fixed, this makes pointwise scores comparable across time, models, and prompt versions.
from typing import Literal

from pydantic import BaseModel, Field


class PointwiseScore(BaseModel):
    correctness: int = Field(ge=1, le=5, description='Factual accuracy 1-5')
    helpfulness: int = Field(ge=1, le=5, description='Does it answer the question 1-5')
    clarity: int = Field(ge=1, le=5, description='Easy to understand 1-5')
    safety: Literal[1, 5] = Field(description='1=unsafe content, 5=safe')
    overall: int = Field(ge=1, le=5)  # judge's holistic score; not used in composite_score
    rationale: str

    @property
    def composite_score(self) -> float:
        # Weighted sum with weights 0.4/0.3/0.2/0.1. Safety is binary:
        # a safe response contributes 5 * 0.1, an unsafe one contributes 0.
        safety_points = 5 if self.safety == 5 else 0
        return (self.correctness * 0.4 + self.helpfulness * 0.3
                + self.clarity * 0.2 + safety_points * 0.1)
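Because pointwise scores are absolute, they can be averaged per model or prompt version and tracked over time. The sketch below uses only the standard library and restates the composite weighting as a plain function so that it runs on its own. The `composite` and `mean_composite_by_version` helpers, the version labels, and the sample scores are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def composite(correctness: int, helpfulness: int, clarity: int, safe: bool) -> float:
    """Same weighting as composite_score: 0.4/0.3/0.2, plus 5 * 0.1 if safe."""
    return correctness * 0.4 + helpfulness * 0.3 + clarity * 0.2 + (5 if safe else 0) * 0.1


def mean_composite_by_version(records):
    """Group (version, composite) pairs and return the mean per version, rounded to 2 places."""
    grouped = defaultdict(list)
    for version, score in records:
        grouped[version].append(score)
    return {version: round(mean(scores), 2) for version, scores in grouped.items()}


# Made-up judge scores for two hypothetical prompt versions.
records = [
    ("prompt-v1", composite(4, 3, 4, True)),
    ("prompt-v1", composite(3, 3, 3, True)),
    ("prompt-v2", composite(5, 4, 4, True)),
    ("prompt-v2", composite(4, 5, 3, False)),  # unsafe: no safety contribution
]
print(mean_composite_by_version(records))  # {'prompt-v1': 3.5, 'prompt-v2': 4.1}
```

Comparing these per-version means is only sound if the same rubric and judge produced every score.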