AI Engineering Academy · Lesson

Generation Metrics: Faithfulness and Answer Relevance

Use RAGAS to measure whether generated answers are faithful to the retrieved context, with no hallucinated claims, and whether they actually address the user's question.

Measuring the Generation Stage

Even when your retriever finds the perfect chunks, the generation stage can still fail. The LLM might ignore the retrieved context and answer from its parametric memory, misinterpret what the context says, or answer a slightly different question from the one that was asked. Generation metrics quantify these failures independently of retrieval, so you can tell whether a bad answer comes from the retriever or the generator and fix the right stage. The two primary generation metrics are faithfulness and answer relevance.

Faithfulness: Definition

Faithfulness measures whether every claim in the generated answer can be directly traced back to the retrieved context. A faithful answer introduces no information that is not present in the context. Faithfulness is measured at the claim level: the answer is decomposed into individual atomic statements, and each is checked against the context for support. The faithfulness score is the fraction of claims that are supported.

# Faithfulness = supported_claims / total_claims

example_answer = (
    'Employees receive 15 vacation days per year. '
    'Remote work is allowed on Wednesdays and Fridays. '
    'The CEO is John Smith.'
)
example_context = (
    '15 vacation days are granted annually. '
    'Remote work is permitted on Wednesdays and Fridays.'
)

# Claim 1: 15 vacation days — SUPPORTED
# Claim 2: Remote work Wed+Fri — SUPPORTED
# Claim 3: CEO is John Smith — NOT IN CONTEXT (hallucinated)
# Faithfulness = 2/3 ≈ 0.67
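
The arithmetic in the worked example above can be sketched in a few lines of Python. This is a minimal illustration, not the RAGAS implementation: the `ClaimVerdict` class and `faithfulness_score` function are hypothetical names introduced here, and the supported/unsupported verdicts are labeled by hand, whereas an evaluation tool would normally obtain them from an LLM judge. Raising an error on an empty claim list is also an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    """One atomic claim from the answer plus a hand-assigned verdict (hypothetical helper)."""
    claim: str
    supported: bool  # True if the retrieved context supports the claim


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Return supported_claims / total_claims for a list of claim verdicts."""
    if not verdicts:
        # Assumption for this sketch: an answer with no claims cannot be scored.
        raise ValueError("no claims to score")
    supported = sum(1 for v in verdicts if v.supported)
    return supported / len(verdicts)


# Verdicts copied from the worked example: two supported claims, one hallucinated.
verdicts = [
    ClaimVerdict("Employees receive 15 vacation days per year.", True),
    ClaimVerdict("Remote work is allowed on Wednesdays and Fridays.", True),
    ClaimVerdict("The CEO is John Smith.", False),
]

score = faithfulness_score(verdicts)
print(f"Faithfulness = {score:.2f}")  # prints "Faithfulness = 0.67"
```

Keeping the verdicts separate from the scoring step makes it easy to see which claim pulled the score down: here the unsupported CEO claim alone accounts for the drop from 1.0 to about 0.67.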
