AI Engineering Academy · Lesson

Retrieval Metrics: Hit Rate, MRR, and NDCG

Build a golden dataset of queries and relevant documents, then compute hit rate, mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG) to measure how often your retriever finds the right chunks and how high it ranks them.

Why Different Retrieval Metrics?

Hit rate tells you whether a relevant chunk appeared anywhere in the top-K results, but it does not tell you where in the ranking it appeared. A system that always puts the best chunk at rank 5 is worse than one that consistently puts it at rank 1, even if both have the same hit rate. More nuanced metrics like MRR and NDCG capture ranking quality, rewarding systems that put the most relevant chunks at the top where the LLM and users are most likely to use them.
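
To make the rank-1 versus rank-5 comparison concrete, here is a minimal sketch. It assumes binary relevance and the standard definitions: reciprocal rank is 1 divided by the rank of the first relevant result, and NDCG divides the discounted gain of the ranking by that of an ideal ranking, using a log2 discount. The function names and chunk IDs are hypothetical, chosen only for illustration.

```python
import math

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, rid in enumerate(retrieved_ids, start=1):
        if rid in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved_ids, relevant_ids, k=5):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, rid in enumerate(retrieved_ids[:k], start=1)
        if rid in relevant_ids
    )
    # The ideal ranking places every relevant chunk (up to k) at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

relevant = {"c1"}
system_a = ["c1", "x2", "x3", "x4", "x5"]  # relevant chunk at rank 1
system_b = ["x2", "x3", "x4", "x5", "c1"]  # relevant chunk at rank 5

for name, ranking in [("A", system_a), ("B", system_b)]:
    hit = any(rid in relevant for rid in ranking[:5])
    rr = reciprocal_rank(ranking, relevant)
    ndcg = ndcg_at_k(ranking, relevant, k=5)
    print(f"System {name}: hit@5={hit}, RR={rr:.2f}, NDCG@5={ndcg:.2f}")
```

Both systems score a hit at K=5, but system A has a reciprocal rank of 1.0 while system B has 0.2, and the NDCG gap points the same way. Averaging the reciprocal rank over every query in the golden dataset gives MRR.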

Hit Rate @K: Review and Implementation

Hit rate@K is the simplest metric: for what fraction of queries does at least one relevant chunk appear in the top-K results? It gives a binary signal per query and is easy to interpret. Compute it by checking whether the retrieved IDs intersect the known relevant IDs. K=5 is a reasonable default if your pipeline passes five chunks to the LLM; otherwise set K to the number of chunks you actually pass. Compare hit rate@1, @3, @5, and @10 to see how the score changes as you widen the retrieval window.

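def hit_rate_at_k(golden_dataset, retriever, k=5):
    """Fraction of queries with at least one relevant chunk in the top-k results."""
    if not golden_dataset:
        raise ValueError('golden_dataset is empty')
    hits = 0
    for item in golden_dataset:
        retrieved = retriever.retrieve(item['question'], top_k=k)
        # Slice defensively in case the retriever returns more than k results.
        retrieved_ids = [r['id'] for r in retrieved[:k]]
        relevant_ids = set(item['relevant_chunk_ids'])
        # A query counts as a hit if any retrieved ID is in the relevant set.
        if any(rid in relevant_ids for rid in retrieved_ids):
            hits += 1
    return hits / len(golden_dataset)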

for k in [1, 3, 5, 10]:
    hr = hit_rate_at_k(golden_dataset, retriever, k=k)
    print(f'Hit rate@{k}: {hr:.1%}')
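
The loop above calls the retriever once per value of K. A possible alternative, sketched here, retrieves once per query at the largest K and slices the result. This assumes the top-k results are a prefix of the top-max-K results, which is worth verifying if a reranker is involved. StubRetriever, hit_rates, and the toy dataset are hypothetical; only the retrieve(question, top_k) interface and the 'question', 'relevant_chunk_ids', and 'id' keys come from the code above.

```python
class StubRetriever:
    """Hypothetical stand-in that returns a fixed ranking per question."""

    def __init__(self, rankings):
        self.rankings = rankings

    def retrieve(self, question, top_k=5):
        return [{'id': cid} for cid in self.rankings[question][:top_k]]


def hit_rates(golden_dataset, retriever, ks=(1, 3, 5, 10)):
    """Hit rate at each K in ks, using one retrieval per query at max(ks)."""
    hits = {k: 0 for k in ks}
    for item in golden_dataset:
        retrieved = retriever.retrieve(item['question'], top_k=max(ks))
        ids = [r['id'] for r in retrieved]
        relevant = set(item['relevant_chunk_ids'])
        for k in ks:
            if any(rid in relevant for rid in ids[:k]):
                hits[k] += 1
    return {k: hits[k] / len(golden_dataset) for k in ks}


# Toy golden dataset: one relevant chunk per question.
golden_dataset = [
    {'question': 'q1', 'relevant_chunk_ids': ['a']},
    {'question': 'q2', 'relevant_chunk_ids': ['b']},
    {'question': 'q3', 'relevant_chunk_ids': ['c']},
    {'question': 'q4', 'relevant_chunk_ids': ['d']},
]
retriever = StubRetriever({
    'q1': ['a', 'x', 'y', 'z', 'w'],  # relevant chunk at rank 1
    'q2': ['x', 'y', 'b', 'z', 'w'],  # relevant chunk at rank 3
    'q3': ['x', 'y', 'z', 'w', 'c'],  # relevant chunk at rank 5
    'q4': ['x', 'y', 'z', 'w', 'v'],  # relevant chunk never retrieved
})

results = hit_rates(golden_dataset, retriever)
for k, hr in results.items():
    print(f'Hit rate@{k}: {hr:.1%}')
```

On this toy data the hit rate climbs from 25% at K=1 to 75% at K=5 and then stays flat at K=10, which is the kind of plateau that suggests widening the window further will not help.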
