AI Engineering Academy · Lesson

Measuring the Impact of Re-ranking

Run a before-and-after benchmark comparing single-stage retrieval versus two-stage with re-ranking, measuring NDCG, MRR, and end-to-end answer quality.

Why Measure Re-ranking Impact?

Re-ranking adds latency and cost to your pipeline. Without measurement, you cannot tell whether the added complexity is worth it. Benchmarking quantifies the improvement in both retrieval quality and end-to-end answer quality, so you can make an informed decision. It also reveals which query types benefit most, which lets you apply re-ranking selectively rather than on every request.

Building a Golden Test Set

A reliable benchmark requires a golden test set: a collection of queries, each paired with the IDs of the documents known to be relevant to it. To build one, sample real user queries from your application logs, have the relevant documents for each query labeled by hand (ideally by someone with domain expertise), and store the results in a structured format. A test set of 50-200 queries is usually enough for RAG evaluation.

golden_test_set = [
    {
        'query': 'How does pgvector HNSW indexing improve search speed?',
        'relevant_doc_ids': ['doc_042', 'doc_107'],
    },
    {
        'query': 'What is the difference between BM25 and dense retrieval?',
        'relevant_doc_ids': ['doc_015'],
    },
    {
        'query': 'How to implement reciprocal rank fusion in Python?',
        'relevant_doc_ids': ['doc_093', 'doc_094'],
    },
    # ... 47 more entries
]

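# Summary statistics for a quick sanity check of the test set
num_queries = len(golden_test_set)
avg_relevant = sum(len(e['relevant_doc_ids']) for e in golden_test_set) / num_queries

print(f'Test set size: {num_queries} queries')
print(f'Avg relevant docs per query: {avg_relevant:.1f}')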
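
With a test set in this format, MRR and NDCG can be computed for each pipeline and compared side by side. The following is a minimal sketch, not a definitive implementation. It assumes binary relevance (a document is relevant only if its ID appears in relevant_doc_ids), no duplicate IDs in a ranking, and a retrieval function that takes a query string and returns a ranked list of document IDs. The function names, the toy queries, and the hard-coded rankings are hypothetical stand-ins for your own single-stage and two-stage pipelines. End-to-end answer quality is not covered here.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / rank of the first relevant document, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance (gain is 1 for a relevant doc, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking: all relevant documents placed at the top positions.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate(retrieve_fn, test_set, k=10):
    """Average MRR and NDCG@k of retrieve_fn over a golden test set."""
    rr_scores, ndcg_scores = [], []
    for entry in test_set:
        ranked = retrieve_fn(entry['query'])
        relevant = set(entry['relevant_doc_ids'])
        rr_scores.append(reciprocal_rank(ranked, relevant))
        ndcg_scores.append(ndcg_at_k(ranked, relevant, k))
    n = len(test_set)
    return {'mrr': sum(rr_scores) / n, f'ndcg@{k}': sum(ndcg_scores) / n}


# Toy data for illustration only: two queries and hard-coded rankings that
# stand in for real single-stage and two-stage retrieval calls.
toy_test_set = [
    {'query': 'q1', 'relevant_doc_ids': ['doc_042', 'doc_107']},
    {'query': 'q2', 'relevant_doc_ids': ['doc_015']},
]
single_stage_results = {
    'q1': ['doc_001', 'doc_042', 'doc_107'],
    'q2': ['doc_002', 'doc_003', 'doc_015'],
}
two_stage_results = {
    'q1': ['doc_042', 'doc_107', 'doc_001'],
    'q2': ['doc_015', 'doc_002', 'doc_003'],
}

baseline = evaluate(single_stage_results.get, toy_test_set)
reranked = evaluate(two_stage_results.get, toy_test_set)

for metric in baseline:
    delta = reranked[metric] - baseline[metric]
    print(f'{metric}: {baseline[metric]:.3f} -> {reranked[metric]:.3f} ({delta:+.3f})')
```

To benchmark a real system, replace the two dictionary lookups with calls into your own retrieval code and pass golden_test_set instead of the toy data. Keeping the per-query scores rather than only the averages makes it easier to see which query types gain the most from re-ranking.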
