AI Engineering Academy · Lesson

Why Two-Stage Retrieval Works

Understand the recall-precision trade-off in single-stage retrieval and how a fast coarse retriever followed by a slow but accurate re-ranker gets the best of both worlds.

The Recall-Precision Trade-off in Retrieval

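Every retrieval system faces a fundamental trade-off. Recall measures the fraction of all relevant documents that you retrieve (did you miss any?), while precision measures the fraction of retrieved documents that are actually relevant (how clean are the top results?). Maximizing both in a single stage is computationally expensive: a model cheap enough to scan the whole corpus tends to rank coarsely, and a model accurate enough to rank well is too slow to run on every document. Fast retrievers therefore favor recall over precision, while precise rankers give up speed for accuracy.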
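To make the two metrics concrete, here is a minimal sketch that computes precision and recall at a cutoff k. The ranking and the set of relevant IDs are invented for illustration, and the function names `precision_at_k` and `recall_at_k` are this example's own, not taken from any particular library.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical ranking from a retriever and a hand-labeled relevant set.
ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d3"}

p_at_3 = precision_at_k(ranked, relevant, 3)  # d3 and d1 are hits: 2/3
r_at_3 = recall_at_k(ranked, relevant, 3)     # 2 of 3 relevant docs found
p_at_6 = precision_at_k(ranked, relevant, 6)  # 3 hits in 6 results: 0.5
r_at_6 = recall_at_k(ranked, relevant, 6)     # all 3 relevant docs found: 1.0
```

In this toy ranking, widening the cutoff from 3 to 6 raises recall from 2/3 to 1.0 but lowers precision from 2/3 to 0.5, which is the trade-off in miniature.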

Bi-Encoder vs Cross-Encoder: The Core Distinction

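The two types of models at the heart of two-stage retrieval differ in how they see the query and document. A bi-encoder encodes the query and each document independently and measures similarity between their vectors. Because document vectors can be computed ahead of time and indexed, this is fast, but the model never sees the query and document together, which limits how finely it can judge relevance. A cross-encoder takes the query and document concatenated as a single input, enabling deep interaction between them. This is highly accurate, but it needs one full forward pass per (query, document) pair at query time, so its cost grows linearly with the number of candidates and nothing can be precomputed.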

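# Stage 1 (bi-encoder): embed the query ONCE, compare against precomputed doc embeddings
# One query encoding + approximate nearest-neighbor (ANN) lookup, typically sub-linear in corpus size = fast
query_vec = embed(query)  # done once per query
candidates = vector_index.search(query_vec, top_k=100)  # fast ANN search

# Stage 2 (cross-encoder): re-score each (query, doc) pair jointly
# One forward pass per candidate, so O(k) passes for k candidates = slow but accurate
scored = [(cross_encoder.score(query, doc.text), doc) for doc in candidates]
# Keep the scores and sort by them; the final ranking comes from stage 2
reranked = [doc for _, doc in sorted(scored, key=lambda pair: pair[0], reverse=True)]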
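The snippet above relies on `embed`, `vector_index`, and `cross_encoder` being defined elsewhere. As a rough, self-contained illustration of the same flow, the sketch below swaps in toy stand-ins: a bag-of-words count vector plays the bi-encoder, brute-force cosine similarity plays the ANN index, and a word-order-aware bigram counter plays the cross-encoder. None of these are real models, and the documents, IDs, and function names are all hypothetical. The point is only to show a coarse stage that ignores word order being corrected by a joint scorer that sees both texts at once.

```python
import math
from collections import Counter

DOCS = {
    "csv_to_json": "convert csv to json",
    "json_to_csv": "convert json to csv with python pandas",
    "parse_json": "parse json files in python",
    "garden": "garden hose repair tips",
}


def embed(text):
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def joint_score(query, doc_text):
    """Toy stand-in for a cross-encoder: sees both texts at once and
    counts query bigrams that occur in the document in the same order."""
    q, d = query.lower().split(), doc_text.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return sum(1 for pair in zip(q, q[1:]) if pair in doc_bigrams)


# Offline: document vectors are computed once, before any query arrives.
DOC_VECS = {doc_id: embed(text) for doc_id, text in DOCS.items()}


def two_stage_search(query, top_k=3):
    query_vec = embed(query)  # one query encoding
    # Stage 1: brute-force similarity here; a real system would use an ANN index.
    coarse = sorted(DOCS, key=lambda i: cosine(query_vec, DOC_VECS[i]), reverse=True)[:top_k]
    # Stage 2: one joint scoring call per surviving candidate.
    reranked = sorted(coarse, key=lambda i: joint_score(query, DOCS[i]), reverse=True)
    return coarse, reranked


coarse, reranked = two_stage_search("convert json to csv", top_k=3)
print("stage 1:", coarse)
print("stage 2:", reranked)
```

For this query, stage 1 ranks `csv_to_json` first because it contains exactly the same words as the query, just in the opposite order. Stage 2 promotes `json_to_csv` because it scores the pair jointly and can see the word order. The joint scorer runs only on the `top_k` candidates that survive stage 1, not on the whole corpus.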
