Re-ranking Retrieved Chunks
Cross-encoder re-ranking.
Why Re-Rank at All
First-stage retrieval (dense or sparse) optimizes for recall at scale: the goal is to get the gold chunk somewhere in the top 50. It is fast but coarse. A second-stage re-ranker then reorders that shortlist for precision, moving the most relevant chunks to the top.
This retrieve-broadly-then-rerank-precisely pattern is the backbone of advanced RAG.
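The two_stage sketch in this lesson calls first_stage_retrieve and rerank without defining them. As a rough illustration only, the toy stand-ins below make the pattern concrete. Everything here is an assumption made for the example: the four-sentence CORPUS, the tokens helper, shared-word counting as the coarse first stage, and Jaccard overlap as a placeholder for a learned re-ranker. Neither scorer is a real retriever or model.

```python
# Toy corpus (hypothetical); a real system would query a vector or BM25 index.
CORPUS = [
    "How does a document index work, and how does it score on disk usage as a corpus grows?",
    "A cross-encoder outputs a relevance score.",
    "A bi-encoder embeds the query and each document separately.",
    "Paris is the capital of France.",
]


def tokens(text):
    """Lowercased word set with surrounding punctuation stripped."""
    return {word.strip(".,?!").lower() for word in text.split()}


def first_stage_retrieve(query, k):
    """Coarse, recall-oriented stage: rank by the count of shared words."""
    q = tokens(query)
    ranked = sorted(CORPUS, key=lambda doc: len(q & tokens(doc)), reverse=True)
    return ranked[:k]


def rerank(query, candidates):
    """Placeholder for a learned re-ranker: Jaccard overlap, which
    penalizes long chunks that match the query only incidentally."""
    q = tokens(query)

    def score(doc):
        d = tokens(doc)
        return len(q & d) / len(q | d)

    return sorted(candidates, key=score, reverse=True)


query = "how does a cross-encoder score a document"
shortlist = first_stage_retrieve(query, 3)   # long, wordy chunk ranks first
best = rerank(query, shortlist)[0]           # focused chunk moves to the top
```

In this toy setup the first stage puts the long index-related sentence first because it shares the most words with the query, while the second stage promotes the short cross-encoder sentence. The off-topic sentence never reaches the shortlist.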
def two_stage(query, k_retrieve=50, k_final=5):
    candidates = first_stage_retrieve(query, k_retrieve)  # high recall
    reranked = rerank(query, candidates)                  # high precision
    return reranked[:k_final]
Bi-Encoder vs Cross-Encoder
A bi-encoder encodes the query and the document separately into vectors and compares them by cosine similarity; this is fast and indexable but loses query-document interaction. A cross-encoder feeds the query and a candidate through the model together and outputs a relevance score, capturing fine-grained interaction.
Cross-encoders are typically far more accurate, but their scores cannot be precomputed because each one depends on the specific query, so they run only on the shortlist.
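A back-of-the-envelope count of model forward passes shows why the cross-encoder is confined to the shortlist. The corpus size, query volume, and helper names below are hypothetical, and the count ignores the cost of the vector search itself and any difference in model size.

```python
def bi_encoder_model_calls(n_docs, n_queries):
    """Documents are encoded once offline; each query adds one encoding."""
    return n_docs + n_queries


def cross_encoder_model_calls(n_scored_per_query, n_queries):
    """Every (query, document) pair needs its own forward pass."""
    return n_scored_per_query * n_queries


# Hypothetical workload: one million chunks, one thousand queries, top-50 shortlist.
n_docs, n_queries, shortlist_size = 1_000_000, 1_000, 50

bi_encoder = bi_encoder_model_calls(n_docs, n_queries)               # 1_001_000
full_corpus = cross_encoder_model_calls(n_docs, n_queries)           # 1_000_000_000
shortlist_only = cross_encoder_model_calls(shortlist_size, n_queries)  # 50_000
```

Under these assumed numbers, cross-encoding the whole corpus costs a billion passes, while cross-encoding only the shortlist adds fifty thousand on top of the bi-encoder's work.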
# Bi-encoder: score = cos(enc(q), enc(d)) -> precomputable
# Cross-encoder: score = model(q, d) -> 0..1 -> per-pair, no index
def cross_encode(query, doc):
    return cross_encoder.predict([(query, doc)])[0]  # joint attention