Dense vs Sparse Retrieval: Trade-offs
Understand when dense embeddings miss exact keyword matches and when BM25 misses semantic paraphrases, and why combining both consistently outperforms either alone.
Two Fundamentally Different Retrieval Signals
Modern retrieval systems rely on two distinct signals: dense retrieval encodes meaning into continuous vector spaces, while sparse retrieval counts exact term occurrences. These signals are complementary, not interchangeable. Understanding their individual strengths and weaknesses is the first step toward building a system that uses both effectively.
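To make the sparse side concrete, here is a minimal sketch of exact-term matching. It uses raw term counts rather than BM25 itself (no IDF weighting or length normalization), and the helper names `tokenize` and `term_overlap` and the sample documents are illustrative assumptions, not part of any library.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_overlap(query: str, document: str) -> int:
    """Sum the document's occurrences of each distinct query term.

    This is a simplified stand-in for a sparse score: it only sees
    exact token matches, which is the property being illustrated.
    """
    doc_counts = Counter(tokenize(document))
    return sum(doc_counts[term] for term in set(tokenize(query)))


query = "How do I cancel my subscription?"
paraphrase = "Steps to unsubscribe from the service"
exact = "To cancel a subscription, open Billing and choose Cancel."

print(term_overlap(query, paraphrase))  # 0: no shared tokens, so the paraphrase is invisible
print(term_overlap(query, exact))       # 3: "cancel" twice plus "subscription" once
```

The paraphrase scores zero even though it answers the question, because "unsubscribe" and "subscription" are different tokens. This is the gap a dense signal is meant to cover.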
How Dense Embeddings Work
Dense retrieval maps both the query and each document into a high-dimensional vector using a neural encoder. Similarity is measured by cosine similarity or dot product between the vectors. Because the encoder is trained on large text corpora, semantically related phrases end up near each other in vector space even if they share no words. This is the key advantage of dense retrieval.
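The relationship between the two similarity measures can be checked offline with hand-made vectors. The sketch below uses pure Python and toy 3-dimensional vectors (an assumption for illustration; real embeddings have hundreds or thousands of dimensions) to show that dot product depends on vector length while cosine does not, and that the two agree once vectors are normalized to unit length.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a: list[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


# Toy vectors chosen for illustration only.
query_vec = [3.0, 4.0, 0.0]
same_direction = [6.0, 8.0, 0.0]   # same direction, twice the length
orthogonal = [0.0, 0.0, 5.0]       # shares no direction with the query

print(dot(query_vec, same_direction))     # raw dot product grows with vector length
print(cosine(query_vec, same_direction))  # cosine ignores length
print(cosine(query_vec, orthogonal))      # unrelated directions score zero

# After unit-normalization, dot product and cosine give the same value.
print(dot(normalize(query_vec), normalize(same_direction)))
```

When an encoder already returns unit-length vectors, a plain dot product is enough to rank by cosine similarity. Otherwise the explicit division by the norms matters.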
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    # Request one embedding vector for the given text.
    resp = client.embeddings.create(
        model='text-embedding-3-small',
        input=text,
    )
    return resp.data[0].embedding

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q = embed('How do I cancel my subscription?')
d = embed('Steps to unsubscribe from the service')
print(cosine_similarity(q, d))  # high similarity despite different words