Beyond Naive RAG
Limitations of basic retrieval.
What Naive RAG Does
Naive RAG is the baseline: chunk documents, embed them, store vectors, embed the query, retrieve top-k by cosine similarity, stuff the chunks into the prompt, and generate. It is a strong starting point, but it fails in predictable ways at scale.
Understanding those failure modes is the prerequisite for the advanced techniques (re-ranking, compression, query rewriting) covered in this course.
def naive_rag(query, k=5):
    q = embed(query)
    chunks = vector_store.search(q, k)  # top-k by cosine
    context = '\n\n'.join(c.text for c in chunks)
    return llm('Context:\n' + context + '\n\nQ: ' + query)

Retrieval Recall vs Precision
Naive top-k ranks purely by raw vector similarity, which conflates relevance with surface semantic closeness. This creates a tension: a small k risks missing the answer (low recall), while a large k floods the context with distractors (low precision).
The embedding similarity that drives retrieval is a coarse proxy for true relevance, and it is the root of several downstream problems.
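The tension between small and large k can be made concrete by measuring precision and recall at different cutoffs. The sketch below is illustrative only: the chunk ids, the ranking, and the set of relevant chunks are made-up data, and `precision_recall_at_k` is a hypothetical helper, not part of any library.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for a single query."""
    top_k = ranked_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    precision = hits / k                  # share of retrieved chunks that are relevant
    recall = hits / len(relevant_ids)     # share of relevant chunks that were retrieved
    return precision, recall


# Hypothetical similarity ranking for one query (best match first).
ranked = ["c2", "c7", "c9", "c4", "c1", "c5", "c3", "c8"]
# Hypothetical ground truth: the chunks that actually contain the answer.
relevant = {"c2", "c5"}

for k in (1, 3, 6):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

With this toy ranking, k=1 retrieves only relevant material but finds just one of the two relevant chunks, while k=6 recovers both at the cost of four distractors in the context.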
# The core dilemma
# small k -> may miss the gold chunk (recall problem)
# large k -> distractors crowd context (precision + cost problem)
# Advanced RAG decouples 'retrieve many' from 'use few'
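One way to picture "retrieve many, use few" is a two-stage pipeline: a cheap first pass fetches a wide candidate set, then a second scorer re-orders it and only the best few chunks reach the prompt. The sketch below is a toy under stated assumptions: bag-of-words cosine stands in for embedding similarity, and `term_coverage` stands in for a stronger re-ranker (in practice this is often a cross-encoder or similar model). All function names and the sample chunks are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a stand-in for a real embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def term_coverage(query, chunk):
    """Stand-in re-ranker: fraction of query terms that appear in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def retrieve_then_rerank(query, chunks, rerank_score, fetch_k=20, use_n=3):
    q = bow(query)
    # Stage 1: retrieve many (wide net, favors recall).
    candidates = sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)[:fetch_k]
    # Stage 2: use few (re-score candidates, keep only the best for the prompt).
    reranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return reranked[:use_n]


chunks = [
    "the capital of france is paris",
    "france france france travel tips",
    "paris hilton biography",
    "the capital of germany is berlin",
]
top = retrieve_then_rerank("capital of france", chunks, term_coverage, fetch_k=3, use_n=2)
print(top)
```

Here `fetch_k` controls recall and `use_n` controls how much context the model sees, so the two can be tuned independently instead of being tied to a single k.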