Context Compression
Trimming context to what matters.
Why Compress Context
Retrieved chunks are noisy: a relevant chunk may be mostly boilerplate with one load-bearing sentence. Context compression trims retrieved text down to what actually answers the query before it reaches the generator.
Benefits compound: lower token cost, reduced latency, fewer distractors, and relief from lost-in-the-middle by shrinking the context the model must traverse.
def compress(query, chunks):
# Goal: keep only spans that bear on the query,
# dropping boilerplate, navigation, and off-topic sentences.
return [extract_relevant(query, c) for c in chunks]Extractive vs Abstractive
Extractive compression selects verbatim spans (sentences, passages) relevant to the query, preserving exact wording and provenance. Abstractive compression paraphrases or summarizes, achieving higher compression but risking information loss and introduced errors.
For factual RAG with citations, prefer extractive to keep claims traceable to source; use abstractive only when faithfulness can be verified.
def extractive(query, chunk):
sents = split_sentences(chunk.text)
scored = [(s, relevance(query, s)) for s in sents]
kept = [s for s, r in scored if r > TAU]
return ' '.join(kept) or top1(scored) # never return emptyAll lessons in this course
- Beyond Naive RAG
- Re-ranking Retrieved Chunks
- Context Compression
- Query Rewriting and HyDE