AI Engineering Academy · Lesson

Contextual Compression and Relevance Filtering

Apply contextual compression to strip irrelevant sentences from retrieved chunks before feeding them to the LLM, reducing noise and saving tokens.

The Problem with Noisy Retrieved Chunks

Retrieved chunks often mix relevant and irrelevant content. A 500-token chunk about database indexing might answer the query in its first two sentences and then spend six more sentences on unrelated material about backup procedures. Sending the entire chunk to the LLM wastes tokens, lowers the signal-to-noise ratio, and can lead the model to ground its answer in the irrelevant portion instead of the relevant sentences.

What Is Contextual Compression?

Contextual compression is a post-retrieval step that takes each retrieved chunk and keeps only the sentences relevant to the query, so the LLM receives the extracted text instead of the full chunk. Compressing each chunk to its most relevant parts reduces token usage and can improve answer quality. LangChain's ContextualCompressionRetriever wraps any retriever with a compressor component that performs this extraction.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

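# LLMChainExtractor prompts an LLM to extract the query-relevant portion of
# each retrieved chunk, so every chunk costs one extra LLM call.
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

# base_retriever is assumed to be defined earlier: any LangChain retriever,
# for example one created with vector_store.as_retriever().
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)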

results = compression_retriever.invoke('how does HNSW indexing work?')
for doc in results:
    print(len(doc.page_content), 'chars:', doc.page_content[:150])
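
To make the extraction step concrete without an LLM call, here is a minimal, dependency-free sketch of sentence-level filtering. It is not how LLMChainExtractor works internally, since that component prompts an LLM. The word-overlap heuristic, the STOP_WORDS list, the sample chunk, and the names content_words and compress_chunk are all assumptions made up for illustration.

```python
import re

# Hypothetical stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "how", "does", "do", "to", "of",
    "in", "and", "for", "on", "it", "what", "at",
}


def content_words(text: str) -> set[str]:
    """Lowercase the text and return its alphanumeric words minus stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def compress_chunk(chunk: str, query: str, min_overlap: int = 1) -> str:
    """Keep only sentences that share at least `min_overlap` content words with the query."""
    query_words = content_words(query)
    # Naive sentence split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if len(content_words(s) & query_words) >= min_overlap]
    return " ".join(kept)


# Sample chunk (made up): two sentences about HNSW, two about backups.
chunk = (
    "HNSW indexing organizes vectors into a multi-layer proximity graph. "
    "An HNSW search enters at the sparse top layer and greedily descends to denser layers. "
    "Backups run nightly and are kept for thirty days. "
    "Restore drills are scheduled every quarter."
)
query = "how does HNSW indexing work?"

compressed = compress_chunk(chunk, query)
print(len(chunk), "chars before,", len(compressed), "chars after")
print(compressed)
```

A lexical filter like this drops any relevant sentence that never repeats a query term, which is the gap an LLM-based compressor is meant to close, at the price of an extra model call per chunk.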
