Chunking Strategies: Fixed vs Sentence vs Recursive
Implement and compare fixed-size, sentence-boundary, and recursive character text splitters, and understand how chunk size and overlap affect retrieval quality.
Why Chunking Quality Matters
Chunking is the process of splitting loaded documents into smaller pieces that fit within the LLM's context window and can be individually indexed and retrieved. How you chunk affects retrieval quality more than almost any other factor. When a chunk boundary splits an answer across two pieces, neither piece alone is sufficient to answer the question. A chunk that mixes two unrelated topics gets retrieved for questions about both but is useful for neither.
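To make the split-answer problem concrete, here is a minimal sketch. The sample text, the `answer` string, and the 50-character size are made up for illustration. A naive split with no overlap cuts the answer sentence at the boundary, so no single chunk contains it.

```python
# Hypothetical sample text and answer phrase, chosen only for illustration.
text = "The warranty covers parts and labor. Coverage lasts 24 months from purchase."
answer = "Coverage lasts 24 months"

# Naive split into 50-character pieces with no overlap.
size = 50
chunks = [text[i:i + size] for i in range(0, len(text), size)]

for i, chunk in enumerate(chunks):
    # Report whether each chunk holds the complete answer phrase.
    print(i, repr(chunk), answer in chunk)
```

The boundary falls inside the word "lasts", so a retriever that returns either chunk alone hands the LLM only part of the answer.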
Fixed-Size Chunking
Fixed-size chunking splits text into chunks of N characters or N tokens, regardless of sentence or paragraph boundaries; only the final chunk may be shorter. It is the simplest strategy and fast to implement. The major drawback is that it often cuts sentences in half, creating chunks that start or end mid-thought. This is acceptable for dense, uniformly formatted text like database export dumps, but it produces poor retrieval quality on narrative prose or technical documentation.
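def fixed_size_chunks(text, chunk_size=500, overlap=50):
    '''Split text into fixed-size character chunks with overlap.'''
    # Guard against a non-positive step, which would make the loop run forever
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError('require chunk_size > 0 and 0 <= overlap < chunk_size')
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])  # the final chunk may be shorter
        if end >= len(text):
            break  # avoid a trailing chunk that only repeats the overlap
        start += chunk_size - overlap  # overlap keeps context at boundaries
    return chunks

example = 'This is a long document. ' * 100
chunks = fixed_size_chunks(example, chunk_size=200, overlap=20)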
print(f'Produced {len(chunks)} chunks, first: {chunks[0][:80]}...')
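The description of fixed-size chunking mentions tokens as an alternative unit to characters. The sketch below shows one way that variant might look. It is an approximation: the function name `fixed_size_token_chunks` is hypothetical, and whitespace-separated words stand in for tokens. A real model tokenizer splits text differently, so actual token counts will not match.

```python
def fixed_size_token_chunks(text, chunk_size=100, overlap=10):
    """Split text into chunks of at most chunk_size whitespace-separated words.

    Assumption: words approximate tokens; swap in a real tokenizer if exact
    token counts matter.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


sample = "This is a long document. " * 100
token_chunks = fixed_size_token_chunks(sample, chunk_size=50, overlap=5)
print(f"Produced {len(token_chunks)} chunks of up to 50 words each")
```

Counting in tokens rather than characters keeps words intact, but it still ignores sentence boundaries, so chunks can begin or end mid-sentence just as in the character-based version.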