AI Engineering Academy · Lesson

Chunking Strategies: Fixed vs Sentence vs Recursive

Implement and compare fixed-size, sentence-boundary, and recursive character text splitters, and understand how chunk size and overlap affect retrieval quality.

Why Chunking Quality Matters

Chunking is the process of splitting loaded documents into smaller pieces that fit within the LLM's context window and can be individually indexed and retrieved. How you chunk is one of the biggest factors in retrieval quality. When a chunk boundary splits an answer across two pieces, neither piece alone is sufficient to answer the question. A chunk that mixes two unrelated topics gets retrieved for questions about both but is useful for neither.
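To make the split-answer problem concrete, here is a minimal sketch. The document text, the `split_plain` helper, and the sizes are hypothetical and chosen only for illustration. It checks whether the sentence carrying the answer survives intact in any single chunk, first without overlap and then with overlap.

```python
def split_plain(text, size, overlap=0):
    """Return chunks of up to `size` characters, starting every (size - overlap) characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Hypothetical document; the second sentence holds the fact we want to retrieve.
text = (
    "The service retries failed requests three times. "
    "The default timeout for each request is 30 seconds. "
    "Logs are rotated daily."
)
answer = "timeout for each request is 30 seconds"

no_overlap = split_plain(text, size=80)
with_overlap = split_plain(text, size=80, overlap=40)

# Is the full fact present in at least one chunk?
print(any(answer in chunk for chunk in no_overlap))
print(any(answer in chunk for chunk in with_overlap))
```

With these inputs the 80-character cut lands inside the second sentence, so the first check prints False. With a 40-character overlap, one chunk covers the whole fact and the second check prints True. The overlap here is deliberately large for a short example; it is not a recommended setting.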

Fixed-Size Chunking

Fixed-size chunking splits text into chunks of N characters or N tokens, regardless of sentence or paragraph boundaries; only the final chunk may be shorter. It is the simplest strategy and fast to implement. The major drawback is that it often cuts sentences in half, creating chunks that start or end mid-thought. Overlapping consecutive chunks by a few characters repeats some text at each boundary, which softens the problem but does not remove it. This is acceptable for dense, uniformly formatted text like database export dumps, but it produces poor retrieval quality on narrative prose or technical documentation.

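def fixed_size_chunks(text, chunk_size=500, overlap=50):
    '''Split text into fixed-size character chunks with overlap'''
    # Guard against a zero or negative step, which would loop forever
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError('need chunk_size > 0 and 0 <= overlap < chunk_size')
    step = chunk_size - overlap  # overlap keeps context at boundaries
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # text is fully covered; skip a tail made only of overlap
        start += step
    return chunks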

example = 'This is a long document. ' * 100
chunks = fixed_size_chunks(example, chunk_size=200, overlap=20)
print(f'Produced {len(chunks)} chunks, first: {chunks[0][:80]}...')
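The same idea carries over to token-based chunks. The sketch below is an assumption-laden variant: it treats whitespace-separated words as stand-ins for model tokens, and the function name `fixed_size_token_chunks` is hypothetical. A real pipeline would count tokens with the tokenizer of its embedding model, so actual chunk boundaries would differ.

```python
def fixed_size_token_chunks(text, chunk_size=100, overlap=10):
    """Split text into chunks of at most chunk_size whitespace-delimited words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()  # assumption: words approximate model tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        end = start + chunk_size
        chunks.append(" ".join(tokens[start:end]))
        if end >= len(tokens):
            break  # all tokens covered; skip a tail made only of overlap
    return chunks

example = "This is a long document. " * 100  # 500 words
token_chunks = fixed_size_token_chunks(example, chunk_size=100, overlap=10)
print(f"Produced {len(token_chunks)} chunks")
```

Because the split happens on word boundaries, this variant never cuts a word in half, but it still ignores sentence boundaries, so the mid-thought problem described above remains.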
