Chunking Strategies for Long Texts
Fixed-size, sentence-boundary, and semantic chunking approaches.
Why Long Documents Are a Challenge
LLMs have a context window: a maximum number of tokens they can process at once. GPT-4 Turbo supports 128k tokens and recent Claude models support 200k, but many documents exceed even these limits. More importantly, research shows that model accuracy often degrades on very long contexts, particularly for information located in the middle of the input. Chunking is the practice of splitting documents into smaller pieces before processing.
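As a rough illustration of the limit described above, the following sketch checks whether a document fits in a context window and estimates how many pieces it would need otherwise. The function names, the `reserved_for_output` parameter, and the token counts are hypothetical and chosen only for illustration; a real pipeline would obtain the token count from the model's tokenizer.

```python
import math

def fits_in_context(num_tokens: int, context_limit: int,
                    reserved_for_output: int = 0) -> bool:
    """Return True if the input plus any tokens reserved for the reply fit."""
    return num_tokens + reserved_for_output <= context_limit

def chunks_needed(num_tokens: int, chunk_size: int) -> int:
    """Return how many chunks of at most chunk_size tokens cover num_tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(num_tokens / chunk_size)

# Hypothetical numbers: a 350k-token document against a 200k-token window.
doc_tokens = 350_000
print(fits_in_context(doc_tokens, context_limit=200_000))  # False
print(chunks_needed(doc_tokens, chunk_size=1000))          # 350
```

Reserving part of the window for the model's reply is a design choice in this sketch: the prompt and the generated output typically share the same token budget.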
Fixed-Size Chunking
Fixed-size chunking splits text every N tokens (or characters) regardless of content boundaries. It is the simplest strategy and requires no semantic understanding.
A typical chunk size is 500–1000 tokens. Such chunks are easy to index and retrieve, but a cut can land in the middle of a sentence, losing meaning at the boundary.
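import tiktoken

def fixed_chunk(text, max_tokens=1000):
    """Split text into chunks of at most max_tokens tokens each."""
    enc = tiktoken.get_encoding('cl100k_base')
    tokens = enc.encode(text)
    chunks = []
    # Step through the token list in strides of max_tokens
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        # Decode each slice of token IDs back into text
        chunks.append(enc.decode(chunk_tokens))
    return chunks

# Placeholder input; replace with the real document text
long_document = 'Chunking splits long text into smaller pieces. ' * 2000
chunks = fixed_chunk(long_document)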
print(f'Total chunks: {len(chunks)}')
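Fixed-size chunking can also count characters instead of tokens, and a small character-based example makes the boundary problem easy to see. The sketch below is a minimal illustration, not part of the lesson's original code: the helper name `fixed_chunk_chars`, the sample text, and the 40-character limit are all assumptions chosen to force a cut in the middle of a word.

```python
def fixed_chunk_chars(text: str, max_chars: int = 40) -> list[str]:
    """Split text every max_chars characters, ignoring content boundaries."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Hypothetical sample text for illustration
sample = (
    "Chunking splits long documents into pieces. "
    "Fixed-size chunks ignore sentence boundaries."
)

for n, chunk in enumerate(fixed_chunk_chars(sample, max_chars=40)):
    print(n, repr(chunk))
```

With this input the first chunk ends partway through the word "pieces" and the second ends partway through "boundaries", so neither cut falls on a sentence boundary. Joining the chunks still reproduces the original text exactly; only the boundaries are misplaced.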