AI Prompt Engineering · Lesson

Chunking Strategies for Long Texts

Fixed-size, sentence-boundary, and semantic chunking approaches.

Why Long Documents Are a Challenge

LLMs have a context window, the maximum number of tokens they can process at once. For example, GPT-4 Turbo supports 128k tokens and Claude 3 models support 200k, but many documents exceed even these limits. More importantly, research on long-context models has found that accuracy often degrades as inputs grow, for instance when the relevant passage sits in the middle of the context. Chunking is the practice of splitting documents into smaller pieces before processing.

Fixed-Size Chunking

Fixed-size chunking splits text every N tokens (or characters) regardless of content boundaries. It is the simplest strategy and requires no semantic understanding.

Typical chunk size: 500–1000 tokens. Chunks are easy to index and retrieve but may split sentences mid-way, losing meaning at boundaries.

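import tiktoken

def fixed_chunk(text, max_tokens=1000):
    """Split text into chunks of at most max_tokens tokens each."""
    if max_tokens <= 0:
        raise ValueError('max_tokens must be a positive integer')
    # cl100k_base is the encoding used by GPT-4 and GPT-3.5 Turbo.
    enc = tiktoken.get_encoding('cl100k_base')
    tokens = enc.encode(text)
    chunks = []
    # Step through the token list in strides of max_tokens.
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        # Decode back to text; a cut can land mid-word or mid-sentence.
        chunks.append(enc.decode(chunk_tokens))
    return chunks

# Placeholder input; replace with the document you want to split.
long_document = 'Chunking splits long text into smaller pieces. ' * 500
chunks = fixed_chunk(long_document)
print(f'Total chunks: {len(chunks)}')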
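
The same idea works on characters instead of tokens, which removes the tokenizer dependency and makes the boundary problem easy to see. The sketch below is illustrative only: the function name fixed_chunk_chars, the sample sentence, and the 40-character size are assumptions chosen to keep the output small, not recommended settings.

```python
import math

def fixed_chunk_chars(text: str, size: int = 40) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    # Slice at every multiple of `size`, ignoring word and sentence boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

# Hypothetical sample text, 74 characters long.
sample = "Chunking splits documents into pieces. Each piece is processed separately."
pieces = fixed_chunk_chars(sample, size=40)

# The number of chunks is always ceil(len(text) / size).
expected_count = math.ceil(len(sample) / 40)

for piece in pieces:
    print(repr(piece))
# 'Chunking splits documents into pieces. E'
# 'ach piece is processed separately.'
```

The first chunk ends in the middle of the word "Each", which is the kind of boundary loss described above. Joining the pieces back together reproduces the original text exactly, so nothing is dropped; only the context around each cut is lost.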
