Context Windows: Size and Implications
Learn what the context window is, how it constrains conversation length and document processing, and compare context sizes across GPT-4o, Claude, and Gemini.
What Is the Context Window?
The context window is the maximum number of tokens an LLM can process in a single API call. It includes everything: the system prompt, all previous conversation turns, any documents you inject for retrieval-augmented generation (RAG), and the space reserved for the model's response. If the total exceeds the context window, the API typically rejects the request with an error instead of returning a response.
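The budget described above is simple addition, and a minimal sketch makes it concrete. The function name `fits_in_context` and the token counts in the example are hypothetical, chosen for illustration. In practice the counts would come from the tokenizer of the model you are calling.

```python
def fits_in_context(system_tokens, history_tokens, document_tokens,
                    max_output_tokens, context_window):
    """Return (fits, remaining) for one request.

    Every part of the request shares the same window, including the
    space reserved for the model's response.
    """
    total = (system_tokens + history_tokens
             + document_tokens + max_output_tokens)
    return total <= context_window, context_window - total


# Illustrative numbers only: a 128,000-token window with 4,096 tokens
# reserved for the response.
fits, remaining = fits_in_context(
    system_tokens=500,
    history_tokens=20_000,
    document_tokens=100_000,
    max_output_tokens=4_096,
    context_window=128_000,
)
print(fits, remaining)  # True 3404
```

A negative `remaining` value shows how many tokens must be cut before the request can be sent.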
Think of the context window as the model's working memory. Unlike a human, who can remember past conversations across sessions, an LLM has no persistent memory: it can only "know" what is present in the current context window. When a conversation grows beyond the window, something has to be dropped, usually the oldest content, which can cause the model to lose track of important earlier context.
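One common way to handle an overflowing conversation is to keep only the most recent turns that fit. The sketch below is one possible approach, not a prescribed method. `trim_history` and `estimate_tokens` are hypothetical names, and the rule of four characters per token is a rough assumption for English text that a real tokenizer should replace.

```python
def estimate_tokens(text):
    # Rough assumption: about 4 characters per token for English text.
    # Swap in the target model's tokenizer for real use.
    return max(1, len(text) // 4)


def trim_history(turns, budget, count_tokens=estimate_tokens):
    """Keep the most recent turns whose combined size fits in `budget` tokens."""
    kept = []
    used = 0
    # Walk backwards from the newest turn so the oldest are dropped first.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


turns = ["a" * 40, "b" * 40, "c" * 40]  # about 10 estimated tokens each
recent = trim_history(turns, budget=25)
print(len(recent))  # 2
```

The dropped turn is gone from the model's view entirely, which is exactly how earlier context gets lost. Summarizing old turns before dropping them is an alternative that trades some detail for retained context.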
Context Window Sizes in 2025
Context windows have grown dramatically. In 2020, GPT-3 offered 2,048 tokens. By 2025, leading models offer:
- GPT-4o and GPT-4o-mini: 128,000 tokens (~100,000 words)
- Claude 3.5 Sonnet and Claude 3 Opus: 200,000 tokens
- Gemini 1.5 Pro: up to 2,000,000 tokens (1,000,000 at launch)
- Gemini 1.5 Flash: 1,000,000 tokens
A 128K context window can hold approximately 300 pages of text, a complete novel, or an entire medium-sized codebase. Even so, long context is not a solved problem: the cost of standard self-attention grows quadratically with sequence length, so very long contexts are expensive and can be less accurate than shorter, focused ones.
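To see what quadratic growth means in practice, the sketch below compares the attention cost of two context lengths. It is a simplification that models only the pairwise token-to-token term of standard self-attention. Production systems use optimizations, and other parts of the model scale linearly, so treat the result as an upper-level intuition, not a pricing formula. The function name `relative_attention_cost` is hypothetical.

```python
def relative_attention_cost(tokens, baseline_tokens):
    """Ratio of pairwise attention work between two sequence lengths.

    Standard self-attention compares every token with every other token,
    so the work scales with the square of the length.
    """
    return (tokens / baseline_tokens) ** 2


# A 128,000-token context is 16x longer than an 8,000-token one,
# but the pairwise attention work is 16 squared.
print(relative_attention_cost(128_000, 8_000))  # 256.0
```

This is why filling a large window with loosely relevant text is rarely the best choice: a shorter, focused context costs far less to process.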