AI Prompt Engineering · Lesson

Caching Long Prefixes

Cost control with prompt caching.

Why Prefix Caching Exists

Every long prompt pays a prefill cost: the model must encode all input tokens before it can generate the first output token. When the same long prefix repeats across many calls (a system prompt, a tool spec, a large reference document), prompt caching lets the provider reuse the already-computed attention state instead of recomputing it.

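A cache hit skips prefill for the cached span, which lowers time to first token and, where the provider bills cached tokens at a reduced rate, input cost.
The savings grow with prefix length and with how often the prefix is reused.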
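
The scaling claim above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the flat per-token price, and the discount factor are all assumptions, not any provider's actual pricing model. It also assumes the first call misses, every later call hits, and there is no surcharge for writing to the cache and no cache expiry.

```python
def estimate_input_cost(prefix_tokens, suffix_tokens, calls,
                        price_per_token, cached_discount):
    """Return (uncached_cost, cached_cost) for `calls` requests sharing one prefix.

    Hypothetical billing model: cached prefix tokens cost
    price_per_token * cached_discount; all other input tokens cost full price.
    """
    # Cost of one call where nothing is served from the cache.
    full_call = (prefix_tokens + suffix_tokens) * price_per_token
    # Cost of one call where the whole prefix is served from the cache.
    hit_call = (prefix_tokens * cached_discount + suffix_tokens) * price_per_token

    uncached = calls * full_call
    # First call populates the cache at full price; the rest are hits.
    cached = full_call + (calls - 1) * hit_call
    return uncached, cached


# Made-up numbers: a 1000-token prefix, 100 variable tokens, 10 calls.
uncached, cached = estimate_input_cost(
    prefix_tokens=1000, suffix_tokens=100, calls=10,
    price_per_token=1.0, cached_discount=0.5,
)
print(uncached, cached)
```

Raising either `prefix_tokens` or `calls` widens the gap between the two totals, which is the sense in which savings scale with prefix size and reuse frequency.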

Caching Is Prefix-Anchored

Caches are keyed on an exact token-prefix match from the start of the prompt. The reusable span runs from the beginning up to the first point of divergence, so a change near the start means everything after it misses the cache.

Implication: the layout that maximizes hits puts the most stable content first and the most variable content last.

# Cache reuse covers: [identical prefix .... first difference)
# One early edit invalidates the whole downstream cache.
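
A minimal sketch of why layout matters, assuming whitespace splitting as a stand-in for a real tokenizer. The names `shared_prefix_len`, `SYSTEM`, `DOC`, and the sample requests are hypothetical; real caches may also work at coarser granularity than single tokens.

```python
def shared_prefix_len(a, b):
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break  # first difference: nothing after this point is reusable
        n += 1
    return n


# Stable content reused on every call (toy whitespace "tokens").
SYSTEM = "You are a support agent for Acme .".split()
DOC = "Refund policy : items may be returned within 30 days .".split()

# Variable content that differs per call.
user_a = "How do I reset my password ?".split()
user_b = "Can I cancel my order ?".split()

# Layout 1: stable content first, variable content last.
stable_first_a = SYSTEM + DOC + user_a
stable_first_b = SYSTEM + DOC + user_b

# Layout 2: variable content first, stable content after it.
variable_first_a = user_a + SYSTEM + DOC
variable_first_b = user_b + SYSTEM + DOC

print(shared_prefix_len(stable_first_a, stable_first_b))      # whole stable block is shared
print(shared_prefix_len(variable_first_a, variable_first_b))  # diverges at the first token
```

With the stable-first layout the shared prefix covers all of `SYSTEM` and `DOC`; with the variable-first layout the two prompts diverge at the first token, so none of the identical content that follows can be reused.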
