AI Engineering Academy · Lesson

OpenAI Prompt Prefix Caching

Leverage OpenAI's automatic prompt caching, which bills a repeated long prompt prefix at a 50 percent discount, and structure your prompts to maximize cache hit rates.

What Is Prompt Prefix Caching?

Prompt prefix caching is a server-side optimization built into OpenAI's API that automatically discounts prompt-prefix tokens the service already processed in a recent request. Unlike application-level caching, which returns a stored response, prompt prefix caching still runs the model on every request, but bills the cached portion of the prefix at a reduced input-token price: 50 percent off in this lesson's examples, although the exact discount varies by model, so check current pricing. The response is always freshly generated; only the cost of reading the repeated prefix goes down.
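To make the discount concrete, here is a minimal sketch of the billing arithmetic. The function name `input_cost_usd`, the price of $2.50 per million input tokens, and the token counts are all hypothetical values chosen for illustration; substitute the real price and discount for your model.

```python
def input_cost_usd(prompt_tokens: int, cached_tokens: int,
                   price_per_million: float, cached_discount: float = 0.5) -> float:
    """Estimate input cost when part of the prompt is billed at the cached rate."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    full_rate = price_per_million / 1_000_000         # USD per uncached token
    cached_rate = full_rate * (1 - cached_discount)   # USD per cached token
    return uncached_tokens * full_rate + cached_tokens * cached_rate


# Hypothetical price: $2.50 per million input tokens.
cold = input_cost_usd(2048, 0, price_per_million=2.50)     # no cache hit
warm = input_cost_usd(2048, 1920, price_per_million=2.50)  # 1920 tokens cached
print(f"cold: ${cold:.6f}  warm: ${warm:.6f}  saved: {1 - warm / cold:.1%}")
```

Because only the cached prefix is discounted, the overall saving on input tokens is slightly below the headline 50 percent (about 47 percent with these numbers). Output tokens are billed as usual.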

How Prefix Caching Works Under the Hood

Transformer-based LLMs compute attention keys and values for every prompt token, and inference servers hold these in a KV (key-value) cache in GPU memory. If the first N tokens of a request are identical to the start of a recent earlier request, OpenAI can reuse the stored KV entries for that prefix and skip the expensive computation for those tokens. The match must be exact and must begin at the first token, so any change early in the prompt invalidates everything after it. The API does this automatically and transparently; you simply pay the lower cached-token rate when it applies.
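The prefix rule can be illustrated with a toy model. This sketch is not OpenAI's implementation: it only compares two token-ID lists and estimates how many leading tokens could be reused. The constants assume the behavior OpenAI's prompt caching documentation has described (a 1,024-token minimum and cache hits in 128-token increments), and the function names and token IDs are hypothetical.

```python
MIN_CACHEABLE_TOKENS = 1024  # assumed minimum shared prefix before caching applies
CACHE_INCREMENT = 128        # assumed granularity of cache hits


def shared_prefix_length(previous: list[int], current: list[int]) -> int:
    """Count how many leading tokens two token-ID sequences have in common."""
    length = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        length += 1
    return length


def estimated_cached_tokens(previous: list[int], current: list[int]) -> int:
    """Toy estimate of the cached token count for `current` sent after `previous`."""
    shared = shared_prefix_length(previous, current)
    if shared < MIN_CACHEABLE_TOKENS:
        return 0
    # Round down to the nearest full increment.
    return (shared // CACHE_INCREMENT) * CACHE_INCREMENT


system = list(range(1500))        # stand-in for a long, static system prompt
question_a = [9001, 9002, 9003]   # stand-ins for per-request user input
question_b = [9004, 9005]

# Static content first: the 1,500-token prefix is shared.
static_first = estimated_cached_tokens(system + question_a, system + question_b)
# Dynamic content first: the sequences diverge at the very first token.
dynamic_first = estimated_cached_tokens(question_a + system, question_b + system)
print(static_first, dynamic_first)  # 1408 0
```

This is why prompt structure matters: put static material (system prompt, instructions, few-shot examples) at the start and per-request material (user input, timestamps, retrieved documents) at the end.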

# No code changes needed to enable prefix caching!
# It is automatic on supported models.

# The API response shows you how many tokens were cached:
# response.usage.prompt_tokens_details.cached_tokens

# Example response usage:
# CompletionUsage(
#   prompt_tokens=2048,
#   completion_tokens=256,
#   total_tokens=2304,
#   prompt_tokens_details=PromptTokensDetails(
#     cached_tokens=1920,   # prompt tokens served from the KV cache (counted in 128-token increments)
#     audio_tokens=0,
#   )
# )
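To track hit rates over time, you can compute the cached share of each request from the usage object. The sketch below is one possible helper, not part of the OpenAI SDK: `cached_fraction` is a hypothetical name, and a `SimpleNamespace` stands in for `response.usage` so the example runs without an API call. It reads fields defensively because `prompt_tokens_details` may be absent on some responses.

```python
from types import SimpleNamespace


def cached_fraction(usage) -> float:
    """Return the share of prompt tokens billed at the cached rate (0.0 to 1.0)."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    prompt_tokens = getattr(usage, "prompt_tokens", 0) or 0
    return cached / prompt_tokens if prompt_tokens else 0.0


# Stand-in for response.usage with illustrative numbers.
usage = SimpleNamespace(
    prompt_tokens=2048,
    completion_tokens=256,
    total_tokens=2304,
    prompt_tokens_details=SimpleNamespace(cached_tokens=1920, audio_tokens=0),
)
print(f"{cached_fraction(usage):.1%} of prompt tokens were cached")
```

In a real application you would pass `response.usage` from a chat completion call and log the result per request; a fraction that stays near zero for long prompts usually means something dynamic sits too early in the prompt.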
