Caching Strategies for Prompts
Semantic caching, exact-match caching, and Anthropic prompt caching.
Why Cache Prompt Results?
LLM API calls are expensive and slow. Many production applications send the same (or very similar) prompts repeatedly. Caching returns stored results for repeated queries, which eliminates redundant API calls and cuts both cost and latency for every request served from the cache.
Exact-Match Caching with Hash Keys
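The simplest cache hashes the exact request (the messages plus any parameters that affect the output, such as the model and max_tokens) and stores the result under that hash. If an identical request appears again, return the cached result without calling the API.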
import hashlib
import json


class ExactMatchCache:
    def __init__(self, backend=None):
        # backend: a dict (in-memory) or Redis client
        # Compare against None so an empty dict passed by the caller is kept.
        self.store = backend if backend is not None else {}

    def _key(self, messages, model, max_tokens):
        # sort_keys=True keeps the JSON, and so the hash, stable across dict orderings
        content = json.dumps({'messages': messages, 'model': model,
                              'max_tokens': max_tokens}, sort_keys=True)
        return 'llm:' + hashlib.sha256(content.encode()).hexdigest()

    def get(self, messages, model, max_tokens):
        key = self._key(messages, model, max_tokens)
        return self.store.get(key)

    def set(self, messages, model, max_tokens, result, ttl_seconds=3600):
        key = self._key(messages, model, max_tokens)
        # The in-memory dict ignores ttl_seconds, so entries never expire.
        self.store[key] = result
        # In Redis: self.store.setex(key, ttl_seconds, json.dumps(result))


cache = ExactMatchCache()

# Usage
messages = [{'role': 'user', 'content': 'What is the capital of France?'}]
cached = cache.get(messages, 'gpt-4o-mini', 100)
if cached is not None:
    print('Cache HIT:', cached[:50])
else:
    print('Cache MISS: calling API...')
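The usage snippet above stops at the miss branch and the in-memory store never expires entries. As a minimal sketch of how the full loop might look, the block below combines lookup, API call, and storage into one method and adds expiry for the in-memory case. The names TTLExactMatchCache, get_or_call, and fake_api are hypothetical and not part of the lesson's code; fake_api stands in for whatever client call your application actually makes.

```python
import hashlib
import json
import time


class TTLExactMatchCache:
    """In-memory exact-match cache whose entries expire after ttl_seconds."""

    def __init__(self, clock=time.monotonic):
        self._store = {}     # key -> (expires_at, result)
        self._clock = clock  # injectable so expiry can be tested without sleeping

    @staticmethod
    def make_key(messages, model, max_tokens):
        # Same keying scheme as the lesson: hash of the canonical JSON request.
        payload = json.dumps(
            {'messages': messages, 'model': model, 'max_tokens': max_tokens},
            sort_keys=True,
        )
        return 'llm:' + hashlib.sha256(payload.encode()).hexdigest()

    def get_or_call(self, messages, model, max_tokens, call_api, ttl_seconds=3600):
        """Return (result, was_hit). call_api runs only on a miss or an expired entry."""
        key = self.make_key(messages, model, max_tokens)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1], True
        result = call_api(messages, model, max_tokens)
        self._store[key] = (now + ttl_seconds, result)
        return result, False


def fake_api(messages, model, max_tokens):
    # Hypothetical stand-in for a real LLM client call.
    return 'Paris'


ttl_cache = TTLExactMatchCache()
msgs = [{'role': 'user', 'content': 'What is the capital of France?'}]
print(ttl_cache.get_or_call(msgs, 'gpt-4o-mini', 100, fake_api))  # ('Paris', False)
print(ttl_cache.get_or_call(msgs, 'gpt-4o-mini', 100, fake_api))  # ('Paris', True)
```

Returning the hit flag alongside the result makes it easy to log a hit rate, and passing the API call in as a function keeps the cache independent of any particular provider SDK. Expired entries are only overwritten on the next request for the same key, so a long-running process with many distinct prompts would still need periodic cleanup or a size limit.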