Exact Caching with Redis
Cache LLM responses by hashing the complete prompt and storing the result in Redis with a TTL, serving identical requests instantly without any API call.
Why Cache LLM Responses?
LLM API calls are expensive: a single GPT-4o request can cost $0.005-$0.15 depending on token count. In many applications, a significant fraction of incoming queries are identical or near-identical to previous ones. Typical examples are FAQ bots, customer support systems, and code review tools, where users ask the same questions repeatedly. Caching can eliminate 20-50 percent of API calls in these use cases, directly cutting costs and reducing latency.
Exact Cache: Cache Key Design
An exact cache stores LLM responses keyed by a deterministic hash of the input. The cache key must capture every input that affects the output: the messages array, the model name, temperature, and any other parameters that change the response. Omitting any of these from the key causes cache collisions: a response cached for one request is served for a request that differs in, for example, model or temperature.
import hashlib
import json


def make_cache_key(messages: list[dict], model: str, temperature: float) -> str:
    # Create a canonical, order-stable representation
    key_data = {
        'model': model,
        'temperature': temperature,
        'messages': messages,  # list order matters
    }
    # Serialize to JSON with sorted keys for determinism
    serialized = json.dumps(key_data, sort_keys=True, ensure_ascii=False)
    # Hash to a fixed-length key safe for Redis
    return 'llm_cache:' + hashlib.sha256(serialized.encode()).hexdigest()
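The key function covers only half of the flow described at the top: the lookup and the TTL-bound write are still missing. The following is a minimal sketch of that read-through pattern, not a definitive implementation. The names `cached_completion`, `call_llm`, `ttl_seconds`, and `InMemoryStore` are hypothetical and introduced only for illustration. The `store` argument is assumed to behave like a redis-py client, that is, `get(name)` returning bytes or `None` and `set(name, value, ex=seconds)`. The key function is repeated so the block runs on its own.

```python
import hashlib
import json
from typing import Callable, Optional


def make_cache_key(messages: list[dict], model: str, temperature: float) -> str:
    # Same key construction as above, repeated so this sketch is self-contained
    key_data = {"model": model, "temperature": temperature, "messages": messages}
    serialized = json.dumps(key_data, sort_keys=True, ensure_ascii=False)
    return "llm_cache:" + hashlib.sha256(serialized.encode("utf-8")).hexdigest()


class InMemoryStore:
    """Hypothetical stand-in for a Redis client, for trying the flow without a server.

    It mimics only get() and set(..., ex=...) and ignores the TTL entirely.
    """

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def get(self, name: str) -> Optional[bytes]:
        return self._data.get(name)

    def set(self, name: str, value: str, ex: Optional[int] = None) -> None:
        # A real Redis server would expire the key after `ex` seconds
        self._data[name] = value.encode("utf-8")


def cached_completion(
    store,
    call_llm: Callable[[list[dict], str, float], str],
    messages: list[dict],
    model: str,
    temperature: float,
    ttl_seconds: int = 3600,  # assumed default, tune per use case
) -> str:
    key = make_cache_key(messages, model, temperature)

    cached = store.get(key)
    if cached is not None:
        # Cache hit: no API call. Clients may return bytes or str.
        return cached.decode("utf-8") if isinstance(cached, bytes) else cached

    # Cache miss: call the model, then store the response with a TTL
    response = call_llm(messages, model, temperature)
    store.set(key, response, ex=ttl_seconds)
    return response
```

With a real deployment, `store` would be something like `redis.Redis(host="localhost", port=6379)` from the redis-py package, and `call_llm` would wrap the provider's API call. Because the key includes `temperature`, the same messages sent at a different temperature miss the cache and trigger a fresh call.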