AI Engineering Academy · Lesson

Exact Caching with Redis

Cache LLM responses by hashing the complete prompt and storing the result in Redis with a TTL, serving identical requests instantly without any API call.

Why Cache LLM Responses?

LLM API calls are expensive: a single GPT-4o request can cost $0.005-$0.15 depending on token count. In many applications, a significant fraction of incoming queries repeat earlier ones: think FAQ bots, customer support systems, or code review tools where users ask the same questions again and again. Caching can eliminate 20-50 percent of API calls in these use cases, directly cutting costs and reducing latency. An exact cache only hits when a request is identical to a previous one; near-identical queries miss.

Exact Cache: Cache Key Design

An exact cache stores LLM responses keyed by a deterministic hash of the input. The cache key must capture every input that affects the output: the messages array, the model name, temperature, and any other parameters that change the response. Omitting any of these from the key causes false hits, where a cached response is served for a request that would have produced a different output. The function below covers the model, temperature, and messages; any other parameter your application varies needs to be added to the key in the same way.

import hashlib
import json

def make_cache_key(messages: list[dict], model: str, temperature: float) -> str:
    # Create a canonical, order-stable representation
    key_data = {
        'model': model,
        'temperature': float(temperature),  # normalize so 0 and 0.0 hash to the same key
        'messages': messages,  # list order matters
    }
    # Serialize to JSON with sorted keys for determinism
    serialized = json.dumps(key_data, sort_keys=True, ensure_ascii=False)
    # Hash to a fixed-length key safe for Redis
    return 'llm_cache:' + hashlib.sha256(serialized.encode()).hexdigest()
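
The key function on its own does not show the lookup-then-store flow, so here is a minimal sketch of it. It assumes a client exposing `get(key)` and `set(key, value, ex=seconds)`, which is how the redis-py `Redis` client behaves. The names `cached_completion`, `call_llm`, and `ttl_seconds`, and the one-hour default TTL, are illustrative assumptions and not part of the lesson. The key function is repeated so the sketch runs on its own.

```python
import hashlib
import json
from typing import Callable


def make_cache_key(messages: list[dict], model: str, temperature: float) -> str:
    # Mirrors the key scheme above, repeated here so the sketch is self-contained
    payload = {'model': model, 'temperature': temperature, 'messages': messages}
    serialized = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return 'llm_cache:' + hashlib.sha256(serialized.encode()).hexdigest()


def cached_completion(
    client,  # assumed: redis.Redis(...) or any object with get/set(ex=...)
    call_llm: Callable[[list[dict], str, float], str],  # hypothetical API wrapper
    messages: list[dict],
    model: str,
    temperature: float = 0.0,
    ttl_seconds: int = 3600,  # illustrative default, tune per use case
) -> tuple[str, bool]:
    """Return (response_text, was_cache_hit)."""
    key = make_cache_key(messages, model, temperature)

    cached = client.get(key)
    if cached is not None:
        # redis-py returns bytes unless the client was created
        # with decode_responses=True
        if isinstance(cached, bytes):
            cached = cached.decode('utf-8')
        return cached, True

    # Cache miss: call the model, then store the result with an expiry
    response = call_llm(messages, model, temperature)
    client.set(key, response, ex=ttl_seconds)
    return response, False
```

With redis-py installed and a server running, `client` could be created as `redis.Redis(host='localhost', port=6379, decode_responses=True)`; the host and port are placeholders. Returning the hit flag alongside the text makes it easy to log a hit rate. The TTL bounds how long a stale answer can be served.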
