AI Prompt Engineering · Lesson

Caching Strategies for Prompts

Semantic caching, exact-match caching, and Anthropic prompt caching.

Why Cache Prompt Results?

LLM API calls are billed per token and are slow compared with a local lookup. Many production applications send the same (or very similar) prompts repeatedly. A cache stores each result and returns it when the same query recurs, so a cache hit makes no API call at all. The savings in cost and latency grow with the fraction of requests that repeat.
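
As a rough sketch of how the hit rate drives those savings, the helpers below assume a flat cost per call and a free cache hit, both simplifications. The function names and the workload figures are illustrative assumptions, not measurements.

```python
def expected_cost(total_requests: int, hit_rate: float, cost_per_call: float) -> float:
    """API spend when a fraction hit_rate of requests is served from cache.

    Assumes a flat cost per call and zero cost for a cache hit.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError('hit_rate must be between 0 and 1')
    return total_requests * (1.0 - hit_rate) * cost_per_call


def expected_latency(hit_rate: float, cache_seconds: float, api_seconds: float) -> float:
    """Mean latency per request: hits pay cache_seconds, misses pay api_seconds."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError('hit_rate must be between 0 and 1')
    return hit_rate * cache_seconds + (1.0 - hit_rate) * api_seconds


# Hypothetical workload: 10,000 requests at $0.002 per call.
print(round(expected_cost(10_000, 0.0, 0.002), 2))  # no cache: 20.0
print(round(expected_cost(10_000, 0.6, 0.002), 2))  # 60% hit rate: 8.0
```

Under these assumptions, spend falls linearly with the hit rate, so measuring how often requests actually repeat is the first step in deciding whether a cache is worth adding.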

Exact-Match Caching with Hash Keys

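The simplest cache: hash the exact request (the messages plus any parameters that affect the output, such as the model and max_tokens) and store the result under that hash. If an identical request appears again, return the cached result without calling the API. Any difference in the request, even a single extra space, produces a different key and therefore a cache miss.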

import hashlib
import json

class ExactMatchCache:
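    def __init__(self, backend=None):
        # backend: a dict (in-memory) or Redis client
        # Compare against None so that an empty dict passed in by the caller
        # is kept rather than silently replaced with a new one.
        self.store = backend if backend is not None else {}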

    def _key(self, messages, model, max_tokens):
        content = json.dumps({'messages': messages, 'model': model,
                               'max_tokens': max_tokens}, sort_keys=True)
        return 'llm:' + hashlib.sha256(content.encode()).hexdigest()

    def get(self, messages, model, max_tokens):
        key = self._key(messages, model, max_tokens)
        return self.store.get(key)

    def set(self, messages, model, max_tokens, result, ttl_seconds=3600):
        key = self._key(messages, model, max_tokens)
        # With a plain dict, ttl_seconds is ignored and entries never expire.
        self.store[key] = result
        # In Redis: self.store.setex(key, ttl_seconds, json.dumps(result))

cache = ExactMatchCache()

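# Usage
messages = [{'role': 'user', 'content': 'What is the capital of France?'}]
cached = cache.get(messages, 'gpt-4o-mini', 100)
if cached is not None:  # an empty string is still a valid cached result
    print('Cache HIT:', cached[:50])
else:
    print('Cache MISS: calling API...')
    # After the API call returns, store its text for next time, e.g.
    # cache.set(messages, 'gpt-4o-mini', 100, response_text)
    # (response_text is a placeholder for the text returned by the API)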
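
The usage above only checks the cache. A fuller read-through flow, in which a miss calls the model and stores the result, might look like the sketch below. It reuses the same keying scheme as `_key`. `InMemoryTTLStore`, `cached_completion`, and `fake_model` are hypothetical names introduced for illustration, and `fake_model` stands in for a real API client so the example runs without network access.

```python
import hashlib
import json
import time


def make_key(messages, model, max_tokens):
    # Same keying scheme as ExactMatchCache._key.
    payload = json.dumps({'messages': messages, 'model': model,
                          'max_tokens': max_tokens}, sort_keys=True)
    return 'llm:' + hashlib.sha256(payload.encode()).hexdigest()


class InMemoryTTLStore:
    """Hypothetical dict-backed store that honours a per-entry TTL."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (expires_at, value)
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # drop the stale entry lazily on read
            return None
        return value

    def set(self, key, value, ttl_seconds=3600):
        self._data[key] = (self._clock() + ttl_seconds, value)


def cached_completion(store, call_model, messages, model, max_tokens,
                      ttl_seconds=3600):
    """Return (text, was_hit); call_model is only invoked on a miss."""
    key = make_key(messages, model, max_tokens)
    cached = store.get(key)
    if cached is not None:
        return cached, True
    text = call_model(messages, model, max_tokens)
    store.set(key, text, ttl_seconds)
    return text, False


# Fake model function (an assumption): records each call and returns fixed text.
calls = []

def fake_model(messages, model, max_tokens):
    calls.append(messages)
    return 'Paris'


store = InMemoryTTLStore()
question = [{'role': 'user', 'content': 'What is the capital of France?'}]
first = cached_completion(store, fake_model, question, 'gpt-4o-mini', 100)
second = cached_completion(store, fake_model, question, 'gpt-4o-mini', 100)
print(first, second, len(calls))  # ('Paris', False) ('Paris', True) 1
```

Expiring entries on read keeps the sketch small, but keys that are never requested again stay in memory. A backend with native expiry, such as Redis, avoids that.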
