Semantic Caching with Embeddings
Build a semantic cache that retrieves stored responses for semantically similar but not identical queries by comparing query embeddings to a cache of previous request embeddings.
The Limitation of Exact Caching
Exact caching only helps when users send byte-for-byte identical requests. In reality, users phrase the same question differently: 'How do I cancel my subscription?', 'What is the process to unsubscribe?', and 'Can I stop my plan?' all intend the same question but produce different cache keys. Exact caching misses all these variants. Semantic caching solves this by matching similar queries instead of identical ones, dramatically increasing cache hit rates.
How Semantic Caching Works
A semantic cache stores the embedding of each cached query alongside the cached response. When a new query arrives, embed it and search the cache for a previously seen query with high cosine similarity. If the closest cached query is above a similarity threshold (typically 0.95+), return its cached response. If no match is found, call the LLM, store the new response, and add the new query's embedding to the cache index for future lookups.
# Semantic cache flow
# 1. New query arrives: 'How do I cancel my subscription?'
# 2. Embed it: embed_query = embed('How do I cancel my subscription?')
# 3. Search cache index for nearest cached query embedding
# 4. Find cached: 'What is the process to unsubscribe?' (similarity=0.97)
# 5. 0.97 >= threshold (0.95) → cache HIT, return cached response
# 6. If 0.82 < threshold → cache MISS, call LLM, cache result, add embedding to indexAll lessons in this course
- Exact Caching with Redis
- Semantic Caching with Embeddings
- OpenAI Prompt Prefix Caching
- Batching, Model Routing, and Cost Dashboards