AI Engineering Academy · Lesson

Semantic Caching with Embeddings

Build a semantic cache that retrieves stored responses for semantically similar but not identical queries by comparing query embeddings to a cache of previous request embeddings.

The Limitation of Exact Caching

Exact caching only helps when users send byte-for-byte identical requests. In practice, users phrase the same question differently: 'How do I cancel my subscription?', 'What is the process to unsubscribe?', and 'Can I stop my plan?' all express the same intent but produce different cache keys, so an exact cache misses every one of these variants. Semantic caching addresses this by matching queries that are similar in meaning rather than identical in text, which raises the cache hit rate whenever a meaningful share of traffic consists of paraphrases.
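To make the miss concrete, here is a minimal sketch of an exact cache key. The `exact_cache_key` helper and the choice of SHA-256 over the raw query text are assumptions for illustration, not a scheme prescribed by this lesson; any key derived from the exact text behaves the same way.

```python
import hashlib


def exact_cache_key(query: str) -> str:
    """Hypothetical exact-cache key: SHA-256 of the raw query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


queries = [
    "How do I cancel my subscription?",
    "What is the process to unsubscribe?",
    "Can I stop my plan?",
]

# Each phrasing hashes to its own key, so none of them can reuse
# a response cached under another phrasing.
keys = {exact_cache_key(q) for q in queries}
print(len(keys), "distinct keys for", len(queries), "phrasings")
```

Normalizing case or whitespace before hashing would catch trivial variants, but not paraphrases like the three above.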

How Semantic Caching Works

A semantic cache stores the embedding of each cached query alongside the cached response. When a new query arrives, the cache embeds it and searches for the previously seen query whose embedding has the highest cosine similarity to it. If that similarity meets or exceeds a threshold, the cache returns the stored response. A threshold around 0.95 is a common starting point, but the right value depends on the embedding model and on how costly a wrong match would be. If no cached query clears the threshold, call the LLM, store the new response, and add the new query's embedding to the cache index for future lookups.

# Semantic cache flow
# 1. New query arrives: 'How do I cancel my subscription?'
# 2. Embed it: embed_query = embed('How do I cancel my subscription?')
# 3. Search cache index for nearest cached query embedding
# 4. Find cached: 'What is the process to unsubscribe?' (similarity=0.97)
# 5. 0.97 >= threshold (0.95) → cache HIT, return cached response
# 6. Otherwise (e.g. similarity=0.82 < 0.95) → cache MISS: call LLM, cache response, add query embedding to index
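
The flow above can be sketched in Python roughly as follows. This is an illustrative sketch under stated assumptions, not a production design: `SemanticCache`, `cosine_similarity`, `fake_llm`, and the hand-picked `TOY_VECTORS` are all hypothetical names, the `embed` and `call_llm` callables stand in for a real embedding model and LLM client, and the linear scan stands in for a proper vector index.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory semantic cache using a linear scan."""

    def __init__(
        self,
        embed: Callable[[str], Sequence[float]],  # assumed embedding function
        call_llm: Callable[[str], str],           # assumed LLM client
        threshold: float = 0.95,
    ) -> None:
        self.embed = embed
        self.call_llm = call_llm
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def get(self, query: str) -> Tuple[str, bool]:
        """Return (response, was_cache_hit)."""
        query_vec = list(self.embed(query))
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, True  # cache HIT
        # Cache MISS: call the LLM, then store the embedding and response.
        response = self.call_llm(query)
        self.entries.append((query_vec, response))
        return response, False


# Toy 2-D vectors chosen by hand so the example runs without a real model.
TOY_VECTORS = {
    "How do I cancel my subscription?": [1.0, 0.0],
    "What is the process to unsubscribe?": [0.98, 0.2],
    "What is your refund policy?": [0.0, 1.0],
}
llm_calls: List[str] = []


def fake_llm(query: str) -> str:
    llm_calls.append(query)
    return f"answer to: {query}"


cache = SemanticCache(embed=TOY_VECTORS.__getitem__, call_llm=fake_llm)
first = cache.get("How do I cancel my subscription?")      # nothing cached yet
second = cache.get("What is the process to unsubscribe?")  # close to the first query
third = cache.get("What is your refund policy?")           # unrelated query
```

With these toy vectors, the second query clears the 0.95 threshold and reuses the first query's response, while the third falls below it and triggers a new LLM call. A real deployment would likely replace the linear scan with an approximate nearest-neighbor index once the cache grows.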
