LLM Apps in Production (RAG + Vector DB + Caching) · Lesson

Choosing the Right Model for the Task

Learn how to cut RAG costs and latency by routing each request to the cheapest model that can do the job well, using model tiers, cascades, and quality gates.

Why Model Choice Drives Cost

In a RAG pipeline, the LLM call is usually the single biggest driver of both cost and latency. The same prompt can cost 20-50x more on a flagship model than on a small one.

Optimizing model selection is often the highest-leverage change you can make.

Price per token differs by model
Latency generally grows with model size
Not every query needs the biggest brain
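
To make the price gap concrete, here is a minimal sketch of per-request cost arithmetic. The tier names and per-million-token prices are made-up placeholders, not real provider rates, and the token counts are an assumed typical RAG request (retrieved context plus a short answer). Substitute your own price sheet and measured token counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Price in USD per one million tokens (hypothetical numbers)."""
    input_per_mtok: float
    output_per_mtok: float


# Illustrative prices only; replace with your provider's current rates.
PRICES = {
    "small": ModelPrice(input_per_mtok=0.20, output_per_mtok=0.80),
    "flagship": ModelPrice(input_per_mtok=5.00, output_per_mtok=20.00),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request on the given tier."""
    price = PRICES[tier]
    total = (input_tokens * price.input_per_mtok
             + output_tokens * price.output_per_mtok)
    return total / 1_000_000


# Assumed request shape: 3,000 prompt tokens (question + retrieved chunks)
# and 300 generated tokens.
small_cost = request_cost("small", 3_000, 300)
flagship_cost = request_cost("flagship", 3_000, 300)
cost_ratio = flagship_cost / small_cost

print(f"small: ${small_cost:.5f}  flagship: ${flagship_cost:.5f}  "
      f"ratio: {cost_ratio:.0f}x")
```

With these placeholder prices the flagship request costs 25 times as much as the small one. Multiply by your daily request volume to see how quickly the difference adds up.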

Model Tiers

Group your available models into tiers by capability and price:

Small / cheap — classification, extraction, simple Q&A
Mid — most RAG answers grounded in retrieved context
Large / flagship — multi-step reasoning, ambiguous queries

Default to the smallest tier that meets your quality bar.
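
One way to apply this rule is a cascade: try the cheapest tier first and escalate only when a quality gate rejects the answer. The sketch below is illustrative, not a definitive implementation. The stub models stand in for real LLM calls, and `simple_gate` is a deliberately naive, hypothetical check; in practice the gate might be a groundedness score, a schema validator, or a classifier.

```python
from typing import Callable, Sequence

Model = Callable[[str], str]       # takes a query, returns an answer
Gate = Callable[[str, str], bool]  # (query, answer) -> is it good enough?


def cascade(query: str,
            tiers: Sequence[tuple[str, Model]],
            gate: Gate) -> tuple[str, str]:
    """Try tiers from cheapest to most expensive.

    Returns (tier_name, answer) for the first answer that passes the gate.
    If no answer passes, returns the last (largest) tier's answer.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    name, answer = "", ""
    for name, model in tiers:
        answer = model(query)
        if gate(query, answer):
            break
    return name, answer


# --- Hypothetical stand-ins for real model calls -------------------------
def small_model(query: str) -> str:
    # Pretend the small model only handles simple factual lookups.
    return "I don't know" if "why" in query.lower() else "Paris"


def large_model(query: str) -> str:
    return "Because the retrieved documents indicate a policy change."


def simple_gate(query: str, answer: str) -> bool:
    # Naive gate: reject empty answers and explicit refusals.
    text = answer.strip().lower()
    return bool(text) and "i don't know" not in text


TIERS = [("small", small_model), ("large", large_model)]

easy = cascade("What is the capital of France?", TIERS, simple_gate)
hard = cascade("Why did the refund policy change?", TIERS, simple_gate)
print(easy[0], hard[0])
```

The easy question is answered by the small tier and never reaches the large model. The hard one pays for both calls, so a cascade only saves money when most traffic passes the gate at a cheap tier.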
