Choosing the Right Model for the Task
Learn how to cut RAG costs and latency by routing each request to the cheapest model that can do the job well, using model tiers, cascades, and quality gates.
Why Model Choice Drives Cost
In a RAG pipeline, the LLM call is usually the single biggest driver of both cost and latency. The same prompt sent to a flagship model can cost 20-50x more than it would on a small model.
Optimizing model selection is often the highest-leverage change you can make.
- Token prices differ from model to model
- Latency generally grows with model size
- Not every query needs the most capable model
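To make the price gap concrete, here is a rough back-of-the-envelope sketch. The model names and per-token prices are hypothetical placeholders chosen only for illustration, not real vendor rates; substitute your provider's current pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Per-token pricing for one model (illustrative numbers, not real rates)."""
    name: str
    usd_per_1m_input: float
    usd_per_1m_output: float


def request_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request."""
    total = (
        input_tokens * price.usd_per_1m_input
        + output_tokens * price.usd_per_1m_output
    )
    return total / 1_000_000


# Hypothetical price points, assumed here only to illustrate the gap.
SMALL = ModelPrice("small-model", usd_per_1m_input=0.25, usd_per_1m_output=1.00)
FLAGSHIP = ModelPrice("flagship-model", usd_per_1m_input=10.00, usd_per_1m_output=30.00)

# A typical RAG request: long prompt (question plus retrieved chunks), short answer.
input_tokens, output_tokens = 4_000, 300

small_cost = request_cost(SMALL, input_tokens, output_tokens)
flagship_cost = request_cost(FLAGSHIP, input_tokens, output_tokens)
ratio = flagship_cost / small_cost

print(f"small: ${small_cost:.4f}  flagship: ${flagship_cost:.4f}  ratio: {ratio:.1f}x")
```

Because RAG prompts are dominated by retrieved context, the input-token price tends to matter more than the output-token price when you compare models this way.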
Model Tiers
Group your available models into tiers by capability and price:
- Small / cheap — classification, extraction, simple Q&A
- Mid — most RAG answers grounded in retrieved context
- Large / flagship — multi-step reasoning, ambiguous queries
Default to the smallest tier that meets your quality bar.
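One way to apply "smallest tier that meets the quality bar" is a cascade: try the cheapest tier first and escalate only when a quality gate rejects the answer. The sketch below is a minimal illustration under stated assumptions. `Tier`, `cascade`, `simple_gate`, and the three stub model functions are hypothetical names invented here; in a real system the stubs would be calls to your model provider and the gate would be a stronger check (for example, a groundedness or citation check).

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# (question, context) -> answer text. Stands in for a real model client.
ModelFn = Callable[[str, str], str]
# (question, context, answer) -> True if the answer is good enough to return.
GateFn = Callable[[str, str, str], bool]


@dataclass(frozen=True)
class Tier:
    name: str
    generate: ModelFn


def cascade(
    question: str, context: str, tiers: Sequence[Tier], gate: GateFn
) -> Tuple[str, str]:
    """Try tiers from cheapest to most expensive; return (tier_name, answer).

    The first answer that passes the gate wins. If none passes, the answer
    from the last (largest) tier is returned as a fallback.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    answer = ""
    for tier in tiers:
        answer = tier.generate(question, context)
        if gate(question, context, answer):
            return tier.name, answer
    return tiers[-1].name, answer


def simple_gate(question: str, context: str, answer: str) -> bool:
    """Toy quality gate: reject empty answers and explicit refusals."""
    text = answer.strip().lower()
    return bool(text) and "i don't know" not in text


# Stub models (assumptions for the demo): each one "gives up" on questions
# longer than a word limit, so we can watch the cascade escalate.
def small_model(question: str, context: str) -> str:
    return "Short answer." if len(question.split()) <= 6 else "I don't know"


def mid_model(question: str, context: str) -> str:
    return "Grounded answer." if len(question.split()) <= 12 else "I don't know"


def large_model(question: str, context: str) -> str:
    return "Multi-step reasoned answer."


TIERS = [
    Tier("small", small_model),
    Tier("mid", mid_model),
    Tier("large", large_model),
]

if __name__ == "__main__":
    for q in [
        "What is the capital of France?",
        "Why did revenue fall in Q3 while headcount grew?",
    ]:
        tier_name, answer = cascade(q, "retrieved context here", TIERS, simple_gate)
        print(f"{tier_name}: {answer}")
```

A cascade pays for the failed cheap attempts whenever it escalates, so it only saves money when most requests pass the gate at a low tier. If you can classify queries up front, routing directly to the right tier avoids that overhead.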