AI Engineering Academy · Lesson

Timeout Budgets and Graceful Degradation

Set aggressive timeout budgets at each layer of your pipeline, and implement graceful degradation that serves cached or simplified responses when the LLM exceeds its budget.

What Is a Timeout Budget?

A timeout budget is the maximum total time allocated for a request to complete across all stages of your pipeline. Instead of setting an arbitrary timeout on each individual API call, you define the end-to-end budget for the user-facing operation and distribute it across retrieval, LLM generation, and post-processing steps. This keeps the response within a known time bound, even if some stages are slow.
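One way to make the end-to-end budget concrete is to carry a single deadline through the request and derive each stage's timeout from whatever time is left. The sketch below is a minimal illustration, not a prescribed implementation: the `Deadline` class, its method names, and the injectable `clock` parameter are all hypothetical names introduced here.

```python
import time


class Deadline:
    """One end-to-end deadline shared by every stage of a request (hypothetical helper)."""

    def __init__(self, budget_ms: float, clock=time.monotonic) -> None:
        # A monotonic clock avoids surprises from wall-clock adjustments.
        self._clock = clock
        self._expires_at = clock() + budget_ms / 1000.0

    def remaining_ms(self) -> float:
        """Milliseconds left before the overall budget is exhausted (never negative)."""
        return max(0.0, (self._expires_at - self._clock()) * 1000.0)

    def stage_timeout_ms(self, stage_cap_ms: float) -> float:
        """Timeout for one stage: its own cap or the time left, whichever is smaller."""
        return min(stage_cap_ms, self.remaining_ms())
```

A stage then asks for `deadline.stage_timeout_ms(cap)` just before it starts, so a slow earlier stage automatically shrinks the time available to later ones.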

Distributing Budget Across Pipeline Stages

A typical RAG chat pipeline has three stages: retrieval, LLM generation, and response formatting. Assign a time slice to each based on how long each normally takes and how much slack users tolerate. The remaining slack is your degradation buffer. If any stage uses its full allocation, you start cutting corners in subsequent stages to stay within the overall budget.

# Total user-facing SLA: 8000ms
BUDGET_TOTAL_MS = 8000

BUDGET_STAGES = {
    'retrieval':   1500,  # vector search + rerank
    'llm_call':    5500,  # token streaming
    'formatting':  500,   # post-processing
    'slack':       500,   # buffer for overhead
}

assert sum(BUDGET_STAGES.values()) == BUDGET_TOTAL_MS
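Given a slice such as the 5500 ms `llm_call` allocation, graceful degradation can be sketched as a wrapper that gives up on the LLM once its slice is spent and returns a cached or simplified answer instead. This is a minimal sketch using `asyncio.wait_for` from the standard library; `generate_or_degrade`, `llm_call`, and `fallback` are hypothetical names, and the sketch assumes the LLM client is exposed as an async callable that returns a string.

```python
import asyncio
from typing import Awaitable, Callable


async def generate_or_degrade(
    llm_call: Callable[[], Awaitable[str]],
    fallback: str,
    timeout_ms: float,
) -> tuple[str, bool]:
    """Return (text, degraded); fall back if the LLM call exceeds timeout_ms."""
    try:
        text = await asyncio.wait_for(llm_call(), timeout=timeout_ms / 1000.0)
        return text, False
    except asyncio.TimeoutError:
        # wait_for cancels the in-flight call before raising, so nothing
        # keeps running in the background past its budget.
        return fallback, True
```

Returning the `degraded` flag alongside the text lets the caller log the event or tell the user that a simplified response was served.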
