Alerting on Latency, Cost, and Quality Degradation
Define alert thresholds on p99 latency, per-request cost, and automated quality scores, and route alerts to Slack or PagerDuty when your LLM pipeline degrades.
Why Alerting Matters for LLM Systems
LLM applications fail in ways that are subtle and gradual. A prompt change might increase average latency by 30%, a retrieval upgrade might slightly reduce answer quality, or a surge in usage might push daily costs 5x above budget. Without proactive alerting, you discover these problems only when users complain or the monthly bill arrives. Alerts transform reactive fire-fighting into proactive operations.
The Three Alert Categories
LLM application alerts fall into three categories. Latency alerts fire when response time exceeds a user experience threshold (e.g., p99 > 10 seconds). Cost alerts fire when per-request cost or daily spend exceeds budget thresholds, preventing bill shock. Quality alerts fire when automated quality metrics (LLM-as-judge scores, user satisfaction rates, faithfulness scores) drop below an acceptable floor. Each category requires different instrumentation and different alert channels.
from dataclasses import dataclass

@dataclass
class AlertThresholds:
    # Latency (milliseconds)
    p50_latency_ms: int = 2000  # median should be under 2s
    p99_latency_ms: int = 10000  # 99th percentile under 10s
    # Cost (USD)
    max_cost_per_request: float = 0.05  # alert if one request costs > 5 cents
    max_daily_spend: float = 50.00  # alert if daily spend exceeds $50
    # Quality (0.0 to 1.0 scale)
    min_quality_score: float = 0.75  # alert if rolling avg drops below 75%
    min_faithfulness: float = 0.80  # alert if RAGAS faithfulness drops below 80%

DEFAULT_THRESHOLDS = AlertThresholds()
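To make the thresholds actionable, something has to compare a window of recent measurements against them. The following is a minimal sketch of that step, not a definitive implementation. The names `Alert`, `ROUTES`, `percentile`, and `check_window` are hypothetical, the choice of which category goes to PagerDuty and which to Slack is an assumption, and actual delivery to those services is left out. The `AlertThresholds` dataclass is repeated only so the sketch runs on its own.

```python
import math
from dataclasses import dataclass


@dataclass
class AlertThresholds:
    # Same fields and defaults as the lesson's dataclass, repeated so the sketch runs alone.
    p50_latency_ms: int = 2000
    p99_latency_ms: int = 10000
    max_cost_per_request: float = 0.05
    max_daily_spend: float = 50.00
    min_quality_score: float = 0.75
    min_faithfulness: float = 0.80


@dataclass
class Alert:
    category: str  # "latency", "cost", or "quality"
    metric: str    # name of the threshold that was breached
    value: float   # observed value in this window
    threshold: float


# Hypothetical routing table: the channel chosen per category is an assumption.
ROUTES = {"latency": "pagerduty", "cost": "slack", "quality": "slack"}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_window(latencies_ms, costs_usd, quality, faithfulness, daily_spend, t):
    """Compare one window of samples against thresholds t and return the breached Alerts."""
    alerts = []

    def over(category, metric, value, limit):
        # Latency and cost alerts fire when the value rises above a ceiling.
        if value > limit:
            alerts.append(Alert(category, metric, value, limit))

    def under(category, metric, value, floor):
        # Quality alerts fire when the value drops below a floor.
        if value < floor:
            alerts.append(Alert(category, metric, value, floor))

    if latencies_ms:
        over("latency", "p50_latency_ms", percentile(latencies_ms, 50), t.p50_latency_ms)
        over("latency", "p99_latency_ms", percentile(latencies_ms, 99), t.p99_latency_ms)
    if costs_usd:
        over("cost", "max_cost_per_request", max(costs_usd), t.max_cost_per_request)
    over("cost", "max_daily_spend", daily_spend, t.max_daily_spend)
    if quality:
        under("quality", "min_quality_score", sum(quality) / len(quality), t.min_quality_score)
    if faithfulness:
        under("quality", "min_faithfulness", sum(faithfulness) / len(faithfulness), t.min_faithfulness)
    return alerts


# Usage with made-up sample data: two slow requests, one expensive request, low faithfulness.
alerts = check_window(
    latencies_ms=[1000] * 98 + [12000, 15000],
    costs_usd=[0.01, 0.02, 0.08],
    quality=[0.9, 0.8, 0.7],
    faithfulness=[0.70, 0.75],
    daily_spend=20.0,
    t=AlertThresholds(),
)
for a in alerts:
    print(f"[{ROUTES[a.category]}] {a.metric}: observed {a.value}, threshold {a.threshold}")
```

Returning plain `Alert` records keeps threshold evaluation separate from delivery, so the same check can feed whichever notification client you use. The nearest-rank percentile is one simple choice; with small windows the p99 is effectively the maximum, so the window size matters as much as the threshold.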