Alerting on SLOs and Error Budgets
Define service-level objectives and wire actionable alerts that fire before users notice degradation.
Why Alert on SLOs, Not Raw Metrics
Traditional alerts fire on raw metric thresholds such as CPU > 90% or error_count > 100. The problem: they page you for things users never notice, yet stay silent during slow degradation that hurts customers.
SLO-based alerting inverts this. You first define what "good service" means to a user, then alert only when you are at risk of breaking that promise.
- SLI (Service Level Indicator): a measured ratio, e.g. fraction of fast, successful requests.
- SLO (Service Level Objective): the target for that SLI, e.g. 99.9% over 30 days.
- Error budget: the share of events allowed to fail, i.e. 100% minus the SLO target. A 99.9% SLO leaves a 0.1% budget.
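To make the error budget arithmetic concrete, here is a minimal sketch. The helper names `error_budget` and `budget_consumed` and the traffic numbers are illustrative assumptions, not part of any library or of the service built in this lesson.

```python
def error_budget(slo_target, window_days=30):
    """Return the budget as a fraction and as minutes of full downtime."""
    budget_fraction = 1.0 - slo_target
    window_minutes = window_days * 24 * 60
    return budget_fraction, budget_fraction * window_minutes


def budget_consumed(bad_events, valid_events, slo_target):
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    allowed_bad = (1.0 - slo_target) * valid_events
    if allowed_bad == 0:
        # A 100% target (or no traffic) leaves no room for failures.
        return float("inf") if bad_events else 0.0
    return bad_events / allowed_bad


fraction, minutes = error_budget(0.999, window_days=30)
print(f"budget = {fraction:.3%} of events, about {minutes:.1f} minutes of full downtime")

# Hypothetical traffic: 1,000,000 valid requests so far, 400 of them failed.
used = budget_consumed(bad_events=400, valid_events=1_000_000, slo_target=0.999)
print(f"budget consumed = {used:.0%}")
```

With a 99.9% target over 30 days, the budget works out to roughly 43.2 minutes of total outage, or 1,000 failed requests per million. In the hypothetical traffic above, 400 failures use about 40% of it.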
In this lesson you will define SLOs for a FastAPI service and wire alerts that fire before users notice degradation.
Picking a Good SLI for an API
A good SLI is a ratio of good events to valid events, expressed as a percentage from 0 to 100%. For a FastAPI backend the two workhorse SLIs are:
- Availability: successful responses / all valid responses. Treat 5xx as failures; usually exclude 4xx from the valid set (client's fault).
- Latency: valid requests served within a threshold / all valid requests, e.g. responses that complete within 300 ms.
Below is a tiny, self-contained calculator that turns raw request logs into these two SLIs.
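def compute_slis(requests, latency_threshold_ms=300):
    """Compute availability and latency SLIs from a list of request records."""
    # Valid events exclude 4xx responses, which are the client's fault.
    valid = [r for r in requests if not 400 <= r["status"] < 500]
    total = len(valid)
    if total == 0:
        # No valid traffic: the SLIs are undefined, so avoid dividing by zero.
        return {"availability": None, "latency": None}
    # Good for availability: any valid response that is not a 5xx.
    good_avail = sum(1 for r in valid if r["status"] < 500)
    # Good for latency: served at or under the threshold.
    fast = sum(1 for r in valid if r["latency_ms"] <= latency_threshold_ms)
    availability = good_avail / total
    latency_sli = fast / total
    return {"availability": availability, "latency": latency_sli}

sample = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 410},
    {"status": 500, "latency_ms": 90},
    {"status": 200, "latency_ms": 250},
    {"status": 503, "latency_ms": 600},
]

slis = compute_slis(sample)
print(f"availability = {slis['availability']:.2%}")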
print(f"latency = {slis['latency']:.2%}")