FastAPI Backend Development Bootcamp · Lesson

Alerting on SLOs and Error Budgets

Define service-level objectives and wire actionable alerts that fire before users notice degradation.

Why Alert on SLOs, Not Raw Metrics

Traditional alerts fire on raw thresholds such as CPU > 90% or error_count > 100. The problem: they page you for conditions users never notice, and they stay silent during slow degradation that does hurt customers.

SLO-based alerting inverts this. You first define what "good service" means to a user, then alert only when you are at risk of breaking that promise.

  • SLI (Service Level Indicator): a measured ratio, e.g. fraction of fast, successful requests.
  • SLO (Service Level Objective): the target for that SLI, e.g. 99.9% over 30 days.
  • Error budget: the allowed failure, i.e. 100% minus the SLO. A 99.9% SLO leaves a 0.1% budget, roughly 43 minutes of full outage over 30 days.
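
The arithmetic behind an error budget can be sketched in a few lines. This is a minimal illustration, not part of the lesson's original code: the function names `error_budget` and `allowed_failures` are hypothetical, and the 30-day window and 99.9% target are simply the example values from the list above.

```python
def error_budget(slo: float, window_days: int = 30) -> dict:
    """Translate an SLO target (a fraction such as 0.999) into an error budget."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    budget_fraction = 1 - slo                 # share of events allowed to be bad
    window_minutes = window_days * 24 * 60    # length of the SLO window
    return {
        "budget_fraction": budget_fraction,
        # Minutes of total outage that would spend the entire budget.
        "allowed_bad_minutes": budget_fraction * window_minutes,
    }


def allowed_failures(slo: float, total_requests: int) -> float:
    """Number of bad requests the budget tolerates for a given traffic volume."""
    return (1 - slo) * total_requests


budget = error_budget(0.999)
print(f"budget fraction     = {budget['budget_fraction']:.3%}")
print(f"full-outage minutes = {budget['allowed_bad_minutes']:.1f}")
print(f"failures per 1M req = {allowed_failures(0.999, 1_000_000):.0f}")
```

For a 99.9% target this prints a 0.100% budget, 43.2 minutes, and 1000 failed requests per million. Expressing the budget in requests or minutes tends to make it easier to reason about than the bare percentage.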

In this lesson you will define SLOs for a FastAPI service and wire alerts that fire before users notice degradation.

Picking a Good SLI for an API

A good SLI is a ratio of good events to valid events, scaled 0 to 100%. For a FastAPI backend the two workhorse SLIs are:

  • Availability: successful responses / all valid responses. Treat 5xx as failures; 4xx responses are usually excluded from the valid set, since they are the client's fault.
  • Latency: requests served under a threshold / all valid requests, e.g. the fraction of responses faster than 300ms.

Below is a tiny, self-contained calculator that turns raw request logs into these two SLIs.

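def compute_slis(requests, latency_threshold_ms=300):
    """Return availability and latency SLIs as fractions between 0 and 1."""
    # Valid events: drop 4xx responses, which reflect client errors.
    valid = [r for r in requests if not 400 <= r["status"] < 500]
    total = len(valid)
    if total == 0:
        # No valid traffic in the window, so both SLIs are undefined.
        return {"availability": None, "latency": None}
    # Good for availability: any valid response that is not a 5xx.
    good_avail = sum(1 for r in valid if r["status"] < 500)
    # Good for latency: served at or under the threshold.
    fast = sum(1 for r in valid if r["latency_ms"] <= latency_threshold_ms)
    return {"availability": good_avail / total, "latency": fast / total}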


sample = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 410},
    {"status": 500, "latency_ms": 90},
    {"status": 200, "latency_ms": 250},
    {"status": 503, "latency_ms": 600},
]

slis = compute_slis(sample)
print(f"availability = {slis['availability']:.2%}")
print(f"latency      = {slis['latency']:.2%}")
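
Once the SLIs are computed, they can be compared against their targets to see how much of the error budget has been spent. The sketch below is one hedged way to do that: `check_slo` and the two targets (99.9% availability, 95% latency) are assumptions made for illustration, and the measured values are the 60% figures the sample above yields (3 of 5 requests succeed, 3 of 5 are fast).

```python
def check_slo(name, sli, slo):
    """Compare a measured SLI with its SLO target (both fractions in 0..1)."""
    budget = 1 - slo           # bad fraction the SLO allows
    bad = 1 - sli              # bad fraction actually observed
    consumed = bad / budget    # 1.0 means the budget is exactly spent
    return {"name": name, "met": sli >= slo, "budget_consumed": consumed}


# Hypothetical targets; choose values that match your own service.
targets = {"availability": 0.999, "latency": 0.95}
# SLIs measured from the five-request sample above.
measured = {"availability": 0.60, "latency": 0.60}

for name, slo in targets.items():
    result = check_slo(name, measured[name], slo)
    status = "OK" if result["met"] else "BREACH"
    print(f"{name}: {status}, budget consumed = {result['budget_consumed']:.0f}x")
```

Both SLIs breach their targets here, with availability consuming roughly 400 times its budget. A sample of five requests is far too small to alert on in practice; the point is only the comparison itself.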
