NestJS Enterprise Backend APIs · Lesson

Defining SLOs and Error Budgets

Translate latency and error metrics into measurable service-level objectives and alerting.

From Metrics to Promises

Your NestJS API already emits latency and error metrics. But a raw number like p99 = 412ms says nothing about whether the service is healthy until you compare it to a target. An SLO (Service-Level Objective) turns a metric into a promise: "99.9% of requests succeed within 300ms over a rolling 28-day window."

SLI — the Service-Level Indicator: the actual measured quantity (e.g. fraction of good requests).
SLO — the target you commit to for that SLI (e.g. 99.9%).
SLA — the Service-Level Agreement: the contract with customers that spells out the consequences (refunds, credits) if the agreed service level is missed.

In this lesson you translate your existing latency and error metrics into SLOs, derive an error budget, and wire up budget-burn alerting.
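To make "error budget" concrete, here is a minimal sketch of the arithmetic for the example SLO quoted earlier (99.9% over 28 days). The budget is simply the complement of the target: the fraction of valid requests that may be bad before the SLO is missed. The type and function names (Slo, errorBudgetFraction, allowedBadRequests, budgetMinutes) and the 10,000,000 request volume are illustrative assumptions, not part of any library.

```typescript
// Hypothetical shape for an SLO; field names are assumptions for this sketch.
type Slo = {
  target: number;     // e.g. 0.999 for 99.9%
  windowDays: number; // length of the rolling window
};

// The error budget is the complement of the target.
function errorBudgetFraction(slo: Slo): number {
  return 1 - slo.target;
}

// How many bad requests the budget tolerates for a given number of valid requests.
// The tiny epsilon guards against floating-point results such as 99.99999999999997.
function allowedBadRequests(slo: Slo, validRequests: number): number {
  return Math.floor(validRequests * errorBudgetFraction(slo) + 1e-6);
}

// The same budget expressed as minutes of total outage within the window.
function budgetMinutes(slo: Slo): number {
  return slo.windowDays * 24 * 60 * errorBudgetFraction(slo);
}

const slo: Slo = { target: 0.999, windowDays: 28 };

console.log(allowedBadRequests(slo, 10_000_000)); // 10000
console.log(budgetMinutes(slo).toFixed(2));       // "40.32"
```

Under these assumptions, a service handling ten million valid requests per window can afford 10,000 bad ones, or roughly 40 minutes of complete outage.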

Defining a Good Event

Every SLI is a ratio of good events / valid events. The hard part is defining "good" precisely. For an HTTP API, a request is usually valid if it reaches your handler (exclude 404s for unknown routes and client-cancelled requests), and good if it is both fast enough and not a server error.

Availability SLI: good = status code not in 5xx.
Latency SLI: good = served under a threshold (e.g. 300ms).

Note that 4xx responses are normally not failures of your service — a 400 Bad Request means the client sent bad input. Counting them against your budget would punish you for client mistakes.
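The two SLIs can also be tracked separately, which makes it easier to see whether errors or slowness is eating the budget. The sketch below assumes the input has already been filtered down to valid requests, and it treats 4xx responses as good for availability, in line with the paragraph above. The names Outcome, availabilitySli, latencySli, and the sample data are hypothetical.

```typescript
type Outcome = { statusCode: number; latencyMs: number };

const THRESHOLD_MS = 300; // latency threshold from the example SLO

// An empty window contains no bad events, so report 1 instead of dividing by zero.
function ratio(good: number, total: number): number {
  return total === 0 ? 1 : good / total;
}

// Availability: anything that is not a 5xx counts as good, including 4xx.
function availabilitySli(outcomes: Outcome[]): number {
  const good = outcomes.filter((o) => o.statusCode < 500).length;
  return ratio(good, outcomes.length);
}

// Latency: good means served within the threshold, regardless of status code.
// Assumption: fast failures count as fast here; some teams measure latency
// only on successful responses instead.
function latencySli(outcomes: Outcome[]): number {
  const good = outcomes.filter((o) => o.latencyMs <= THRESHOLD_MS).length;
  return ratio(good, outcomes.length);
}

// Made-up sample of already-valid requests.
const recent: Outcome[] = [
  { statusCode: 200, latencyMs: 120 },
  { statusCode: 400, latencyMs: 15 },  // client error: still "available"
  { statusCode: 503, latencyMs: 40 },  // server error: bad for availability
  { statusCode: 200, latencyMs: 450 }, // too slow: bad for latency
  { statusCode: 201, latencyMs: 310 }, // too slow: bad for latency
];

console.log('availability:', availabilitySli(recent)); // 0.8
console.log('latency:', latencySli(recent));           // 0.6
```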

type RequestOutcome = {
  statusCode: number;
  latencyMs: number;
};

const LATENCY_THRESHOLD_MS = 300;

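function isValid(o: RequestOutcome): boolean {
  // Simplification: drop every 4xx from the denominator, since client errors
  // are not our fault. This is coarser than excluding only unknown-route 404s
  // and client-cancelled requests.
  return o.statusCode < 400 || o.statusCode >= 500;
}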

function isGood(o: RequestOutcome): boolean {
  // Good = not a server error AND served within the latency threshold.
  const serverError = o.statusCode >= 500;
  const tooSlow = o.latencyMs > LATENCY_THRESHOLD_MS;
  return !serverError && !tooSlow;
}

const sample: RequestOutcome = { statusCode: 200, latencyMs: 142 };
console.log('valid:', isValid(sample), 'good:', isGood(sample));
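Classifying one request is only the first step. The SLI is the ratio over a whole window, and budget-burn alerting compares the observed bad fraction to the allowed one. The sketch below repeats isValid and isGood so it runs on its own, then computes the SLI and a burn rate (observed bad fraction divided by the error budget). The names computeSli, burnRate, repeat, and observed, along with the sample traffic, are assumptions for illustration.

```typescript
type RequestOutcome = { statusCode: number; latencyMs: number };

const LATENCY_THRESHOLD_MS = 300;
const SLO_TARGET = 0.999; // 99.9%, from the example SLO

// Same rules as above, repeated so this block is self-contained.
const isValid = (o: RequestOutcome): boolean =>
  o.statusCode < 400 || o.statusCode >= 500;
const isGood = (o: RequestOutcome): boolean =>
  o.statusCode < 500 && o.latencyMs <= LATENCY_THRESHOLD_MS;

// SLI = good events / valid events over the window.
function computeSli(outcomes: RequestOutcome[]): number {
  const valid = outcomes.filter(isValid);
  if (valid.length === 0) return 1; // no valid traffic, so nothing was bad
  return valid.filter(isGood).length / valid.length;
}

// Burn rate = observed bad fraction / allowed bad fraction.
// A value of 1 spends the budget exactly over the full window.
function burnRate(sli: number, sloTarget: number): number {
  return (1 - sli) / (1 - sloTarget);
}

// Helper to build made-up traffic.
const repeat = (n: number, o: RequestOutcome): RequestOutcome[] =>
  Array.from({ length: n }, () => ({ ...o }));

const observed: RequestOutcome[] = [
  ...repeat(996, { statusCode: 200, latencyMs: 120 }), // good
  ...repeat(2, { statusCode: 503, latencyMs: 40 }),    // server errors
  ...repeat(2, { statusCode: 200, latencyMs: 450 }),   // too slow
  ...repeat(50, { statusCode: 400, latencyMs: 10 }),   // excluded by isValid
];

const sli = computeSli(observed);
console.log('SLI:', sli);                                       // 0.996
console.log('burn rate:', burnRate(sli, SLO_TARGET).toFixed(1)); // "4.0"
```

A burn rate of 4 means the budget is being spent four times faster than the SLO allows, so a 28-day budget would be gone in about 7 days if the rate held. Which burn rate should page someone is a choice each team has to make.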
