SRE & Reliability

Error budgets your engineers won't quietly resent

SLOs fail when they're handed down from on high. Here's how to set error budgets the team actually believes in — and uses.

A reliability operations dashboard with service health paths and burn-rate gauges.

Error budgets are simple math and hard politics. The math: if your SLO is 99.9%, you have 0.1% to spend on failure. The politics: who decides the number, and what happens when it runs out. Get the politics wrong and the whole thing becomes theater.
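
To make the math concrete, here is a minimal sketch that turns an SLO percentage into a time-equivalent budget. The function name `error_budget` and the 28-day window are assumptions chosen for illustration, not part of any standard library or tool.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the time-equivalent error budget for an SLO over a window.

    Hypothetical helper: assumes failures are spread evenly over time,
    which is only an approximation for request-based SLOs.
    """
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    budget_fraction = (100 - slo_percent) / 100
    return window * budget_fraction


# A 99.9% SLO over an assumed 28-day window.
budget = error_budget(99.9, timedelta(days=28))
print(f"{budget.total_seconds() / 60:.1f} minutes")  # prints "40.3 minutes"
```

Roughly forty minutes in four weeks is a small number, which is part of why the politics matter: one bad deploy can spend most of it.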

Start from user pain, not uptime

A good SLO measures something a user would actually complain about. "The API is up" is useless if every request takes nine seconds. Measure the experience:

  • Availability — fraction of requests that succeed.
  • Latency — fraction served under a threshold users notice.

SLO: 99.5% of /checkout requests succeed in < 800ms over 28 days
Budget: 0.5% → ~3.4 hours of failed or too-slow requests per 28 days
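
One way to compute that SLI from raw request records is sketched below. The `Request` shape, the "status below 500 counts as success" rule, and the `budget_report` helper are all assumptions for illustration; your own definition of a good request may differ.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def is_good(req: Request, threshold_ms: float = 800.0) -> bool:
    """Good = succeeded AND fast enough. Treating status < 500 as success is an assumption."""
    return req.status < 500 and req.latency_ms < threshold_ms


def budget_report(requests: list[Request], slo: float = 0.995,
                  threshold_ms: float = 800.0) -> dict[str, float]:
    """Summarize the SLI and the share of the error budget consumed."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in the window")
    bad = sum(not is_good(r, threshold_ms) for r in requests)
    allowed_bad = (1 - slo) * total  # the budget, expressed in requests
    return {
        "sli": (total - bad) / total,
        "bad": bad,
        "allowed_bad": allowed_bad,
        "budget_used": bad / allowed_bad,
    }


# Made-up sample: 996 good requests, 3 slow ones, 1 server error.
sample = ([Request(200, 120.0)] * 996
          + [Request(200, 950.0)] * 3
          + [Request(500, 40.0)])
report = budget_report(sample)
print(f"SLI {report['sli']:.2%}, budget used {report['budget_used']:.0%}")
```

Counting in requests rather than minutes keeps the budget tied to what users experienced: a slow minute at peak traffic costs more than a slow minute at 3 a.m.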

Let the budget drive decisions

The point of a budget is to make a tradeoff explicit:

Budget left over? Ship faster, take more risk. Budget burned? Freeze features and fix reliability. The number decides, not the loudest person in the incident review.
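
The rule above can be written down as a tiny policy function, so that the outcome is decided before the incident review starts. This is a sketch under assumptions: the `Action` names, the `decide` helper, and the simple "freeze when nothing is left" cutoff are illustrative, and a real policy would come out of the team's own agreement.

```python
from enum import Enum


class Action(Enum):
    SHIP = "ship faster, take more risk"
    FREEZE = "freeze features, fix reliability"


def decide(bad_events: int, total_events: int, slo: float = 0.995) -> tuple[Action, float]:
    """Return the agreed action and the remaining budget, in events.

    Hypothetical policy: any budget left means SHIP, none left means FREEZE.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1 - slo) * total_events
    remaining = allowed_bad - bad_events
    action = Action.SHIP if remaining > 0 else Action.FREEZE
    return action, remaining


# Made-up numbers: 1,000,000 checkout requests in the window.
print(decide(bad_events=2_000, total_events=1_000_000)[0].value)
print(decide(bad_events=6_000, total_events=1_000_000)[0].value)
```

The first call still has budget to spend and returns the ship action; the second has overspent and returns the freeze action.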

Set them with the team, not for the team

The fastest way to kill an SLO is to impose it. Run a short workshop: pick the one or two journeys that matter, agree on thresholds, and write down what the team will do when the budget burns. Ownership of the number is what makes people honor it.

Review monthly, adjust without shame

Your first SLO will be wrong: too tight or too loose. That's expected. Review the burn each month and adjust the target or the threshold when the data shows it is off. An SLO you never revisit is just a dashboard nobody trusts.
