SLOs, alerting & on-call

Dashboards without SLOs are decoration; alerts without runbooks are noise. This guide turns Prometheus metrics into an SLI, protects it with a 99.9% SLO, pages on burn-rate symptoms, and connects remaining error budget to deploy gates—so on-call wakes up for customers, not CPU graphs.

Prerequisites: the Observability explained guide (SLO vocabulary) and RED metrics already exported from your API.

After reading, you should be able to:

Define an availability SLI as a Prometheus recording rule.
Configure multi-window burn-rate alert rules.
Route pages vs tickets in Alertmanager.
Write a one-page runbook linked from every alert.
Describe when to freeze releases based on error budget.

Figure: error budget remaining over 30 days, with fast and slow burn-rate alert windows. Burn-rate alerts detect budget consumption early: a fast window for outages, a slow window for leaks.

Step 1 — Pick one SLI to start

For an HTTP API, a practical availability SLI:

Proportion of valid requests that succeed (non-5xx) and complete under 500ms.

Write the SLI in PromQL first—if you cannot query it, you cannot SLO it.
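
As a rough illustration of what that query has to compute, the sketch below evaluates the same definition over a handful of in-memory requests. The `Request` type and the `availability_sli` function are hypothetical names for this example, and it assumes every request counts as valid (the same denominator the Step 2 rules use).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int         # HTTP status code
    duration_ms: float  # server-side latency in milliseconds


def availability_sli(requests: list[Request], latency_slo_ms: float = 500.0) -> float | None:
    """Return good / valid requests, or None when there is no traffic."""
    if not requests:
        return None  # avoid 0/0; "no data" is neither 100% nor 0%
    good = sum(1 for r in requests if r.status < 500 and r.duration_ms < latency_slo_ms)
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(503, 35.0),   # server error: bad
    Request(200, 900.0),  # too slow: bad
]
print(availability_sli(sample))  # 0.5 (2 good out of 4)
```

Returning `None` for an empty window is a deliberate choice in this sketch: a quiet period should not count as either a perfect or a failed interval.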

Step 2 — Recording rules (SLI + error ratio)

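Extend the rules from the metrics guide. The rules below cover only the non-5xx half of the SLI from Step 1; the 500ms latency condition would need a latency metric (for example a histogram) and is not shown here: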

# rules/slo-checkout.yaml
groups:
  - name: slo_checkout
    interval: 30s
    rules:
      - record: sli:checkout:requests_total:rate5m
        expr: sum(rate(http_requests_total{service="checkout-api"}[5m]))

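      - record: sli:checkout:errors_total:rate5m
        # "or vector(0)" keeps the series present when no 5xx has been recorded yet
        expr: sum(rate(http_requests_total{service="checkout-api",status=~"5.."}[5m])) or vector(0)

      - record: sli:checkout:availability:ratio5m
        # "> 0" drops the sample when there is no traffic instead of recording 0/0 = NaN
        expr: |
          (
            sli:checkout:requests_total:rate5m
            - sli:checkout:errors_total:rate5m
          )
          / (sli:checkout:requests_total:rate5m > 0)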

      - record: sli:checkout:error_budget:remaining30d
        expr: |
          1 - (
            (1 - avg_over_time(sli:checkout:availability:ratio5m[30d]))
            / (1 - 0.999)
          )

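0.999 is the SLO target (99.9% over 30 days). avg_over_time weights every recorded sample equally, so the budget is time-weighted rather than request-weighted, and the 30d range requires at least 30 days of Prometheus retention. Adjust the window and target per product, and document them in a service README.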
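
To make the target concrete, the arithmetic behind the last rule can be worked through in a few lines. This is a sketch with hypothetical function names; it mirrors the `1 - (1 - availability) / (1 - target)` expression above and, when converting the budget to minutes, assumes a full outage with evenly distributed traffic.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - target) * window_days * 24 * 60


def error_budget_remaining(availability: float, target: float = 0.999) -> float:
    """Fraction of budget left: 1.0 = untouched, 0.0 = exhausted, < 0 = SLO breached."""
    return 1 - (1 - availability) / (1 - target)


print(round(allowed_downtime_minutes(0.999), 1))  # 43.2
print(round(error_budget_remaining(0.9995), 3))   # 0.5
print(round(error_budget_remaining(0.998), 3))    # -1.0
```

A negative result means the window has already spent more than its whole budget, which corresponds to the "Exhausted" row of an error budget policy.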

Step 3 — Error budget policy (human agreement)

Budget remaining	Engineering action
> 50%	Normal feature velocity
25–50%	Extra caution; canary required; fewer risky changes
< 25%	Freeze non-critical releases; reliability work prioritized
Exhausted (SLO breach)	Postmortem; no major launches until recovery plan executed

Share this table with product—SLOs are a negotiation tool, not only ops trivia.
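
The table can also be encoded so that tooling and humans read the same policy. The sketch below is one possible encoding; the function name is hypothetical, and the boundary handling (exactly 25% and exactly 50% both fall in the "extra caution" band) is an assumption, since the table does not say which side the edges belong to.

```python
def budget_policy(remaining: float) -> str:
    """Map remaining error budget (a fraction, e.g. 0.4 = 40%) to the agreed action."""
    if remaining <= 0:
        return "Postmortem; no major launches until recovery plan executed"
    if remaining < 0.25:
        return "Freeze non-critical releases; reliability work prioritized"
    if remaining <= 0.50:
        return "Extra caution; canary required; fewer risky changes"
    return "Normal feature velocity"


for value in (0.8, 0.4, 0.1, -0.2):
    print(f"{value:+.0%}: {budget_policy(value)}")
```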

Step 4 — Burn-rate alerts (symptom-based)

Google SRE multi-window approach: page when budget burns too fast in a short window, ticket when it smolders in a long window.
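
In the multi-window pattern, each alert pairs a long window (how much budget has been spent) with a short window (is it still burning right now) and fires only when both exceed the same burn rate. The sketch below shows that condition in isolation; the function names are hypothetical, and the 1h/5m pairing and the 14.4 factor are example values for a 99.9% / 30-day SLO.

```python
def burn_rate(error_ratio: float, slo_target: float = 0.999) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1 - slo_target)


def should_page(long_window_ratio: float, short_window_ratio: float,
                factor: float = 14.4, slo_target: float = 0.999) -> bool:
    """Fire only when both windows burn faster than the factor."""
    return (burn_rate(long_window_ratio, slo_target) > factor
            and burn_rate(short_window_ratio, slo_target) > factor)


print(should_page(0.02, 0.03))   # True: 1h and 5m ratios are both above 1.44%
print(should_page(0.02, 0.001))  # False: errors already stopped in the last 5m
print(should_page(0.001, 0.05))  # False: short blip, little budget spent so far
```

The short window is what lets the alert resolve quickly after a fix, and the long window is what keeps a brief spike from paging anyone.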

# rules/alerts-slo-checkout.yaml
groups:
  - name: slo_burn_checkout
    rules:
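      # Fast burn: 14.4x budget rate over 1h (~2% of a 30d budget), confirmed by the 5m window → page
      - alert: CheckoutSLOFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="checkout-api",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="checkout-api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sli:checkout:errors_total:rate5m
            / sli:checkout:requests_total:rate5m
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
          service: checkout-api
        annotations:
          summary: "Checkout SLO burning fast (1h and 5m error ratio)"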
          runbook: "https://wiki.example/runbooks/checkout-slo"
          dashboard: "https://grafana.example/d/checkout-red"

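      # Slow burn: 6x budget rate over 6h (~5% of a 30d budget), confirmed by the 30m window → ticket
      - alert: CheckoutSLOSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{service="checkout-api",status=~"5.."}[6h]))
            / sum(rate(http_requests_total{service="checkout-api"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="checkout-api",status=~"5.."}[30m]))
            / sum(rate(http_requests_total{service="checkout-api"}[30m]))
          ) > (6 * 0.001)
        for: 15m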
        labels:
          severity: ticket
          service: checkout-api
        annotations:
          summary: "Checkout SLO elevated error rate (sustained)"
          runbook: "https://wiki.example/runbooks/checkout-slo"

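The factors (14.4, 6) follow from the SLO window and the share of budget each alert is allowed to consume within its alert window; the target only sets the 0.001 multiplier. In production, use an SLO calculator or a rule generator such as Sloth or Pyrra. The pattern matters more than memorizing constants.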
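
The constants can be reproduced from the numbers in the rule comments (2% of budget in 1h, 5% in 6h, 30-day window). The sketch below is a worked calculation with hypothetical function names, not a replacement for a generator.

```python
def burn_rate_factor(budget_fraction: float, alert_window_hours: float,
                     slo_window_days: int = 30) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def error_ratio_threshold(factor: float, slo_target: float = 0.999) -> float:
    """Error ratio the alert expression compares against."""
    return factor * (1 - slo_target)


print(round(burn_rate_factor(0.02, 1), 2))    # 14.4
print(round(burn_rate_factor(0.05, 6), 2))    # 6.0
print(round(error_ratio_threshold(14.4), 5))  # 0.0144
```

Changing the SLO window (say, to 7 days) changes the factors; changing the target changes only the threshold the factor is multiplied into.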

Step 5 — Alertmanager routing

# alertmanager.yml (simplified)
route:
  receiver: slack-info
  routes:
    - matchers:
        - severity="page"
      receiver: pagerduty-primary
      continue: true
    - matchers:
        - severity="ticket"
      receiver: slack-oncall

receivers:
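  - name: pagerduty-primary
    pagerduty_configs:
      # Alertmanager does not expand environment variables itself; render this
      # placeholder before loading the file (for example with envsubst)
      - routing_key: ${PAGERDUTY_KEY}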
  - name: slack-oncall
    slack_configs:
      - channel: "#oncall"
  - name: slack-info
    slack_configs:
      - channel: "#eng-alerts"

Every page alert includes runbook and dashboard annotations—on-call should never grep Slack for “what do I do?”
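
To check which receiver an alert ends up at, it can help to model the routing tree. The sketch below is a simplified model of Alertmanager's depth-first route matching, assuming equality-only matchers (real matchers also support regex and negation); the `Route` class and `receivers_for` function are hypothetical names for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    matchers: dict[str, str] = field(default_factory=dict)  # label -> required value
    continue_: bool = False                                  # mirrors `continue:`
    routes: list[Route] = field(default_factory=list)

    def matches(self, labels: dict[str, str]) -> bool:
        return all(labels.get(k) == v for k, v in self.matchers.items())


def receivers_for(route: Route, labels: dict[str, str]) -> list[str]:
    """Walk the tree depth-first; fall back to the parent only if no child matched."""
    if not route.matches(labels):
        return []
    found: list[str] = []
    for child in route.routes:
        hit = receivers_for(child, labels)
        found.extend(hit)
        if hit and not child.continue_:
            break  # first match wins unless the child sets continue
    return found or [route.receiver]


tree = Route(
    receiver="slack-info",
    routes=[
        Route("pagerduty-primary", {"severity": "page"}, continue_=True),
        Route("slack-oncall", {"severity": "ticket"}),
    ],
)
print(receivers_for(tree, {"severity": "page"}))    # ['pagerduty-primary']
print(receivers_for(tree, {"severity": "ticket"}))  # ['slack-oncall']
print(receivers_for(tree, {"severity": "info"}))    # ['slack-info']
```

Under this model, `continue: true` on the page route has no visible effect, because no later sibling matches a page; in particular, a page does not also land in the root `slack-info` receiver. If pages should be mirrored to Slack, a second sibling route matching `severity="page"` would be needed.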

Step 6 — Runbook template (one page)

# Runbook: Checkout SLO burn

## Impact
Customers cannot complete checkout; revenue at risk.

## First 5 minutes
1. Open Grafana RED dashboard (link in alert).
2. Check recent deploys — GitHub Actions / Argo CD last 2h.
3. Loki: `{app="checkout-api"} | json | level="error"` last 15m.
4. Tempo: slow traces > 2s for service checkout-api.

## Mitigations
- Roll back deployment: `kubectl rollout undo deploy/checkout-api -n prod`
- Scale: `kubectl scale deploy/checkout-api -n prod --replicas=8`

## Escalation
- #checkout-oncall Slack → payments platform lead if > 30m

## Post-incident
- Postmortem if budget burn > 10%; file ticket for missing test.

Step 7 — On-call expectations

Primary acknowledges pages within 5 minutes; investigates using runbook.
Secondary backs up if primary does not ack in 10 minutes.
Business hours: severity=ticket alerts go to Slack only and never wake anyone up.
Handoff: log open incidents in the status doc at shift change.

Rotation tools: PagerDuty, Opsgenie, Grafana OnCall—process is the same everywhere.

Step 8 — Tie SLO to CI/CD gates

After deploy to staging, automated checks before prod (environment gates):

Smoke test HTTP 200 on /health.
Query Prometheus: sli:checkout:availability:ratio5m > 0.995 for 10 minutes (canary period).
Block the prod GitHub Environment approval if the error budget recording rule reports 25% or less remaining.

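# GitHub Actions snippet (conceptual)
- name: Check error budget
  run: |
    set -euo pipefail
    # -r strips the JSON quotes; -e makes jq fail when the result is empty (null)
    BUDGET=$(curl -sfG http://prometheus/api/v1/query \
      --data-urlencode 'query=sli:checkout:error_budget:remaining30d' \
      | jq -er '.data.result[0].value[1]')
    # b+0 forces a numeric comparison; the step fails at or below 25% remaining
    awk -v b="$BUDGET" 'BEGIN { exit !(b+0 > 0.25) }'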
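
If the gate grows beyond a one-liner, the same decision can be written as a small function that takes the parsed JSON of a Prometheus instant query. This is a sketch: the function name is hypothetical, and it deliberately fails closed, treating missing data, errors, and NaN as "do not deploy".

```python
def budget_allows_deploy(response: dict, minimum: float = 0.25) -> bool:
    """True only when the instant query returned a numeric budget above `minimum`."""
    if response.get("status") != "success":
        return False
    result = response.get("data", {}).get("result", [])
    if not result:
        return False  # missing data blocks the deploy instead of passing silently
    try:
        budget = float(result[0]["value"][1])  # sample values arrive as strings
    except (KeyError, IndexError, TypeError, ValueError):
        return False
    return budget > minimum  # NaN compares False, so it also blocks


healthy = {
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {}, "value": [1700000000, "0.87"]}]},
}
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}

print(budget_allows_deploy(healthy))  # True
print(budget_allows_deploy(empty))    # False
```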

Step 9 — Incident workflow (metrics → logs → traces)

Page fires on burn rate—open dashboard from annotation.
Confirm user impact (support tickets, synthetic check).
Logs with trace_id from error sample (logs guide).
Trace waterfall for slow/failed path (tracing guide).
Mitigate, then hold a postmortem if the error budget took a significant hit.

Step 10 — What not to page on

Single pod restart with passing readiness.
CPU > 70% with stable latency and error rate.
Certificate expiry < 30 days (ticket reminder instead).
Any alert without an owner team label.
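
Rules like these are easier to keep when a check enforces them in CI. The sketch below lints one alert rule that has already been parsed into a dict (for example from the rules YAML); the function name and the `team` label name are assumptions, so substitute whatever owner label your organization uses.

```python
REQUIRED_PAGE_ANNOTATIONS = ("runbook", "dashboard")


def lint_alert(rule: dict, owner_label: str = "team") -> list[str]:
    """Return a list of problems; an empty list means the rule passes."""
    problems = []
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    if owner_label not in labels:
        problems.append(f"missing owner label '{owner_label}'")
    if labels.get("severity") not in ("page", "ticket"):
        problems.append("severity must be 'page' or 'ticket'")
    if labels.get("severity") == "page":
        for key in REQUIRED_PAGE_ANNOTATIONS:
            if key not in annotations:
                problems.append(f"page alert missing '{key}' annotation")
    return problems


rule = {
    "alert": "CheckoutSLOFastBurn",
    "labels": {"severity": "page", "service": "checkout-api"},
    "annotations": {"summary": "Checkout SLO burning fast",
                    "runbook": "https://wiki.example/runbooks/checkout-slo"},
}
print(lint_alert(rule))
# ["missing owner label 'team'", "page alert missing 'dashboard' annotation"]
```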

Step 11 — Troubleshooting alerting

Symptom	Likely cause or fix
Alert never fires	Rule not loaded; `for:` too long; PromQL divides by zero
Flapping page	Increase `for:`; fix underlying deploy; tune burn factor
Page storm	`group_by` in Alertmanager; inhibit rules (node down inhibits pod alerts)
SLO always green but users angry	Wrong SLI—measure success path, not /health only

Step 12 — Anti-patterns

100+ alerts per service with no severity model.
SLO target of 100%: this leaves zero error budget, so teams ignore it.
Runbooks that only say “check Grafana.”
Paging on infrastructure causes instead of user symptoms.
Skipping postmortem when budget is burned—repeat incident next month.

Interview phrase: “We define an availability SLI in Prometheus, set a 99.9% thirty-day SLO, page on multi-window burn rates with runbook links, route through Alertmanager to PagerDuty, and slow feature work when error budget drops below twenty-five percent.”

The one line to remember

SLOs turn metrics into policy—burn-rate alerts wake the right person, runbooks tell them what to do, error budget tells product when to stop shipping.

Observability track — complete

Full path: Explained → Metrics → Logs → Tracing → this guide. The DevOps hub tracks are now complete from Docker through observability.