Observability explained

When production misbehaves, “the server is down” is not enough—you need to know which dependency failed, for whom, and since when. Observability is the practice of instrumenting systems so you can answer novel questions from external signals: metrics, logs, and traces, grounded in SLOs that reflect user experience.

Helpful background: Kubernetes debugging (kubectl triage) and DevSec Core (deploy gates that can consult health signals).

Step 1 — Monitoring vs observability

| Monitoring | Observability |
| --- | --- |
| Known failure modes, predefined dashboards and alerts | Explore unknowns: “why is checkout slow only in EU?” |
| “Is CPU high?” | “Which span in the payment service added 800ms?” |
| Threshold alerts on metrics you already chart | High-cardinality dimensions (user tier, region, feature flag) |

You need both: monitoring catches regressions quickly; observability tooling lets engineers debug without shipping new printf statements.

Step 2 — The three pillars

Metrics, logs, and traces feed SLIs and SLOs, which in turn drive alerts.
Instrument all three with shared identifiers so an engineer can jump from an alert to the matching logs and traces in one click.

Metrics

Numeric time series—cheap to aggregate, ideal for dashboards and alerting.

Logs

Discrete events with context—best for “what happened to order 9182?” Prefer structured JSON over grep-friendly plain text in production.
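
As a rough illustration of structured logging, the sketch below uses only Python's standard logging module to emit one JSON object per line. The formatter class, the logger name "payments", and the list of context fields are assumptions made for this example, not part of any particular framework.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical set of context fields; adjust to your own schema.
    CONTEXT_FIELDS = ("trace_id", "request_id", "order_id", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(name: str = "payments") -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("payment captured", extra={"order_id": "ord_9182", "duration_ms": 142})
```

Because every line is a self-describing object, a log backend can filter on order_id directly instead of pattern-matching free text.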

Traces

End-to-end request paths across services—each span is one unit of work (HTTP handler, DB query). Traces explain latency composition.
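
To make "latency composition" concrete, here is a minimal sketch that models spans as plain Python objects and computes each span's self time (its duration minus its direct children). The Span class, the span names, and the timings are all hypothetical, and the calculation assumes sibling spans do not overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans: list[Span]) -> dict[str, float]:
    """Time spent in each span excluding its direct children.

    Assumes children of the same parent do not overlap in time.
    """
    child_total: dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


# Hypothetical trace for one checkout request.
spans = [
    Span("a", None, "POST /checkout", 0, 1000),
    Span("b", "a", "payment.capture", 50, 900),
    Span("c", "b", "db.insert_payment", 60, 110),
    Span("d", "b", "http.call_bank", 110, 890),
]
breakdown = self_times(spans)
slowest = max(breakdown, key=breakdown.get)  # span with the largest self time
```

A tracing backend does this kind of attribution for you; the point of the sketch is only that parent/child span IDs are what make the breakdown possible.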

Step 3 — Golden signals (Google SRE)

For user-facing services, chart these four:

  1. Latency — time to serve a request (distinguish success vs error latency).
  2. Traffic — demand (requests/sec, messages/sec).
  3. Errors — rate of failed requests (HTTP 5xx, exceptions).
  4. Saturation — how “full” the service is (CPU, memory, thread pool queue).

The first three map onto the RED method for request-driven services: Rate (traffic), Errors, Duration (latency). For resources such as nodes and datastores, use the USE method: Utilization, Saturation, Errors.
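
The sketch below shows one way the rate, error, and duration signals could be derived from raw request records. The Request type, the nearest-rank percentile helper, and the sample data are assumptions for illustration; saturation is left out because it comes from resource gauges rather than per-request data.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int
    duration_ms: float


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; `values` must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Rate, error ratio, and p95 latency of successful requests."""
    ok_durations = [r.duration_ms for r in requests if r.status < 500]
    failed = [r for r in requests if r.status >= 500]
    return {
        "rate_rps": len(requests) / window_s,
        "error_ratio": len(failed) / len(requests),
        # Success latency is kept separate so fast failures do not flatter it.
        "p95_ok_ms": percentile(ok_durations, 95),
    }


# Hypothetical 10-second window: 19 successes and one slow 503.
sample = [Request(200, 100.0 + i) for i in range(19)] + [Request(503, 2000.0)]
summary = red_summary(sample, window_s=10)
```

In practice these numbers usually come from counters and histograms in a metrics backend rather than from raw records, but the definitions are the same.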

Step 4 — SLI, SLO, error budget

| Term | Definition | Example |
| --- | --- | --- |
| SLI (indicator) | Measurable aspect of reliability | “Ratio of HTTP requests that succeed in under 500 ms” |
| SLO (objective) | Target for an SLI over a window | “99.9% of requests succeed over a rolling 30-day window” |
| SLA (agreement) | Contract with customers (legal/financial) | “99.95% uptime or credits” |
| Error budget | Allowed unreliability before SLO breach | 0.1% of 30 days ≈ 43 minutes of downtime |

availability_sli   = successful_requests / total_requests
slo_target         = 0.999   # 99.9% over a 30-day rolling window
error_budget       = 1 - slo_target   # allowed failure ratio: 0.001
error_budget_left  = error_budget - (failed_requests / total_requests)   # same window

When the budget is exhausted, freeze risky releases and focus on reliability—product and engineering align on tradeoffs.
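
The arithmetic can be checked with a short sketch. The ErrorBudget class and its method names are hypothetical; the request counts in the usage lines are made-up inputs, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float   # e.g. 0.999 for 99.9%
    window_days: int    # e.g. 30

    @property
    def allowed_failure_ratio(self) -> float:
        return 1.0 - self.slo_target

    def allowed_downtime_minutes(self) -> float:
        """Budget expressed as full-outage minutes over the window."""
        return self.allowed_failure_ratio * self.window_days * 24 * 60

    def remaining_fraction(self, total: int, failed: int) -> float:
        """Share of the budget still unspent (negative once the SLO is breached)."""
        if total == 0:
            return 1.0
        allowed_failures = self.allowed_failure_ratio * total
        return 1.0 - failed / allowed_failures

    def release_allowed(self, total: int, failed: int) -> bool:
        """Simple gate: risky releases pause once the budget is spent."""
        return self.remaining_fraction(total, failed) > 0


budget = ErrorBudget(slo_target=0.999, window_days=30)
downtime = round(budget.allowed_downtime_minutes(), 1)            # 43.2
left = round(budget.remaining_fraction(1_000_000, 250), 2)        # 0.75
```

With one million requests in the window, a 99.9% target allows about 1,000 failures, so 250 failures leave roughly three quarters of the budget.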

Step 5 — One request, three signals (correlation)

Generate a trace_id at the edge (ingress / API gateway) and propagate it:

{
  "level": "info",
  "msg": "payment captured",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "request_id": "req_8f2a",
  "order_id": "ord_9182",
  "duration_ms": 142
}

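HTTP header convention: traceparent (the W3C Trace Context standard) or the de facto X-Request-Id. In Kubernetes, sidecars or the app SDK can attach the same ID to logs and spans. On metrics, attach it as an exemplar rather than a label, because a per-request ID has unbounded cardinality.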
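
As a sketch of what "generate at the edge and propagate" can look like, the helpers below mint and validate a version-00 traceparent value (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The function names are hypothetical, the headers dict is assumed to use lower-case keys, and later header versions are deliberately not handled.

```python
import re
import secrets
from typing import Optional

# Version-00 layout only: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def trace_id_from(header: Optional[str]) -> Optional[str]:
    """Return the trace ID from a valid header, else None."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, _flags = match.groups()
    # All-zero IDs are defined as invalid by the Trace Context spec.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id


def ensure_trace_context(headers: dict[str, str]) -> dict[str, str]:
    """Edge behaviour: keep a valid incoming traceparent, otherwise mint one."""
    out = dict(headers)
    if trace_id_from(out.get("traceparent")) is None:
        out["traceparent"] = new_traceparent()
    return out
```

Downstream services would forward the header unchanged apart from the parent span ID and write the extracted trace ID into every log line, which is what lets the JSON record above be joined to its trace.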

Step 6 — Where observability sits in the platform

Developer → CI build/test → deploy to K8s → metrics/logs/traces export
                                      ↓
                              Prometheus / Loki / Tempo (or vendor)
                                      ↓
                              Grafana dashboards + alert routes → PagerDuty/Slack

Step 7 — Cardinality and cost

Every unique label combination is a new time series. Safe labels: service, route, status_class (2xx/5xx). Risky labels: user_id, email. These explode storage and can leak PII.

Rule of thumb: If a label has more than ~10–50 values in steady state, do not use it on high-frequency metrics—log or trace it instead.
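
A quick back-of-the-envelope check shows why one unbounded label matters. The helper names and the per-label value counts below are hypothetical; the product is an upper bound, since not every combination necessarily occurs.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of label sizes."""
    return prod(label_cardinalities.values())


def labels_over_limit(label_cardinalities: dict[str, int], limit: int = 50) -> list[str]:
    """Labels whose value count exceeds the rule-of-thumb ceiling."""
    return sorted(name for name, size in label_cardinalities.items() if size > limit)


# Hypothetical steady-state value counts per label.
safe = {"service": 20, "route": 30, "status_class": 5}
risky = {**safe, "user_id": 100_000}

safe_series = series_count(safe)     # 20 * 30 * 5 = 3,000
risky_series = series_count(risky)   # 3,000 * 100,000 = 300,000,000
flagged = labels_over_limit(risky)   # ["user_id"]
```

Adding user_id multiplies the series count by the number of users, which is why that dimension belongs in logs or trace attributes instead.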

Step 8 — Alerting philosophy

Page humans for symptoms (SLO burn, user-visible errors), not for every possible cause (CPU at 70%).

Burn-rate alerts (multi-window) detect fast budget consumption without noisy single-threshold pages—covered in the SLO guide coming next on this track.
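
As a preview, a multi-window burn-rate check can be sketched in a few lines. Burn rate here means the observed error ratio divided by the ratio the SLO allows; the function names are hypothetical, and the 14.4 threshold is an assumed example (it corresponds to spending 2% of a 30-day budget in one hour), not a value taken from this guide.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float = 0.999,
    threshold: float = 14.4,  # assumed example threshold
) -> bool:
    """Page only if both windows burn fast.

    The long window (for example 1 hour) shows the burn is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


ongoing_incident = should_page(0.02, 0.03)      # both windows burning fast
already_recovered = should_page(0.02, 0.0005)   # short window is healthy again
brief_blip = should_page(0.001, 0.05)           # long window barely affected
```

Requiring both windows is what suppresses pages for short blips and for incidents that have already recovered.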

Step 9 — Tool landscape (pick one stack)

| Concern | Open-source stack | Managed examples |
| --- | --- | --- |
| Metrics | Prometheus + Grafana | Datadog, CloudWatch, Azure Monitor |
| Logs | Loki, OpenSearch | Splunk, Datadog Logs |
| Traces | Tempo, Jaeger (OTel collector) | Honeycomb, Datadog APM |
| Instrumentation | OpenTelemetry SDK | Vendor agents |

OpenTelemetry is the neutral API—export to whichever backend your org standardizes on.

Step 10 — Anti-patterns

Step 11 — What to learn next on this track

Interview phrase: “We instrument RED metrics per service, structured logs with trace_id, distributed tracing on critical paths, and SLOs with error budgets—alerts page on symptom-based burn rates, and deploy gates respect remaining budget.”

The one line to remember

Observability is metrics + logs + traces tied together by request context, judged against SLOs—so on-call answers “what broke for customers?” not “is a graph green?”