Observability explained

When production misbehaves, “the server is down” is not enough—you need to know which dependency failed, for whom, and since when. Observability is the practice of instrumenting systems so you can answer novel questions from external signals: metrics, logs, and traces, grounded in SLOs that reflect user experience.

Helpful background: Kubernetes debugging (kubectl triage) and DevSec Core (deploy gates that can consult health signals).

After reading, you should be able to:

Distinguish monitoring from observability and name the three pillars.
Use RED and USE methods to pick the right metrics.
Define SLI, SLO, and error budget in plain language.
Correlate logs and traces with a shared trace_id / request_id.
Place observability in the path from deploy → prod → on-call.

Step 1 — Monitoring vs observability

Monitoring	Observability
Known failure modes, predefined dashboards and alerts	Explore unknowns—“why is checkout slow only in EU?”
“Is CPU high?”	“Which span in the payment service added 800ms?”
Threshold alerts on metrics you already chart	High-cardinality dimensions (user tier, region, feature flag)

You need both: monitoring catches regressions quickly; observability tooling lets engineers debug without shipping new printf statements.
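To make the contrast concrete, here is a minimal sketch of an exploratory question ("is checkout slow everywhere, or only in one region?") answered by slicing raw request events instead of reading a predefined chart. The event fields and sample values are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical request events; in practice these would come from traces
# or wide structured logs.
events = [
    {"route": "/checkout", "region": "eu", "tier": "free", "duration_ms": 910},
    {"route": "/checkout", "region": "eu", "tier": "pro", "duration_ms": 870},
    {"route": "/checkout", "region": "us", "tier": "free", "duration_ms": 120},
    {"route": "/checkout", "region": "us", "tier": "pro", "duration_ms": 140},
]

def median_latency_by(events, dimension):
    """Group events by an arbitrary dimension and return the median latency per group."""
    groups = defaultdict(list)
    for event in events:
        groups[event[dimension]].append(event["duration_ms"])
    return {key: median(values) for key, values in groups.items()}

by_region = median_latency_by(events, "region")
by_tier = median_latency_by(events, "tier")
print(by_region)
print(by_tier)
```

Slicing by region separates the slow group from the fast one, while slicing by tier shows no meaningful difference. The dimension is chosen at query time, which is the part a fixed dashboard cannot do.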

Step 2 — The three pillars

Instrument all three with shared identifiers so an alert jumps to logs and traces in one click.

Metrics

Numeric time series—cheap to aggregate, ideal for dashboards and alerting.

Counter — only goes up (http_requests_total).
Gauge — up or down (queue_depth, memory_bytes).
Histogram — distribution of values (latency percentiles).
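The three metric types can be sketched in a few lines. This is a toy in-memory model written for illustration, not the API of any real client library; the class and method names are assumptions.

```python
import bisect

class Counter:
    """Monotonic count; it can only be incremented."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """Current value that may rise or fall."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into buckets with fixed upper bounds.

    Buckets here hold per-bucket counts to keep the sketch simple;
    Prometheus exposes cumulative bucket counts instead.
    """
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the overflow bucket

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

http_requests_total = Counter()
queue_depth = Gauge()
latency_seconds = Histogram([0.1, 0.5, 1.0])

for seconds in (0.05, 0.3, 0.3, 2.0):
    http_requests_total.inc()
    latency_seconds.observe(seconds)
queue_depth.set(7)
```

A histogram stores bucket counts rather than raw values, so percentiles read from it are estimates whose precision depends on where the bucket bounds sit.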

Logs

Discrete events with context—best for “what happened to order 9182?” Prefer structured JSON over grep-friendly plain text in production.
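One way to emit structured logs is a small JSON formatter on top of Python's standard logging module. This is a sketch under assumptions: the `fields` key passed through `extra` and the logger name `payments` are conventions invented here, not a standard.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Values passed via `extra={"fields": {...}}` become record.fields.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "payment captured",
    extra={"fields": {"order_id": "ord_9182", "duration_ms": 142}},
)
```

Because every line is a JSON object, a log backend can filter on `order_id` directly instead of pattern-matching free text.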

Traces

End-to-end request paths across services—each span is one unit of work (HTTP handler, DB query). Traces explain latency composition.
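The idea of spans as nested, timed units of work can be sketched with a context manager. This is an illustrative toy, not the OpenTelemetry API; the `span` helper and the `finished_spans` list are hypothetical names.

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []  # a real tracer would export these to a backend

@contextmanager
def span(name, trace_id, parent_id=None):
    """Time one unit of work and record it as a span when it finishes."""
    span_id = uuid.uuid4().hex[:16]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        finished_spans.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

trace_id = uuid.uuid4().hex
with span("POST /checkout", trace_id) as root_id:
    with span("db.query", trace_id, parent_id=root_id):
        time.sleep(0.01)  # stand-in for real work
```

The child span finishes first and points at its parent through `parent_id`. Comparing the two durations shows how much of the handler's latency the database call accounts for.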

Step 3 — Golden signals (Google SRE)

For user-facing services, chart these four:

Latency — time to serve a request (distinguish success vs error latency).
Traffic — demand (requests/sec, messages/sec).
Errors — rate of failed requests (HTTP 5xx, exceptions).
Saturation — how “full” the service is (CPU, memory, thread pool queue).

The first three map onto the RED method for services: Rate (traffic), Errors, Duration (latency). RED has no saturation term, so chart saturation separately. For nodes and datastores use USE: Utilization, Saturation, Errors.
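As a rough sketch, the three RED numbers can be derived from a window of request records. The sample data and the 10-second window are hypothetical, and the percentile uses the simple nearest-rank definition rather than any particular backend's estimator.

```python
import math

WINDOW_SECONDS = 10  # hypothetical observation window

# Hypothetical (status, duration_ms) pairs seen by one service in that window.
observed = [
    (200, 40), (200, 45), (200, 50), (200, 55), (200, 60),
    (200, 65), (200, 70), (200, 80), (200, 120), (500, 900),
]

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * pct / 100)
    return ordered[max(rank, 1) - 1]

# Rate: requests per second
rate = len(observed) / WINDOW_SECONDS
# Errors: fraction of requests that failed with a 5xx status
errors = sum(1 for status, _ in observed if status >= 500) / len(observed)
# Duration: latency percentiles
durations = [duration for _, duration in observed]
p50 = percentile(durations, 50)
p95 = percentile(durations, 95)

print(rate, errors, p50, p95)
```

In this sample the median looks healthy while the tail is dominated by the single failing request, which is one reason to chart success and error latency separately.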

Step 4 — SLI, SLO, error budget

Term	Definition	Example
SLI (indicator)	Measurable aspect of reliability	“Ratio of HTTP requests that succeed in under 500 ms”
SLO (objective)	Target for an SLI over a window	“99.9% of requests succeed over a rolling 30-day window”
SLA (agreement)	Contract with customers (legal/financial)	“99.95% uptime or credits”
Error budget	Allowed unreliability before SLO breach	0.1% of 30 days ≈ 43 minutes downtime

availability_sli = successful_requests / total_requests
slo_target         = 0.999   # 99.9% over 30d rolling window
error_budget_left  = (1 - slo_target) - (failed_requests / total_requests)   # both terms are fractions of requests in the window

When the budget is exhausted, freeze risky releases and focus on reliability. Because the budget is agreed in advance, product and engineering share one rule for trading feature velocity against stability.
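The arithmetic behind the table and the formulas can be worked through in a short sketch. The function name and the request counts are hypothetical; the 99.9% target and 30-day window come from the examples above.

```python
def error_budget_report(total_requests, failed_requests, slo_target=0.999):
    """Summarize SLI and error budget for one window, all as fractions of requests."""
    if total_requests <= 0:
        raise ValueError("need at least one request to compute an SLI")
    allowed_failure_ratio = 1 - slo_target
    observed_failure_ratio = failed_requests / total_requests
    return {
        "availability_sli": 1 - observed_failure_ratio,
        # 1.0 means the whole budget is spent; above 1.0 the SLO is breached
        "budget_consumed": observed_failure_ratio / allowed_failure_ratio,
        "budget_left": allowed_failure_ratio - observed_failure_ratio,
    }

# Time-based view of the same budget: 0.1% of a 30-day window, in minutes.
window_minutes = 30 * 24 * 60                      # 43,200 minutes
downtime_budget_minutes = window_minutes * (1 - 0.999)

report = error_budget_report(total_requests=1_000_000, failed_requests=400)
print(report)
print(round(downtime_budget_minutes, 1))
```

With 400 failures in a million requests, about 40% of the budget is spent. A negative `budget_left` would mean the SLO is already breached for the window.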

Step 5 — One request, three signals (correlation)

Generate a trace_id at the edge (ingress / API gateway) and propagate it:

{
  "level": "info",
  "msg": "payment captured",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "request_id": "req_8f2a",
  "order_id": "ord_9182",
  "duration_ms": 142
}

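HTTP header convention: traceparent (W3C Trace Context) or X-Request-Id. In Kubernetes, sidecars or the app SDK attach the same ID to logs and spans. Do not put a per-request ID in metric labels, because every request would create a new time series; where the backend supports exemplars, use them to link a metric sample to a trace.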
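Propagation with the traceparent header can be sketched as parse on the way in, re-emit on the way out. The helper names are hypothetical, and the parser handles only the version 00 layout (version, trace-id, parent-id, flags) from W3C Trace Context.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)

def child_traceparent(trace_id, sampled=True):
    """Build the header for an outgoing call: same trace_id, fresh span id."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
trace_id, parent_id, sampled = parse_traceparent(incoming)
outgoing = child_traceparent(trace_id, sampled)
print(outgoing)
```

The trace_id stays constant across hops while each service contributes its own span id, so the same value can be written into every log line for the request.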

Step 6 — Where observability sits in the platform

Developer → CI build/test → deploy to K8s → metrics/logs/traces export
                                      ↓
                              Prometheus / Loki / Tempo (or vendor)
                                      ↓
                              Grafana dashboards + alert routes → PagerDuty/Slack

CI/CD — smoke tests after deploy; optional “canary SLO” gate before full promotion (environment gates).
Kubernetes — cAdvisor/node metrics, pod logs, liveness vs SLO (a passing probe ≠ happy users).
Infrastructure — cloud metrics (ALB 5xx, RDS CPU) complement app signals (platform layer).
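The optional canary gate mentioned above could look roughly like the following. This is a sketch, not a feature of any specific CI system: the function name, the minimum-traffic threshold, and the choice to compare the canary's error ratio directly against the SLO's allowed failure ratio are all assumptions.

```python
def canary_gate(canary_total, canary_failed, slo_target=0.999, min_requests=500):
    """Decide whether a canary may be promoted, returning (promote, reason)."""
    if canary_total < min_requests:
        # Too little traffic: a clean result would not mean much yet.
        return False, "not enough traffic to judge"
    error_ratio = canary_failed / canary_total
    if error_ratio > 1 - slo_target:
        return False, f"error ratio {error_ratio:.4f} exceeds SLO allowance"
    return True, "within SLO"

print(canary_gate(1000, 0))
print(canary_gate(1000, 5))
print(canary_gate(100, 0))
```

The gate judges the canary on the same user-facing signal as the SLO rather than on probe status, in line with the point that a passing probe does not guarantee happy users.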

Step 7 — Cardinality and cost

Every unique label combination is a new time series. Safe labels: service, route, status_class (2xx/5xx). Risky labels: user_id, email. They explode storage and leak PII.

Rule of thumb: If a label has more than ~10–50 values in steady state, do not use it on high-frequency metrics—log or trace it instead.
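The multiplication behind that rule is easy to check. The label counts below are hypothetical, and the result is an upper bound: the real series count is the number of combinations actually observed.

```python
from math import prod

def max_series(distinct_values_per_label):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(distinct_values_per_label.values())

safe = {"service": 3, "route": 10, "status_class": 4}
risky = {**safe, "user_id": 100_000}  # one extra unbounded label

print(max_series(safe))
print(max_series(risky))
```

Adding a single user-level label turns a metric with at most 120 series into one with up to 12 million, which is why such identifiers belong in logs or traces.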

Step 8 — Alerting philosophy

Page humans for symptoms (SLO burn, user-visible errors), not for every possible cause (CPU at 70%). A good alert is:

Actionable — someone knows the first debugging step.
Linked — dashboard + runbook URL in the notification.
Rare — if it fires weekly and is ignored, delete or fix the threshold.

Burn-rate alerts (multi-window) detect fast budget consumption without noisy single-threshold pages—covered in the SLO guide coming next on this track.
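As a preview, the core of a multi-window burn-rate check fits in a few lines. The function names are hypothetical. The 14.4 threshold is an assumption borrowed from a commonly cited pairing of a 1-hour and a 5-minute window: at that rate, 2% of a 30-day budget is spent in one hour.

```python
def burn_rate(error_ratio, slo_target=0.999):
    """How many times faster than 'exactly on budget' the service is failing."""
    return error_ratio / (1 - slo_target)

def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target=0.999, threshold=14.4):
    """Page only when both the long and the short window burn faster than the threshold."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )

print(should_page(0.02, 0.02))    # sustained and still happening
print(should_page(0.02, 0.0005))  # long window is high, but the problem has stopped
print(should_page(0.001, 0.02))   # brief spike, not yet sustained
```

Requiring both windows keeps a short blip from paging anyone and lets the alert clear quickly once the error rate recovers.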

Step 9 — Tool landscape (pick one stack)

Concern	Open-source stack	Managed examples
Metrics	Prometheus + Grafana	Datadog, CloudWatch, Azure Monitor
Logs	Loki, OpenSearch	Splunk, Datadog Logs
Traces	Tempo, Jaeger (OTel collector)	Honeycomb, Datadog APM
Instrumentation	OpenTelemetry SDK	Vendor agents

OpenTelemetry is the vendor-neutral instrumentation layer (API, SDKs, and the OTLP wire protocol), so you can export to whichever backend your org standardizes on.

Step 10 — Anti-patterns

Alerting on CPU > 80% without tying to user pain.
Unstructured logs you cannot query in an incident.
No trace propagation across async queues or internal RPCs.
SLOs written but never measured—dashboard fiction.
On-call runbooks that say “check the logs” with no link or query.

Step 11 — What to learn next on this track

Metrics & Prometheus — instrumentation, scrape configs, PromQL, recording rules, RED dashboards.
Logs & centralized logging — JSON schema, Promtail, Loki, LogQL, retention, K8s collection.
Distributed tracing — OpenTelemetry, OTLP, Tempo, traceparent propagation, sampling.
SLOs, alerting & on-call — burn-rate alerts, Alertmanager, runbooks, error-budget deploy gates.

Interview phrase: “We instrument RED metrics per service, structured logs with trace_id, distributed tracing on critical paths, and SLOs with error budgets—alerts page on symptom-based burn rates, and deploy gates respect remaining budget.”

The one line to remember

Observability is metrics + logs + traces tied together by request context, judged against SLOs—so on-call answers “what broke for customers?” not “is a graph green?”