Production systems

Real outages rarely start with “fix line 42.” They start with symptoms—OOM, stuck threads, stale cache, duplicate charges. Each guide in this track answers the same three questions: Why does this happen in production? What do you check first (in order)? How do you fix it now and prevent it next time?

The Why · What · How framework

Why

Name the root-cause category (memory pressure, lock contention, config drift, missing backpressure) rather than assigning blame. Tie the symptom to a mechanism: an OOM, for example, may be heap exhaustion, a leak, or an oversized payload.

What

Work through an ordered checklist: metrics → logs → thread dump → heap dump → recent deploy → dependency health. Capture whatever a restart would destroy before you restart.
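As one way to capture thread state before a restart wipes it, the sketch below takes an in-process thread dump through the standard `java.lang.management` API. The class name `EvidenceSnapshot` is hypothetical, and writing the result to durable storage is left out. From outside the process, `jcmd <pid> Thread.print` and `jcmd <pid> GC.heap_dump <file>` cover the thread-dump and heap-dump steps.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: snapshots every live thread so the evidence survives a restart.
final class EvidenceSnapshot {

    /** Returns a plain-text dump of all live threads with their full stack traces. */
    static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip locked-monitor and locked-synchronizer details to keep the dump cheap.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append('"')
               .append(" state=").append(info.getThreadState());
            if (info.getLockName() != null) {
                // The lock this thread is currently blocked or waiting on, if any.
                out.append(" waiting on ").append(info.getLockName());
            }
            out.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(threadDump());
    }
}
```

In an incident this would typically be written to a file or log sink that outlives the process; that part depends on your environment.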

How

Immediate mitigation (scale, trip a circuit breaker, turn off the feature flag), durable fix (code, pool sizing, idempotency), and guardrails (alerts, load tests, runbooks).
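To make the idempotency fix concrete, here is a minimal sketch of deduplicating an operation by key, the kind of durable fix that addresses duplicate charges. `IdempotentExecutor` and the key format are hypothetical, and the in-memory map is an assumption for illustration: a real service would need a shared, durable store and an expiry policy.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: runs an operation at most once per idempotency key (single process only).
final class IdempotentExecutor<R> {
    private final ConcurrentMap<String, R> results = new ConcurrentHashMap<>();

    /**
     * Runs the operation the first time a key is seen and returns the stored
     * result on every retry. The operation must return a non-null result,
     * otherwise nothing is recorded and a retry would run it again.
     */
    R execute(String idempotencyKey, Supplier<R> operation) {
        // computeIfAbsent on ConcurrentHashMap is atomic per key.
        return results.computeIfAbsent(idempotencyKey, key -> operation.get());
    }

    public static void main(String[] args) {
        AtomicInteger charges = new AtomicInteger();
        IdempotentExecutor<Integer> executor = new IdempotentExecutor<>();
        // The client retries the same request; the charge should only happen once.
        executor.execute("order-1", charges::incrementAndGet);
        executor.execute("order-1", charges::incrementAndGet);
        System.out.println("charges applied: " + charges.get());
    }
}
```

The design choice here is to store the result rather than only a "seen" marker, so a retry gets the same response the first call produced.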

JVM & runtime health

Memory, garbage collection, and processes that degrade over time.
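A rough starting point for this category is to read heap usage and collector activity from inside the JVM. The sketch below uses the standard `java.lang.management` beans; the class name `HeapSnapshot` is hypothetical. As a heuristic rather than a rule, heap usage that keeps climbing across collections points toward a leak, while usage that drops back after each collection points toward load or payload size.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: one-shot summary of heap usage and GC activity.
final class HeapSnapshot {

    static String describe() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no defined maximum
        StringBuilder out = new StringBuilder();
        out.append("heap used=").append(heap.getUsed() >> 20).append(" MiB, max=")
           .append(max < 0 ? "undefined" : (max >> 20) + " MiB").append('\n');
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Counts and times are cumulative since JVM start (-1 if the collector does not report them).
            out.append(gc.getName())
               .append(": collections=").append(gc.getCollectionCount())
               .append(", total ms=").append(gc.getCollectionTime())
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Sampling this periodically and comparing snapshots is more informative than any single reading.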

CPU, threads & concurrency

High CPU with low traffic, blocked threads, deadlocks, and unsafe shared state.
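For the deadlock case specifically, the JVM can report cycles itself. The sketch below wraps `ThreadMXBean.findDeadlockedThreads()`; the class name `DeadlockProbe` is hypothetical, and how you expose the result (health endpoint, log line, alert) is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists threads the JVM considers deadlocked.
final class DeadlockProbe {

    /** Returns one description per deadlocked thread, or an empty list if none are found. */
    static List<String> deadlockedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads also covers java.util.concurrent locks, but only
        // where the JVM supports synchronizer monitoring; otherwise fall back to monitors.
        long[] ids = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        List<String> found = new ArrayList<>();
        if (ids == null) {
            return found; // null means no deadlock was detected
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info != null) { // a thread may have exited between the two calls
                found.add(info.getThreadName() + " waits on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> found = deadlockedThreads();
        System.out.println(found.isEmpty() ? "no deadlock detected" : String.join("\n", found));
    }
}
```

This only detects true lock cycles. Threads that are merely slow or blocked on I/O will not appear, so a thread dump is still needed for those.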

Production vs local & observability

“Works on my machine” failures, inconsistent logs, and tracing one request end to end.

Database & persistence

Slow queries, pool exhaustion, and data-layer bottlenecks.

Caching & consistency

Stale data, stampedes, and cache-aside pitfalls.
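One common way to blunt a stampede is to let only one caller load a missing key while the others wait for that result. The sketch below shows the idea with a map of futures. `SingleFlightCache` is a hypothetical name, and the sketch deliberately has no TTL or size bound, so entries live forever; treat it as an illustration of the load-once mechanism, not a complete cache.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: concurrent misses on the same key share a single load.
final class SingleFlightCache<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    SingleFlightCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        // Only the first caller for a key creates the future; later callers reuse it.
        CompletableFuture<V> future = entries.computeIfAbsent(
                key, k -> CompletableFuture.supplyAsync(() -> loader.apply(k)));
        try {
            return future.join();
        } catch (CompletionException e) {
            // Drop the failed load so the next caller retries instead of reusing the error.
            entries.remove(key, future);
            throw e;
        }
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        SingleFlightCache<String, String> cache = new SingleFlightCache<>(key -> {
            loads.incrementAndGet(); // stands in for an expensive database read
            return "value-for-" + key;
        });
        cache.get("user:1");
        cache.get("user:1");
        System.out.println("loader invocations: " + loads.get());
    }
}
```

The loader runs on the common fork-join pool here purely to keep the sketch short; a dedicated executor is usually a safer choice for blocking loads.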

Scaling, resilience & API correctness

Horizontal scaling that does not help, duplicate processing, and backpressure.
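Backpressure, in its simplest form, means refusing work you cannot absorb instead of queueing it without limit. The sketch below shows that with a fixed thread pool and a bounded queue. `BoundedWorker` is a hypothetical name, and the sizes in the example are arbitrary; what the caller does on rejection (return HTTP 429, shed load, retry later) is an assumption left to the service.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a worker pool that rejects tasks once its bounded queue is full.
final class BoundedWorker implements AutoCloseable {
    private final ThreadPoolExecutor pool;

    BoundedWorker(int threads, int queueCapacity) {
        pool = new ThreadPoolExecutor(
                threads, threads,                          // fixed-size pool
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // bounded: this is the backpressure point
                new ThreadPoolExecutor.AbortPolicy());     // reject instead of queueing forever
    }

    /** Returns false when the pool is saturated so the caller can shed or defer the work. */
    boolean trySubmit(Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        try (BoundedWorker worker = new BoundedWorker(2, 4)) {
            boolean accepted = worker.trySubmit(() -> System.out.println("handled"));
            System.out.println("accepted: " + accepted);
        }
    }
}
```

An unbounded queue hides overload until memory runs out; a bounded one surfaces it early as an explicit rejection the caller can act on.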