Real outages rarely start with “fix line 42.” They start with symptoms—OOM, stuck threads, stale cache, duplicate charges. Each guide in this track answers the same three questions: Why does this happen in production? What do you check first (in order)? How do you fix it now and prevent it next time?
Why: name the root-cause category (memory pressure, lock contention, config drift, missing backpressure) instead of assigning blame, and tie each symptom to its mechanism (e.g. heap exhaustion vs. leak vs. oversized payload).
What: work through an ordered checklist (metrics → logs → thread dump → heap dump → recent deploy → dependency health) and know what to capture before a restart destroys the evidence.
How: apply an immediate mitigation (scale out, trip a circuit breaker, turn off the feature flag), then a durable fix (code, pool sizing, idempotency), then guardrails (alerts, load tests, runbooks).
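The "capture before restart" step can be sketched with the JDK's own management beans. This is a minimal illustration, not a prescribed tool: the class name EvidenceCapture and its method names are hypothetical, the output directory is whatever the caller passes in, and the heap-dump method assumes a HotSpot-based JVM where com.sun.management.HotSpotDiagnosticMXBean is available.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import com.sun.management.HotSpotDiagnosticMXBean;

// Hypothetical helper: the class and method names are illustrative only.
final class EvidenceCapture {

    /** Renders every live thread with its state, contended lock, and full stack. */
    static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Ask for monitor/synchronizer details only when the JVM supports them.
        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip threads that ended during the dump
            }
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState());
            if (info.getLockName() != null) {
                out.append(" waiting on ").append(info.getLockName());
            }
            if (info.getLockOwnerName() != null) {
                out.append(" held by \"").append(info.getLockOwnerName()).append('"');
            }
            out.append('\n');
            // ThreadInfo.toString() truncates the stack, so print every frame here.
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Writes the thread dump to a timestamped file in the given directory. */
    static Path writeThreadDump(Path dir) throws IOException {
        Path file = dir.resolve("threads-" + System.currentTimeMillis() + ".txt");
        return Files.write(file, threadDump().getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Writes a heap dump (HotSpot JVMs only). The target file must not already
     * exist; liveOnly=true dumps only reachable objects, which is smaller.
     */
    static Path writeHeapDump(Path dir, boolean liveOnly) throws IOException {
        Path file = dir.resolve("heap-" + System.currentTimeMillis() + ".hprof");
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(file.toString(), liveOnly);
        return file;
    }
}
```

Taking the thread dump first is usually the cheaper step. A heap dump can pause the application and produce a file roughly the size of the heap, so it is worth confirming there is disk space before calling writeHeapDump.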
Memory, garbage collection, and processes that degrade over time.
Your application suddenly throws OutOfMemoryError. How will you debug it?
You suspect a memory leak. How will you confirm it and find the retaining path?
Frequent GC pauses are impacting latency. How will you tune and verify improvement?
The application becomes slow after a few hours of uptime. What will you check first?
Metaspace or class-loader leaks appear after many redeploys. How do you diagnose them?
High CPU with low traffic, blocked threads, deadlocks, and unsafe shared state.
CPU usage is ~90% but request traffic is very low. What could be wrong?
A thread is stuck in BLOCKED state. How will you identify what it is waiting on?
A deadlock is detected in production. How will you resolve it and prevent recurrence?
The request thread pool (or ForkJoinPool) is exhausted under load. What is your approach?
You need to handle very high concurrency safely. What design patterns and primitives will you use?
The service becomes randomly unresponsive for short windows. What could be happening internally?
HashMap (or ConcurrentHashMap) performance drops under heavy load. Why?
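For the blocked-thread and deadlock questions above, one way to see what a thread is waiting on from inside the process is java.lang.management.ThreadMXBean. The sketch below is illustrative only: the class name ThreadProbe and its methods are hypothetical, and a real service would more likely expose this through an admin endpoint or rely on jstack.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the class and method names are illustrative only.
final class ThreadProbe {

    /** Describes each thread that is part of a deadlock cycle; empty if none. */
    static List<String> findDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads also covers java.util.concurrent locks, but it
        // needs synchronizer monitoring; otherwise fall back to monitors only.
        long[] ids = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report; // null means no deadlock was found
        }
        for (ThreadInfo info : threads.getThreadInfo(ids, Integer.MAX_VALUE)) {
            if (info != null) {
                report.add(describe(info));
            }
        }
        return report;
    }

    /** Describes each thread currently in the BLOCKED state. */
    static List<String> blockedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                report.add(describe(info));
            }
        }
        return report;
    }

    /** One line per thread: name, state, contended lock, owner, top frame. */
    private static String describe(ThreadInfo info) {
        StringBuilder line = new StringBuilder()
                .append('"').append(info.getThreadName()).append("\" ")
                .append(info.getThreadState())
                .append(" waiting on ").append(info.getLockName())
                .append(" held by \"").append(info.getLockOwnerName()).append('"');
        StackTraceElement[] stack = info.getStackTrace();
        if (stack.length > 0) {
            line.append(" at ").append(stack[0]);
        }
        return line.toString();
    }
}
```

The lock name and owner name in each line answer "what is it waiting on, and who holds it". The top stack frame points at the code path to inspect for inconsistent lock ordering.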
“Works on my machine” failures, inconsistent logs, and tracing one request end to end.
The API works locally but fails in production. What will you investigate first?
Logs show inconsistent behavior across requests (same input, different outcome). Why?
Support cannot correlate user complaints to logs across services. What do you add to the platform?
The gateway returns 502/504 but Kubernetes reports pods healthy. Where do you look next?
Slow queries, pool exhaustion, and data-layer bottlenecks.
Database calls are slowing down the service. What optimizations will you apply?
The DB connection pool is exhausted and requests hang. How do you fix and size pools correctly?
One API endpoint triggers hundreds of SQL queries per request. How do you find and fix N+1?
Spikes in DB row lock wait time appear during peak traffic. What changed and what mitigations apply?
Users report stale reads after an update. How do you handle read-replica lag in the application?
Stale data, stampedes, and cache-aside pitfalls.
Horizontal scale that does not help, duplicates, and backpressure.
You added more instances but latency did not improve. Why might horizontal scaling fail?
The system processes duplicate requests (retries, double-clicks). How will you make APIs idempotent?
Kafka consumer lag grows for hours after a traffic spike. What is your triage order?
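For the duplicate-request question above, the core of an idempotent API is "run the side effect at most once per idempotency key and replay the stored result for repeats". The sketch below shows only that core, under loud assumptions: IdempotentExecutor is a hypothetical name, the key is assumed to come from the client (for example a request header), and the store is an in-memory map, so it does not survive restarts or span multiple instances. A production version would typically back this with a database unique constraint or a shared store with expiry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Hypothetical in-memory sketch: single JVM only, no expiry, no persistence.
final class IdempotentExecutor<R> {

    private final ConcurrentMap<String, R> results = new ConcurrentHashMap<>();

    /**
     * Runs the operation the first time a key is seen and returns the stored
     * result for every retry with the same key.
     *
     * Caveats of this sketch: the operation must return a non-null result
     * (a null result is not recorded, so it would run again), must not touch
     * this executor recursively, and an exception leaves the key unrecorded
     * so that a retry can try again.
     */
    R execute(String idempotencyKey, Supplier<R> operation) {
        // ConcurrentHashMap.computeIfAbsent applies the function atomically,
        // at most once per key, even when retries race on different threads.
        return results.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

A usage example, with a hypothetical key and a counter standing in for the real side effect:

```java
IdempotentExecutor<String> payments = new IdempotentExecutor<>();
java.util.concurrent.atomic.AtomicInteger charges = new java.util.concurrent.atomic.AtomicInteger();
String first = payments.execute("order-1", () -> "charge-" + charges.incrementAndGet());
String retry = payments.execute("order-1", () -> "charge-" + charges.incrementAndGet());
// first and retry are both "charge-1"; the side effect ran once.
```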