Production systems

Real outages rarely start with “fix line 42.” They start with symptoms—OOM, stuck threads, stale cache, duplicate charges. Each guide in this track answers the same three questions: Why does this happen in production? What do you check first (in order)? How do you fix it now and prevent it next time?

The Why · What · How framework

Why

Name the root-cause category (memory pressure, lock contention, config drift, missing backpressure) rather than assigning blame. Tie each symptom to its mechanism: for an OutOfMemoryError, for example, distinguish an undersized heap from a leak from a single oversized payload.

What

An ordered checklist: metrics → logs → thread dump → heap dump → recent deploy → dependency health. Each guide also lists what to capture before a restart destroys the evidence.
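
As a rough illustration of capturing evidence before a restart, the sketch below collects heap usage and a short stack trace for every live thread from inside the JVM, using the standard `java.lang.management` beans. The class name `EvidenceSnapshot` and the five-frame cutoff are assumptions made for this example. In practice, external tools such as `jcmd <pid> Thread.print` and `jcmd <pid> GC.heap_dump <file>` are often the first choice.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: gathers in-process evidence as text before a restart.
final class EvidenceSnapshot {

    // Returns heap usage plus the name, state, and top frames of every live thread.
    static String capture() {
        StringBuilder out = new StringBuilder();

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        out.append("heap used=").append(heap.getUsed())
           .append(" committed=").append(heap.getCommitted())
           .append(" max=").append(heap.getMax()).append('\n');

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // false/false: skip locked-monitor and synchronizer details to keep
        // the dump cheap; pass true/true when lock ownership matters.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            StackTraceElement[] stack = info.getStackTrace();
            // Assumption: the top five frames are enough for a first look.
            for (int i = 0; i < Math.min(stack.length, 5); i++) {
                out.append("    at ").append(stack[i]).append('\n');
            }
        }
        return out.toString();
    }
}
```

Writing the returned string to a file or log line before the process is recycled keeps at least a minimal record when a full heap dump is too slow or too large to take.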

How

Immediate mitigation (scale out, trip a circuit breaker, switch off the feature behind its flag), durable fix (code, pool sizing, idempotency), and guardrails (alerts, load tests, runbooks).
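
To make the "circuit break" mitigation concrete, here is a minimal sketch of a circuit breaker. It is not a production implementation: the class name `SimpleCircuitBreaker`, the consecutive-failure threshold, and the fixed cool-down are all assumptions chosen for illustration. The clock is injected so the behavior can be tested without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker; names and thresholds are illustrative.
final class SimpleCircuitBreaker {
    private final int failureThreshold;  // consecutive failures that open the breaker
    private final long openMillis;       // how long to reject calls once open
    private final LongSupplier clock;    // injected time source (milliseconds)
    private int consecutiveFailures;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // Runs the action unless the breaker is open; falls back on rejection or failure.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get();
        }
        try {
            T result = action.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    private synchronized boolean isOpen() {
        if (open && clock.getAsLong() - openedAt >= openMillis) {
            // Cool-down elapsed: let calls through again, but keep the counter
            // one below the threshold so a single failure re-opens the breaker.
            open = false;
            consecutiveFailures = failureThreshold - 1;
        }
        return open;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong();
        }
    }
}
```

A caller might construct it as `new SimpleCircuitBreaker(5, 30_000, System::currentTimeMillis)` and wrap each downstream call in `call(...)`. The action runs outside the lock, so a slow dependency does not serialize callers on the breaker itself.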

JVM & runtime health

Memory, garbage collection, and processes that degrade over time.

  1. Your application suddenly throws OutOfMemoryError. How will you debug it?

  2. You suspect a memory leak. How will you confirm it and find the retaining path?

  3. Frequent GC pauses are impacting latency. How will you tune and verify improvement?

  4. The application becomes slow after a few hours of uptime. What will you check first?

  5. The JVM process exits or the pod restarts without a clear stack trace in app logs. How will you debug it?

  6. RSS memory keeps growing but heap looks stable. What besides the Java heap could be consuming memory?

  7. Metaspace or class-loader leaks appear after many redeploys. How do you diagnose them?

CPU, threads & concurrency

High CPU with low traffic, blocked threads, deadlocks, and unsafe shared state.

  1. CPU usage is ~90% but request traffic is very low. What could be wrong?

  2. A thread is stuck in BLOCKED state. How will you identify what it is waiting on?

  3. A deadlock is detected in production. How will you resolve it and prevent recurrence?

  4. The request thread pool (or ForkJoinPool) is exhausted under load. What is your approach?

  5. Multiple threads update shared data incorrectly. How will you fix concurrency bugs safely?

  6. You need to handle very high concurrency safely. What design patterns and primitives will you use?

  7. The service becomes randomly unresponsive for short windows. What could be happening internally?

  8. HashMap (or ConcurrentHashMap) performance drops under heavy load. Why?
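
For the blocked-thread and deadlock questions in this section, the JVM can report lock cycles itself. The sketch below asks `ThreadMXBean` for deadlocked threads and describes what each one is waiting on. The class name `DeadlockProbe` and the report format are assumptions for illustration, and `findDeadlockedThreads()` is only available on JVMs that support monitoring of ownable synchronizers (HotSpot does).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical probe: lists threads that are part of a lock cycle.
final class DeadlockProbe {

    // Returns one line per deadlocked thread; an empty list means none were found.
    static List<String> describeDeadlocks() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads(); // null when there is no deadlock
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report;
        }
        for (ThreadInfo info : bean.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            report.add(info.getThreadName()
                    + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return report;
    }
}
```

Running a probe like this on a schedule and alerting on a non-empty result is one possible guardrail. The durable fix is usually a consistent lock ordering or a timed lock acquisition in the code that formed the cycle.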

Production vs local & observability

“Works on my machine” failures, inconsistent logs, and tracing one request end to end.

  1. The API works locally but fails in production. What will you investigate first?

  2. Logs show inconsistent behavior across requests (same input, different outcome). Why?

  3. You need to trace one request across microservices, databases, Kafka, and external APIs. How will you do it?

  4. Support cannot correlate user complaints to logs across services. What do you add to the platform?

  5. Latency regressed right after a deployment but error rate is still zero. How do you prove causation and roll back safely?

  6. The gateway returns 502/504 but Kubernetes reports pods healthy. Where do you look next?

Database & persistence

Slow queries, pool exhaustion, and data-layer bottlenecks.

  1. Database calls are slowing down the service. What optimizations will you apply?

  2. The DB connection pool is exhausted and requests hang. How do you fix and size pools correctly?

  3. One API endpoint triggers hundreds of SQL queries per request. How do you find and fix N+1?

  4. Spikes in DB row lock wait time appear during peak traffic. What changed and what mitigations apply?

  5. Users report stale reads after an update. How do you handle read-replica lag in the application?

Caching & consistency

Stale data, stampedes, and cache-aside pitfalls.

  1. The cache returns stale data after an update. How will you solve cache consistency?

  2. A hot key expires and the database is overwhelmed. How do you prevent a cache stampede?

  3. Redis is unavailable. Should the app fail open or fail closed, and how do you decide?
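
One common answer to the cache-stampede question in this section is request coalescing (often called single-flight): when many threads miss on the same key at once, only one of them loads from the database and the rest wait for its result. The sketch below shows that idea only. The class name `SingleFlightLoader` is an assumption, and it deliberately stores nothing after the load completes, so it would sit in front of a real cache rather than replace one.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical single-flight wrapper: concurrent loads of one key share one call.
final class SingleFlightLoader<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. the database read behind the cache

    SingleFlightLoader(Function<K, V> loader) {
        this.loader = loader;
    }

    V load(K key) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            // Another thread is already loading this key: wait for its result.
            // join() rethrows a loader failure wrapped in CompletionException.
            return existing.join();
        }
        try {
            V value = loader.apply(key);
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);
            throw e;
        } finally {
            // Remove only our own entry so the next miss starts a fresh load.
            inFlight.remove(key, mine);
        }
    }
}
```

Coalescing is usually combined with other measures, such as staggered expiry times or refreshing hot keys before they expire. Which combination fits depends on how stale a value the caller can tolerate.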

Scaling, resilience & API correctness

Horizontal scaling that does not help, duplicate requests, and backpressure.

  1. You added more instances but latency did not improve. Why might horizontal scaling fail?

  2. The system processes duplicate requests (retries, double-clicks). How will you make APIs idempotent?

  3. A downstream service is slow and your service queues unbounded work. How do you add backpressure and circuit breaking?

  4. Timeouts cascade across microservices during an incident. How do you set timeouts, retries, and bulkheads?

  5. Kafka consumer lag grows for hours after a traffic spike. What is your triage order?

  6. Clients retry aggressively on 429 and make the outage worse. How do you design client and server behavior?
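
For the duplicate-request question in this section, a frequent approach is an idempotency key: the client sends the same key with every retry, and the server runs the operation at most once per key and replays the stored result. The sketch below shows the core of that idea with an in-memory map. The class name `IdempotentExecutor` is an assumption, and a real service would need a shared, durable store with expiry instead of a per-process map.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical in-memory idempotency guard; illustrative only.
final class IdempotentExecutor<R> {
    private final ConcurrentHashMap<String, R> results = new ConcurrentHashMap<>();

    // Runs the operation the first time a key is seen; later calls with the
    // same key return the stored result without running it again.
    R execute(String idempotencyKey, Supplier<R> operation) {
        // computeIfAbsent applies the function at most once per key, atomically.
        // If the operation throws, nothing is stored and a retry can run it again.
        return results.computeIfAbsent(idempotencyKey, k -> operation.get());
    }
}
```

With this shape, a double-click or a client retry that reuses the key results in one charge and two identical responses. Requests that reuse a key with a different payload need separate handling, which this sketch does not attempt.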