Deadlock detected in production

Scenario

Alerts fire: all worker threads stuck, requests hang, or the JVM prints “Found one Java-level deadlock” in a thread dump. Two or more threads each hold a lock the other needs—a cycle. You must break the incident safely, find the code path, and stop it from returning on the next deploy.

After reading, you should be able to:

Why — circular wait freezes progress

A deadlock is when thread A holds lock L1 and waits for L2, while thread B holds L2 and waits for L1 (classic two-thread case). No thread in the cycle can proceed. The JVM may still run GC and JMX threads, but your request workers are frozen.

Coffman conditions (all four must hold)

  1. Mutual exclusion — only one thread holds the resource.
  2. Hold and wait — thread keeps one lock while waiting for another.
  3. No preemption — locks are not forcibly taken away (normal Java monitors).
  4. Circular wait — A → B → … → A.

Breaking any one condition prevents deadlock. In practice you usually enforce a global lock order, so a cycle cannot form, or use tryLock with a timeout, so a thread gives up and releases what it holds instead of waiting forever.
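To make the tryLock option concrete, here is a minimal sketch of acquiring two locks with a timeout. The class and method names (TimedLocking, runWithBoth) are illustrative and not from any library; only Lock.tryLock(long, TimeUnit) is the real JDK call. It assumes the caller can tolerate a "could not lock" result, for example by returning an error to the client.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;

final class TimedLocking {
    // Try to take one lock within the timeout; false on timeout or interrupt.
    private static boolean acquire(Lock lock, long timeoutMillis) {
        try {
            return lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return false;
        }
    }

    /**
     * Runs work while holding both locks. Returns false, holding nothing,
     * if either lock cannot be acquired in time, so no thread waits forever.
     */
    static boolean runWithBoth(Lock first, Lock second, long timeoutMillis, Runnable work) {
        if (!acquire(first, timeoutMillis)) {
            return false;
        }
        try {
            if (!acquire(second, timeoutMillis)) {
                return false; // the outer finally releases the first lock
            }
            try {
                work.run();
                return true;
            } finally {
                second.unlock();
            }
        } finally {
            first.unlock();
        }
    }
}
```

Giving up releases the first lock, which removes the hold-and-wait condition for that attempt. The caller decides whether to retry or fail the request.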

Deadlock vs contention

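|          | Contention                            | Deadlock                                  |
|----------|---------------------------------------|-------------------------------------------|
| Pattern  | Many threads wait on one busy lock    | Cycle of two+ locks                       |
| Progress | Slow but some threads complete        | Involved threads make zero progress       |
| Dump     | One clear owner per monitor           | JVM prints Found one Java-level deadlock  |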

Where deadlocks show up in real systems

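JDK deadlock detection runs when you capture a thread dump (jstack, jcmd Thread.print). It reports only cycles the JVM can see: Java object monitors and java.util.concurrent ownable synchronizers such as ReentrantLock. It does not report DB deadlocks or logical deadlocks that involve no locks.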

What — resolve the incident and find the cycle

Immediate mitigation (before you have a root cause)

  1. Confirm scope — one pod vs all replicas. Rolling restart of affected instances restores service; capture dumps first if possible.
  2. Thread dump on stuck JVM
    jcmd <pid> Thread.print > /tmp/deadlock-$(date +%s).txt
  3. Read the deadlock section at end of dump
    Found one Java-level deadlock:
    =============================
    "http-nio-8080-exec-7":
      waiting to lock monitor 0x00000000f1a2b3c0 (object 0x00000000e88order, com.app.OrderService),
      which is held by "http-nio-8080-exec-3"
    "http-nio-8080-exec-3":
      waiting to lock monitor 0x00000000f4d5e6f0 (object 0x00000000e99user, com.app.UserService),
      which is held by "http-nio-8080-exec-7"
    Draw the cycle: exec-7 wants OrderService (held by exec-3); exec-3 wants UserService (held by exec-7).
  4. Scroll each thread’s full stack: find the "- locked" and "- waiting to lock" lines and note the file:line of the frame directly above each.
  5. Map to recent change — deploy, feature flag, new integration that introduced nested locking.
  6. If the dump has no JVM deadlock section: check the DB logs for deadlock detected (PostgreSQL), error 1205 (SQL Server), or error 1213 (MySQL); also check whether all threads are TIMED_WAITING on a pool (see thread pool exhausted).
  7. Save artifacts: 2–3 thread dumps (a heap dump is not always needed) and a JFR recording if one is still running.
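Steps 3 and 4 can be partly automated. The sketch below extracts the "who waits for whom" edges from the deadlock section of a saved dump. DeadlockSectionParser and waitsFor are hypothetical names, and the parser assumes the HotSpot text layout shown in step 3 (a quoted thread name on its own line, followed by a "which is held by" line); other JVMs or versions may format the section differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeadlockSectionParser {
    // A line holding only a quoted thread name and a colon, e.g. "exec-7":
    private static final Pattern WAITER = Pattern.compile("^\\s*\"([^\"]+)\":\\s*$");
    // The line naming the current owner of the wanted lock.
    private static final Pattern HOLDER = Pattern.compile("which is held by \"([^\"]+)\"");

    /** Maps each waiting thread to the thread that holds the lock it wants. */
    static Map<String, String> waitsFor(String dump) {
        Map<String, String> edges = new LinkedHashMap<>();
        boolean inSection = false;
        String waiter = null;
        for (String line : dump.split("\\R")) {
            if (line.contains("Java-level deadlock")) {
                inSection = true; // ignore everything before the deadlock report
                continue;
            }
            if (!inSection) {
                continue;
            }
            Matcher w = WAITER.matcher(line);
            if (w.matches()) {
                waiter = w.group(1);
                continue;
            }
            Matcher h = HOLDER.matcher(line);
            if (waiter != null && h.find()) {
                edges.put(waiter, h.group(1));
                waiter = null;
            }
        }
        return edges;
    }
}
```

Following the map from any thread should lead back to that thread; that loop is the cycle to draw. The stack frames still have to be read by hand to find the file:line of each acquisition.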

Database deadlock checklist

  1. Identify the victim transaction from the DB log (PostgreSQL: the DETAIL lines under ERROR: deadlock detected; MySQL: the LATEST DETECTED DEADLOCK section of SHOW ENGINE INNODB STATUS).
  2. Match SQL and table/index order in application code.
  3. Confirm app retries idempotent operations on serialization/deadlock error (with backoff cap).
  4. Shorten transactions; avoid user-facing work inside TX.
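Item 3 of the checklist might look like the sketch below: retry a deadlock victim a bounded number of times with capped, jittered backoff. DeadlockRetry, SqlWork, and withRetry are hypothetical names. The sketch assumes the work is idempotent and runs in its own transaction, and that the driver surfaces the usual codes (SQLSTATE 40P01 and 40001, vendor codes 1213 and 1205); verify these against your driver.

```java
import java.sql.SQLException;
import java.util.concurrent.ThreadLocalRandom;

final class DeadlockRetry {
    /** One idempotent unit of transactional work (hypothetical interface). */
    interface SqlWork<T> {
        T run() throws SQLException;
    }

    /** True for the deadlock and serialization errors named in the checklist. */
    static boolean isDeadlock(SQLException e) {
        String state = e.getSQLState();
        int code = e.getErrorCode();
        // Vendor codes overlap: 1205 is the deadlock victim error on SQL Server
        // but a lock wait timeout on MySQL. Real code should check the vendor.
        return "40P01".equals(state)  // PostgreSQL deadlock_detected
            || "40001".equals(state)  // serialization failure
            || code == 1213           // MySQL deadlock
            || code == 1205;          // SQL Server deadlock victim
    }

    /** Retries deadlock victims up to maxAttempts with capped, jittered backoff. */
    static <T> T withRetry(SqlWork<T> work, int maxAttempts,
                           long baseDelayMillis, long maxDelayMillis) throws SQLException {
        for (int attempt = 1; ; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                if (!isDeadlock(e) || attempt >= maxAttempts) {
                    throw e; // not retryable, or out of attempts
                }
                long cap = Math.min(maxDelayMillis, baseDelayMillis << Math.min(attempt - 1, 20));
                sleep(ThreadLocalRandom.current().nextLong(cap + 1)); // random delay in [0, cap]
            }
        }
    }

    private static void sleep(long millis) throws SQLException {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new SQLException("interrupted while backing off", ie);
        }
    }
}
```

The random delay keeps two victims from retrying in lockstep and colliding again. The attempt limit keeps a persistent ordering bug from turning into an endless retry loop.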

Tools

How — break the cycle and prevent recurrence

Durable fixes (in-process)

| Technique                 | When to use                                                                      |
|---------------------------|----------------------------------------------------------------------------------|
| Global lock ordering      | Always acquire userId then orderId (compare IDs if pairing arbitrary entities).  |
| Single lock per aggregate | One lock for “transfer” spanning user+order instead of two object monitors.      |
| Shrink critical section   | Compute outside lock; lock only to update shared structure.                      |
| No callbacks under lock   | Copy state, release lock, then invoke listeners.                                 |
| tryLock(timeout)          | Fail fast with 503 instead of freezing the pool forever.                         |
| Concurrent structures     | Remove locks where ConcurrentHashMap / queues suffice.                           |
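The "shrink critical section" and "no callbacks under lock" rows combine naturally. A minimal sketch, using a hypothetical PriceFeed class: the lock covers only the state update and a copy of the listener list, and listeners run after the lock is released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class PriceFeed {
    private final Object lock = new Object();
    private final List<Consumer<Integer>> listeners = new ArrayList<>();
    private int price;

    void addListener(Consumer<Integer> listener) {
        synchronized (lock) {
            listeners.add(listener);
        }
    }

    void update(int newPrice) {
        List<Consumer<Integer>> snapshot;
        synchronized (lock) {
            price = newPrice;                       // short critical section
            snapshot = new ArrayList<>(listeners);  // copy state under the lock
        }
        // Lock released: a listener may take its own locks, or call back
        // into this object, without nesting under our lock.
        for (Consumer<Integer> listener : snapshot) {
            listener.accept(newPrice);
        }
    }

    int current() {
        synchronized (lock) {
            return price;
        }
    }

    /** Exposed only so the example can show the lock is free during callbacks. */
    boolean lockHeldByCurrentThread() {
        return Thread.holdsLock(lock);
    }
}
```

The trade-off is that a listener can observe a price that a concurrent update has already replaced. If listeners need strict ordering, a single dispatch thread fed by a queue is a common alternative.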

Lock ordering example

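// Bad: opposite order in two code paths
void transferA() { synchronized (user) { synchronized (order) { ... } } }
void transferB() { synchronized (order) { synchronized (user) { ... } } }

// Good: one canonical order for every caller, here by identity hash.
// Distinct objects can share an identity hash, so ties take an extra lock.
private static final Object TIE_LOCK = new Object();

void lockPair(Object a, Object b, Runnable work) {
  int ha = System.identityHashCode(a);
  int hb = System.identityHashCode(b);
  if (ha == hb && a != b) {
    synchronized (TIE_LOCK) { // serialize the rare tie case
      synchronized (a) { synchronized (b) { work.run(); } }
    }
    return;
  }
  Object first = ha < hb ? a : b;
  Object second = first == a ? b : a;
  synchronized (first) {
    synchronized (second) {
      work.run();
    }
  }
}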

Production code often uses explicit lock IDs (userId, orderId) rather than identity hash—same idea: one canonical order.
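A sketch of the explicit-ID variant, under the assumption that every lockable entity carries a unique, stable numeric ID. Account and OrderedTransfer are hypothetical names used only for illustration.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical entity with a unique, stable id and its own lock. */
final class Account {
    final long id;
    final ReentrantLock lock = new ReentrantLock();
    long balance;

    Account(long id, long balance) {
        this.id = id;
        this.balance = balance;
    }
}

final class OrderedTransfer {
    /** Moves amount between accounts, always locking the lower id first. */
    static void transfer(Account from, Account to, long amount) {
        if (from.id == to.id) {
            throw new IllegalArgumentException("source and target must differ");
        }
        Account first = from.id < to.id ? from : to;
        Account second = first == from ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally {
                second.unlock();
            }
        } finally {
            first.unlock();
        }
    }
}
```

Because transfer(a, b) and transfer(b, a) both lock the lower ID first, the two calls cannot hold one lock each while waiting for the other. Unique IDs also remove the need for the tie-breaker that identity hashes require.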

Database fixes

Verify after fix

  1. Rerun the load test that previously left threads stuck (same RPS, duration ≥ 30 min).
  2. Thread dumps under load: no Found one Java-level deadlock.
  3. DB: deadlock rate metric near zero; retries succeed without user-visible errors.
  4. Chaos: optionally call ThreadMXBean.findDeadlockedThreads() from a health check during the staging soak.
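The health check in step 4 could be sketched as follows. DeadlockProbe and deadlockedThreadNames are hypothetical names; the ThreadMXBean calls are the standard java.lang.management API. findDeadlockedThreads() returns null when it finds no cycle, and it can be costly on JVMs with many threads, so this sketch assumes an infrequent staging check rather than a hot path.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

final class DeadlockProbe {
    /** One description per deadlocked thread; an empty list means healthy. */
    static List<String> deadlockedThreadNames() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is found
        List<String> result = new ArrayList<>();
        if (ids == null) {
            return result;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info != null) { // null if the thread has since terminated
                result.add(info.getThreadName()
                        + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return result;
    }
}
```

A staging health endpoint could report unhealthy, and log the returned lines, whenever the list is non-empty.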

Prevention guardrails

Interview one-liner

“I capture a thread dump and read the JVM’s deadlock section to get the cycle and stacks. Mitigate with restart if needed, then fix with a global lock order or by removing nested locks. For the database I align row access order and retry victims idempotently.”
