Deadlock detected in production

Scenario

Alerts fire: all worker threads stuck, requests hang, or the JVM prints “Found one Java-level deadlock” in a thread dump. Two or more threads each hold a lock the other needs—a cycle. You must break the incident safely, find the code path, and stop it from returning on the next deploy.

After reading, you should be able to:

Explain deadlock vs high BLOCKED contention (one owner vs circular wait).
Read JVM deadlock sections and trace locks to source lines.
Handle database deadlocks (victim rollback, retry) separately from in-process locks.
Prevent with lock ordering, timeouts, and concurrency tests.

Why — circular wait freezes progress

A deadlock is when thread A holds lock L1 and waits for L2, while thread B holds L2 and waits for L1 (classic two-thread case). No thread in the cycle can proceed. The JVM may still run GC and JMX threads, but your request workers are frozen.

Coffman conditions (all four must hold)

Mutual exclusion — only one thread holds the resource.
Hold and wait — thread keeps one lock while waiting for another.
No preemption — locks are not forcibly taken away (normal Java monitors).
Circular wait — A → B → … → A.

Breaking any one condition prevents deadlock. In practice you usually enforce a global lock order or use tryLock with timeout so the cycle cannot form.
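The tryLock-with-timeout option can be sketched as follows. This is a minimal illustration, assuming both resources are guarded by ReentrantLock instances; the class and method names (TimedLocking, runWithBoth) are hypothetical and not taken from any particular codebase.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class TimedLocking {
    /**
     * Runs work while holding both locks, or gives up and returns false.
     * Giving up releases whatever was acquired, so "hold and wait"
     * never lasts longer than the timeout.
     */
    static boolean runWithBoth(ReentrantLock first, ReentrantLock second,
                               long timeoutMillis, Runnable work) {
        try {
            if (!first.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false; // first lock not available in time
            }
            try {
                if (!second.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    return false; // back off; the finally below releases 'first'
                }
                try {
                    work.run();
                    return true;
                } finally {
                    second.unlock();
                }
            } finally {
                first.unlock();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        }
    }
}
```

A caller that receives false can retry later or fail the request; either way the thread is no longer part of a potential cycle.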

Deadlock vs contention

Aspect	Contention	Deadlock
Pattern	Many threads wait on one busy lock	Cycle of two+ locks
Progress	Slow but some threads complete	Involved threads make zero progress
Dump	One clear owner per monitor	JVM prints `Found one Java-level deadlock`

Where deadlocks show up in real systems

Nested synchronized — method A locks user then order; method B locks order then user.
Callback under lock — hold lock, call listener that tries to re-enter or acquire second lock.
Pool + lock — thread holds DB connection lock while waiting for pool slot; another holds pool slot waiting for connection.
ReentrantLock + synchronized — mixed APIs, inconsistent ordering.
Database — two transactions lock rows in opposite order; DB picks a victim and rolls back one (not a JVM deadlock, same symptom).
Distributed locks — Redis/DB advisory locks with TTL; “deadlock” is often lease expiry or forgotten unlock, not a JVM cycle.

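The JDK runs deadlock detection when you capture a thread dump (jstack, jcmd Thread.print). It reports cycles over Java monitors and java.util.concurrent ownable synchronizers such as ReentrantLock. It cannot see DB deadlocks, or logical deadlocks that involve no locks (for example, threads waiting on each other's futures).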

What — resolve the incident and find the cycle

Immediate mitigation (before you have a root cause)

Confirm scope — one pod vs all replicas. Rolling restart of affected instances restores service; capture dumps first if possible.

Thread dump on stuck JVM

jcmd <pid> Thread.print > /tmp/deadlock-$(date +%s).txt

Read the deadlock section at end of dump

Found one Java-level deadlock:
=============================
"http-nio-8080-exec-7":
  waiting to lock monitor 0x00000000f1a2b3c0 (object 0x00000000e88order, a com.app.OrderService),
  which is held by "http-nio-8080-exec-3"
"http-nio-8080-exec-3":
  waiting to lock monitor 0x00000000f4d5e6f0 (object 0x00000000e99user, a com.app.UserService),
  which is held by "http-nio-8080-exec-7"

Draw the cycle: exec-7 wants OrderService (held by exec-3); exec-3 wants UserService (held by exec-7).

Scroll each thread’s full stack — find - locked and - waiting to lock lines with file:line.
Map to recent change — deploy, feature flag, new integration that introduced nested locking.
If no JVM deadlock section: check DB logs for deadlock errors (PostgreSQL "deadlock detected", SQL Server error 1205, MySQL error 1213), and check whether all workers are WAITING or TIMED_WAITING on a pool (see the thread pool exhausted scenario).
Save artifacts: 2–3 thread dumps taken a few seconds apart (a heap dump is usually not needed), plus a JFR recording if one is still running.

Database deadlock checklist

Identify victim transaction from DB log (PostgreSQL: deadlock detected detail; MySQL: SHOW ENGINE INNODB STATUS).
Match SQL and table/index order in application code.
Confirm app retries idempotent operations on serialization/deadlock error (with backoff cap).
Shorten transactions; avoid user-facing work inside TX.

Tools

jstack / jcmd — built-in deadlock report.
VisualVM, IntelliJ — parse dumps, highlight cycles.
JFR — Java Monitor Blocked events leading up to incident.
FindBugs / SpotBugs, Error Prone — static lock-order warnings (limited).
jcstress, stress tests — reproduce rare orderings in CI.

How — break the cycle and prevent recurrence

Durable fixes (in-process)

Technique	When to use
Global lock ordering	Always acquire `userId` then `orderId` (compare IDs if pairing arbitrary entities).
Single lock per aggregate	One lock for “transfer” spanning user+order instead of two object monitors.
Shrink critical section	Compute outside lock; lock only to update shared structure.
No callbacks under lock	Copy state, release lock, then invoke listeners.
`tryLock(timeout)`	Fail fast with 503 instead of freezing the pool forever.
Concurrent structures	Remove locks where `ConcurrentHashMap` / queues suffice.
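The "no callbacks under lock" row can be illustrated with a small sketch. It assumes a hypothetical OrderEvents class with string status updates; holdsInternalLock() exists only so the behavior can be checked and would not normally be part of such a class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class OrderEvents {
    private final Object lock = new Object();
    private final List<Consumer<String>> listeners = new ArrayList<>();
    private String status = "NEW";

    void addListener(Consumer<String> listener) {
        synchronized (lock) {
            listeners.add(listener);
        }
    }

    void updateStatus(String newStatus) {
        List<Consumer<String>> snapshot;
        synchronized (lock) {
            // Critical section: mutate state and copy the listener list only
            status = newStatus;
            snapshot = new ArrayList<>(listeners);
        }
        // Lock released: listeners may take their own locks without forming a cycle
        for (Consumer<String> listener : snapshot) {
            listener.accept(newStatus);
        }
    }

    String status() {
        synchronized (lock) {
            return status;
        }
    }

    // Illustration only: lets a listener verify it is not running under the lock
    boolean holdsInternalLock() {
        return Thread.holdsLock(lock);
    }
}
```

The trade-off is that a listener may observe a status that has already been superseded by the time it runs, so listeners should treat the argument as a notification rather than as current state.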

Lock ordering example

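// Bad: opposite order in two code paths lets a cycle form
void transferA() { synchronized (user) { synchronized (order) { ... } } }
void transferB() { synchronized (order) { synchronized (user) { ... } } }

// Good: every caller locks in one canonical order.
// identityHashCode values can collide, so a shared tie-breaker lock
// serializes the rare case where both hashes are equal.
private static final Object TIE_LOCK = new Object();

void lockPair(Object a, Object b, Runnable work) {
  int hashA = System.identityHashCode(a);
  int hashB = System.identityHashCode(b);
  if (hashA == hashB) {
    synchronized (TIE_LOCK) {
      synchronized (a) { synchronized (b) { work.run(); } }
    }
    return;
  }
  Object first = hashA < hashB ? a : b;
  Object second = hashA < hashB ? b : a;
  synchronized (first) {
    synchronized (second) {
      work.run();
    }
  }
}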

Production code often uses explicit lock IDs (userId, orderId) rather than identity hash—same idea: one canonical order.
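A sketch of the explicit-ID variant, assuming both entities share one comparable ID space (for example, two account IDs). The IdOrderedLocks class and its withBoth method are hypothetical names, and the lock map here grows without bound, which a real implementation would need to address.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

final class IdOrderedLocks {
    // One lock per ID; entries are never evicted in this sketch
    private final ConcurrentHashMap<Long, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(long id) {
        return locks.computeIfAbsent(id, k -> new ReentrantLock());
    }

    /** Runs work while holding the locks for both IDs, always taking the lower ID first. */
    void withBoth(long idA, long idB, Runnable work) {
        ReentrantLock first = lockFor(Math.min(idA, idB));
        ReentrantLock second = lockFor(Math.max(idA, idB));
        first.lock();
        try {
            second.lock(); // same lock when idA == idB; ReentrantLock permits re-entry
            try {
                work.run();
            } finally {
                second.unlock();
            }
        } finally {
            first.unlock();
        }
    }
}
```

Because callers pass IDs rather than acquiring locks themselves, withBoth(1, 2, ...) and withBoth(2, 1, ...) take the locks in the same order and cannot form a cycle with each other.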

Database fixes

Access tables in a fixed order in all code paths.
Use consistent index paths so row locks align (avoid gap-lock surprises in MySQL).
Retry deadlock victims with exponential backoff; cap retries and alert.
Reduce lock scope: smaller transactions, lower isolation only if business allows.
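The retry item above might look like the following sketch. It assumes the driver reports deadlock victims with SQLSTATE 40001 or 40P01 (PostgreSQL's deadlock_detected); drivers differ, so verify the codes for your database. DeadlockRetry and runWithRetry are hypothetical names, and the callable is assumed to run one complete, idempotent transaction per call.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

final class DeadlockRetry {
    /** Assumption: 40001 = serialization failure, 40P01 = PostgreSQL deadlock_detected. */
    static boolean isDeadlockVictim(SQLException e) {
        String state = e.getSQLState();
        return "40001".equals(state) || "40P01".equals(state);
    }

    /**
     * Runs tx up to maxAttempts times, sleeping a random time bounded by an
     * exponentially growing cap between attempts. Other errors are rethrown at once.
     */
    static <T> T runWithRetry(Callable<T> tx, int maxAttempts, long baseDelayMillis)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return tx.call();
            } catch (SQLException e) {
                if (!isDeadlockVictim(e) || attempt >= maxAttempts) {
                    throw e; // not retryable, or retries exhausted: surface and alert
                }
                // Cap doubles per attempt; the shift is bounded to avoid overflow
                long cap = baseDelayMillis << Math.min(attempt - 1, 10);
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
    }
}
```

The random jitter keeps two former victims from retrying in lockstep and colliding again.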

Verify after fix

Rerun the load test that previously stuck threads (same RPS, duration ≥ 30 min) and confirm it completes.
Thread dumps under load: no Found one Java-level deadlock.
DB: deadlock rate metric near zero; retries succeed without user-visible errors.
Chaos: optional ThreadMXBean.findDeadlockedThreads() in health check during staging soak.
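A minimal sketch of such a probe, using the standard java.lang.management API. The DeadlockProbe class name and the message format are hypothetical; how the result is wired into a health endpoint is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

final class DeadlockProbe {
    /** Returns one description per deadlocked thread, or an empty list if none. */
    static List<String> deadlockedThreadNames() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Covers monitors and ownable synchronizers; returns null when no cycle exists
        long[] ids = mx.findDeadlockedThreads();
        List<String> names = new ArrayList<>();
        if (ids == null) {
            return names;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info != null) { // a thread may have died between the two calls
                names.add(info.getThreadName() + " waiting on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return names;
    }
}
```

A staging health check could fail when the list is non-empty. The call is not free, so polling it on a slow schedule rather than on every request is a reasonable default.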

Prevention guardrails

Design rule: max one lock per request path, or documented lock hierarchy in module README.
Code review checklist: nested synchronized, listeners under lock, cross-service lock order.
Integration tests with concurrent threads hammering transfer/booking flows.
Alert: thread pool at 100% + zero completed requests for N minutes.
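The concurrent integration test guardrail can be approximated with a small harness like the one below. It is a sketch under stated assumptions: ConcurrencyHammer and allFinish are hypothetical names, a timeout is treated as a suspected hang, and a task that throws still counts as finished.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ConcurrencyHammer {
    /** Starts all tasks at once; returns false if any is still running at the timeout. */
    static boolean allFinish(List<Runnable> tasks, long timeoutMillis) {
        // Daemon threads: a thread blocked on a monitor cannot be interrupted,
        // so stuck workers must not keep the test JVM alive.
        ExecutorService pool = Executors.newFixedThreadPool(tasks.size(), r -> {
            Thread t = new Thread(r, "hammer-worker");
            t.setDaemon(true);
            return t;
        });
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(tasks.size());
        for (Runnable task : tasks) {
            pool.execute(() -> {
                try {
                    start.await(); // release every worker at the same moment
                    task.run();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        start.countDown();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In a test suite, the tasks would call the real transfer or booking code paths in opposing orders, and the assertion would be that allFinish returns true within a generous timeout.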

Interview one-liner

“I capture a thread dump and read the JVM’s deadlock section to get the cycle and stacks. Mitigate with restart if needed, then fix with a global lock order or by removing nested locks. For the database I align row access order and retry victims idempotently.”
