Deadlock detected in production
Scenario
Alerts fire: all worker threads stuck, requests hang, or the JVM prints “Found one Java-level deadlock” in a thread dump. Two or more threads each hold a lock the other needs—a cycle. You must break the incident safely, find the code path, and stop it from returning on the next deploy.
After reading, you should be able to:
- Explain deadlock vs high BLOCKED contention (circular wait vs one busy owner).
- Read JVM deadlock sections and trace locks to source lines.
- Handle database deadlocks (victim rollback, retry) separately from in-process locks.
- Prevent with lock ordering, timeouts, and concurrency tests.
Why — circular wait freezes progress
A deadlock is when thread A holds lock L1 and waits for L2, while thread B holds L2 and waits for L1 (classic two-thread case). No thread in the cycle can proceed. The JVM may still run GC and JMX threads, but your request workers are frozen.
Coffman conditions (all four must hold)
- Mutual exclusion — only one thread holds the resource.
- Hold and wait — thread keeps one lock while waiting for another.
- No preemption — locks are not forcibly taken away (normal Java monitors).
- Circular wait — A → B → … → A.
Breaking any one condition prevents deadlock. In practice you usually enforce a global lock order or use tryLock with timeout so the cycle cannot form.
Deadlock vs contention
| | Contention | Deadlock |
|---|---|---|
| Pattern | Many threads wait on one busy lock | Cycle of two+ locks |
| Progress | Slow but some threads complete | Involved threads make zero progress |
| Dump | One clear owner per monitor | JVM prints Found one Java-level deadlock |
Where deadlocks show up in real systems
- Nested synchronized: method A locks user then order; method B locks order then user.
- Callback under lock: hold lock, call listener that tries to re-enter or acquire second lock.
- Pool + lock: thread holds DB connection lock while waiting for pool slot; another holds pool slot waiting for connection.
- ReentrantLock + synchronized: mixed APIs, inconsistent ordering.
- Database: two transactions lock rows in opposite order; DB picks a victim and rolls back one (not a JVM deadlock, same symptom).
- Distributed locks: Redis/DB advisory locks with TTL; “deadlock” is often lease expiry or forgotten unlock, not a JVM cycle.
JDK deadlock detection runs when you capture a thread dump (jstack, jcmd Thread.print). It reports only cycles the JVM can see: object monitors (synchronized) and ownable synchronizers such as ReentrantLock. It does not report DB deadlocks or logical deadlocks that involve no Java locks.
What — resolve the incident and find the cycle
Immediate mitigation (before you have a root cause)
- Confirm scope: one pod vs all replicas. Rolling restart of affected instances restores service; capture dumps first if possible.
- Thread dump on stuck JVM:

    jcmd <pid> Thread.print > /tmp/deadlock-$(date +%s).txt

- Read the deadlock section at end of dump:

    Found one Java-level deadlock:
    =============================
    "http-nio-8080-exec-7":
      waiting to lock monitor 0x00000000f1a2b3c0 (object 0x00000000e88order, com.app.OrderService),
      which is held by "http-nio-8080-exec-3"
    "http-nio-8080-exec-3":
      waiting to lock monitor 0x00000000f4d5e6f0 (object 0x00000000e99user, com.app.UserService),
      which is held by "http-nio-8080-exec-7"

  Draw the cycle: exec-7 wants OrderService (held by exec-3); exec-3 wants UserService (held by exec-7).
- Scroll each thread’s full stack: find “- locked” and “- waiting to lock” lines with file:line.
- Map to recent change: deploy, feature flag, new integration that introduced nested locking.
- If no JVM deadlock section: check DB logs for deadlock detected / SQL Server 1205 / MySQL 1213; check for all threads TIMED_WAITING on pool (see thread pool exhausted).
- Save artifacts: 2–3 dumps, heap not always needed; JFR recording if still running.
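Checking saved dumps for the deadlock section can be scripted. The sketch below is a hypothetical helper (the class and method names are illustrative, not a JDK tool). It assumes the dump layout shown above, where each waiting thread is introduced by a line holding only its quoted name and a colon; the exact layout can vary between JDK versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical triage helper: scans thread-dump text for the deadlock section.
final class DeadlockSectionScanner {
    // Substring shared by "Found one Java-level deadlock" style headers.
    private static final String MARKER = "Java-level deadlock";
    // Assumed layout: a line containing only "thread-name": introduces a waiter.
    private static final Pattern WAITER = Pattern.compile("^\\s*\"([^\"]+)\":\\s*$");

    static boolean hasDeadlock(String dumpText) {
        return dumpText.contains(MARKER);
    }

    /** Thread names listed after the marker, in first-seen order, without duplicates. */
    static List<String> deadlockedThreads(String dumpText) {
        List<String> names = new ArrayList<>();
        int start = dumpText.indexOf(MARKER);
        if (start < 0) {
            return names;
        }
        for (String line : dumpText.substring(start).split("\\R")) {
            Matcher m = WAITER.matcher(line);
            if (m.matches() && !names.contains(m.group(1))) {
                names.add(m.group(1));
            }
        }
        return names;
    }
}
```

Read the dump file into a string (for example with Files.readString) and pass it to deadlockedThreads to get the names to look up in the full stacks.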
Database deadlock checklist
- Identify victim transaction from DB log (PostgreSQL: deadlock detected detail; MySQL: SHOW ENGINE INNODB STATUS).
- Match SQL and table/index order in application code.
- Confirm app retries idempotent operations on serialization/deadlock error (with backoff cap).
- Shorten transactions; avoid user-facing work inside TX.
Tools
- jstack / jcmd — built-in deadlock report.
- VisualVM, IntelliJ — parse dumps, highlight cycles.
- JFR — Java Monitor Blocked events leading up to incident.
- FindBugs / SpotBugs, Error Prone — static lock-order warnings (limited).
- jcstress, stress tests — reproduce rare orderings in CI.
How — break the cycle and prevent recurrence
Durable fixes (in-process)
| Technique | When to use |
|---|---|
| Global lock ordering | Always acquire userId then orderId (compare IDs if pairing arbitrary entities). |
| Single lock per aggregate | One lock for “transfer” spanning user+order instead of two object monitors. |
| Shrink critical section | Compute outside lock; lock only to update shared structure. |
| No callbacks under lock | Copy state, release lock, then invoke listeners. |
| tryLock(timeout) | Fail fast with 503 instead of freezing the pool forever. |
| Concurrent structures | Remove locks where ConcurrentHashMap / queues suffice. |
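The tryLock(timeout) row can be sketched as follows. This is a minimal illustration, not a drop-in utility: `TimedLockPair` and `runWithBoth` are hypothetical names, and treating an interrupt as a failed acquisition is an assumption about what the caller wants.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;

// Hypothetical helper: take two locks with a timeout, or give up and hold nothing.
final class TimedLockPair {
    /**
     * Runs work while holding both locks. Returns false if either lock could
     * not be acquired within the timeout; no lock stays held in that case.
     */
    static boolean runWithBoth(Lock first, Lock second, long timeoutMillis, Runnable work) {
        if (!acquire(first, timeoutMillis)) {
            return false; // caller can map this to a 503 or a retry
        }
        try {
            if (!acquire(second, timeoutMillis)) {
                return false; // the outer finally releases the first lock
            }
            try {
                work.run();
                return true;
            } finally {
                second.unlock();
            }
        } finally {
            first.unlock();
        }
    }

    private static boolean acquire(Lock lock, long timeoutMillis) {
        try {
            return lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

Because a thread that times out releases what it holds, the hold-and-wait condition no longer lasts forever. A timeout only bounds the damage, though; a consistent lock order is still what removes the cycle.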
Lock ordering example
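// Bad: opposite order in two code paths
void transferA() { synchronized (user) { synchronized (order) { ... } } }
void transferB() { synchronized (order) { synchronized (user) { ... } } }

// Good: lock the lower identity hash first; a shared tie lock covers hash collisions
private static final Object TIE_LOCK = new Object();
void lockPair(Object a, Object b, Runnable work) {
    int ha = System.identityHashCode(a);
    int hb = System.identityHashCode(b);
    if (ha == hb && a != b) {
        // Identity hashes can collide, so colliding pairs are serialized through one extra lock.
        synchronized (TIE_LOCK) {
            synchronized (a) { synchronized (b) { work.run(); } }
        }
        return;
    }
    Object first = ha < hb ? a : b;
    Object second = first == a ? b : a;
    synchronized (first) {
        synchronized (second) {
            work.run();
        }
    }
}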
Production code often uses explicit lock IDs (userId, orderId) rather than identity hash, which can collide. Same idea: one canonical order.
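Ordering by an explicit ID might look like the sketch below. The `Account` class and its `id` field are hypothetical stand-ins for whatever entities you pair, and the sketch assumes IDs are unique and immutable.

```java
// Hypothetical entity with a unique, immutable id used as the lock-order key.
final class Account {
    final long id;
    long balance;

    Account(long id, long balance) {
        this.id = id;
        this.balance = balance;
    }
}

final class OrderedTransfer {
    /** Moves amount between two accounts, always locking the lower id first. */
    static void transfer(Account from, Account to, long amount) {
        if (from.id == to.id) {
            throw new IllegalArgumentException("same account");
        }
        // The lock order depends only on the ids, never on the transfer direction.
        Account first = from.id < to.id ? from : to;
        Account second = first == from ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }
}
```

Calling transfer(a, b, 1) and transfer(b, a, 1) from two threads acquires the monitors in the same order in both calls, so the circular wait cannot form.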
Database fixes
- Access tables in a fixed order in all code paths.
- Use consistent index paths so row locks align (avoid gap-lock surprises in MySQL).
- Retry deadlock victims with exponential backoff; cap retries and alert.
- Reduce lock scope: smaller transactions, lower isolation only if business allows.
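The retry bullet can be sketched as below. The names `DeadlockRetry` and `SqlWork` are hypothetical. The sketch assumes the JDBC driver reports SQLSTATE 40P01 (PostgreSQL deadlock_detected) or 40001 (serialization failure, which MySQL and SQL Server drivers typically use for deadlock victims); verify this for your driver, since vendor error codes such as 1205 mean different things on different databases.

```java
import java.sql.SQLException;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: retry a whole transaction when the DB picks it as deadlock victim.
final class DeadlockRetry {
    @FunctionalInterface
    interface SqlWork<T> {
        T run() throws SQLException;
    }

    // Assumption: the driver surfaces deadlock victims through these SQLSTATE values.
    static boolean isDeadlockVictim(SQLException e) {
        String state = e.getSQLState();
        return "40P01".equals(state) || "40001".equals(state);
    }

    /** Runs work up to maxAttempts times, sleeping with capped, jittered backoff in between. */
    static <T> T withRetry(SqlWork<T> work, int maxAttempts, long baseDelayMillis, long maxDelayMillis)
            throws SQLException {
        for (int attempt = 1; ; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                if (!isDeadlockVictim(e) || attempt >= maxAttempts) {
                    throw e; // not retryable, or retry budget spent: surface it and alert
                }
                long ceiling = Math.min(maxDelayMillis, baseDelayMillis << Math.min(attempt - 1, 20));
                sleep(ThreadLocalRandom.current().nextLong(ceiling + 1)); // random delay up to the cap
            }
        }
    }

    private static void sleep(long millis) throws SQLException {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new SQLException("interrupted during deadlock retry", ie);
        }
    }
}
```

The work passed in should cover the whole transaction (begin, statements, commit, and rollback on failure), and it must be idempotent, because the database has already rolled back the victim's earlier attempt.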
Verify after fix
- Load test that previously stuck threads (same RPS, duration ≥ 30 min).
- Thread dumps under load: no “Found one Java-level deadlock” section.
- DB: deadlock rate metric near zero; retries succeed without user-visible errors.
- Chaos: optional ThreadMXBean.findDeadlockedThreads() in health check during staging soak.
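The optional health-check probe could be sketched like this. `DeadlockProbe` is a hypothetical name and wiring it into a health endpoint is left out. The calls are from the standard java.lang.management API; findDeadlockedThreads() can throw UnsupportedOperationException on JVMs that do not support synchronizer monitoring, which this sketch does not handle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical probe for a staging soak: reports threads the JVM sees in a lock cycle.
final class DeadlockProbe {
    /** Returns one description per deadlocked thread; empty when none are found. */
    static List<String> findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is detected
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info == null) {
                continue; // thread ended between the two calls
            }
            report.add(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return report;
    }
}
```

A non-empty result is a reason to fail the soak and capture a full thread dump. Since the probe has a cost, polling it on a timer is probably safer than running it on every request.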
Prevention guardrails
- Design rule: max one lock per request path, or documented lock hierarchy in module README.
- Code review checklist: nested synchronized, listeners under lock, cross-service lock order.
- Integration tests with concurrent threads hammering transfer/booking flows.
- Alert: thread pool at 100% + zero completed requests for N minutes.
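An integration test that hammers a flow from several threads can be built on a small harness like the one below. This is a sketch with hypothetical names (`ConcurrencyHammer`, `completesWithin`); it treats "did not finish before the timeout" as a suspected deadlock, which is a heuristic rather than proof.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical test harness: runs tasks concurrently and reports whether all finished in time.
final class ConcurrencyHammer {
    /** Returns true if every task completed all iterations before the timeout. */
    static boolean completesWithin(List<Runnable> tasks, int iterations, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(tasks.size(), r -> {
            Thread t = new Thread(r, "hammer");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(tasks.size());
        for (Runnable task : tasks) {
            pool.execute(() -> {
                try {
                    start.await(); // release all workers together to increase interleaving
                    for (int i = 0; i < iterations; i++) {
                        task.run();
                    }
                    done.countDown(); // skipped if the task throws, so failures show as a timeout
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In a test, pass one task per direction of the flow (for example the two opposite transfer paths) and assert that the result is true. On a false result, take a thread dump before the test process exits.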
Interview one-liner
“I capture a thread dump and read the JVM’s deadlock section to get the cycle and stacks. Mitigate with restart if needed, then fix with a global lock order or by removing nested locks. For the database I align row access order and retry victims idempotently.”
Related scenarios
- Thread BLOCKED — single-lock contention
- Thread pool exhausted
- Shared data races
- Randomly unresponsive