Multiple threads update shared data incorrectly
Scenario
Production shows wrong balances, duplicate events, impossible counters, or rare crashes that disappear on retry. Tests pass on a laptop. Under load, two threads mutate the same field or collection without a happens-before relationship—a data race. You must find the shared mutable state and fix it without freezing the whole service.
After reading, you should be able to:
- Tell a race from deadlock or lock contention (wrong data vs stuck threads).
- Spot check-then-act, non-atomic read-modify-write, and unsafe HashMap patterns.
- Confirm with stress tests and targeted logging, not only thread dumps.
- Fix with confinement, immutability, atomics, or concurrent collections.
Why — interleaving breaks assumptions
A data race occurs when two or more threads access the same memory location, at least one write is involved, and there is no synchronization establishing happens-before between them. The JVM does not guarantee which order you see; logic that works on one thread fails when another interleaves.
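A minimal sketch of that definition, assuming nothing beyond the JDK: several threads increment a plain int field (a data race) and an AtomicInteger (no race) the same number of times. The class and method names are illustrative, not from any production code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class LostUpdateDemo {
    static int unsafeCount;                                   // plain field: count++ is read, add, write
    static final AtomicInteger safeCount = new AtomicInteger();

    /** Runs `threads` workers that each increment both counters `perThread` times. */
    static int[] run(int threads, int perThread) {
        unsafeCount = 0;
        safeCount.set(0);
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await();                            // release all workers together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    unsafeCount++;                            // racy read-modify-write
                    safeCount.incrementAndGet();              // atomic read-modify-write
                }
            });
            workers[t].start();
        }
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();                                     // join makes the final values visible here
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return new int[] { unsafeCount, safeCount.get() };
    }

    public static void main(String[] args) {
        int[] r = run(8, 100_000);
        System.out.println("expected=800000 unsafe=" + r[0] + " atomic=" + r[1]);
    }
}
```

The atomic counter always reaches the expected total. The plain field can never exceed it and, on a multi-core machine, usually falls short by a different amount on each run, which is why a single passing run proves little.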
Race vs other concurrency bugs
| Bug | Symptom | Threads |
|---|---|---|
| Data race | Wrong values, duplicates, torn reads | Usually still RUNNABLE; service “works” but lies |
| Deadlock | Hang, zero progress | Stuck in lock cycle — see deadlock guide |
| Contention | Slow, timeouts | Many BLOCKED on one lock |
| Pool exhausted | 503, all workers busy | Waiting on I/O — see pool guide |
Classic patterns that race in production
- Check-then-act: if (!map.containsKey(k)) map.put(k, v); two threads both pass the check.
- Non-atomic counter: count++ on a shared int field (lost updates).
- Unsafe collections: HashMap, ArrayList mutated from many request threads.
- Mutable singleton cache: static map updated without sync; visible in one pod under burst traffic.
- Broken double-checked locking: partially constructed object published without volatile.
- Compound actions: read balance, subtract, write; another thread interleaves between read and write.
- Visibility only: one thread writes a flag, another reads without volatile / sync; reader never sees update.
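For the double-checked locking pattern in the list above, the following is a sketch of the commonly used corrected form, where the instance field is volatile. LazyRegistry and its helper are hypothetical names made up for this illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class LazyRegistry {
    static final AtomicInteger constructions = new AtomicInteger();

    // volatile: the write below happens-before any later read that sees it,
    // so no thread can observe a partially constructed object.
    private static volatile LazyRegistry instance;

    private final String name;

    private LazyRegistry() {
        constructions.incrementAndGet();
        this.name = "registry";
    }

    String name() { return name; }

    static LazyRegistry get() {
        LazyRegistry local = instance;            // one volatile read on the fast path
        if (local == null) {
            synchronized (LazyRegistry.class) {
                local = instance;                 // re-check under the lock
                if (local == null) {
                    local = new LazyRegistry();
                    instance = local;             // volatile write publishes the finished object
                }
            }
        }
        return local;
    }

    /** Calls get() from many threads at once and returns how many distinct instances were seen. */
    static int distinctInstancesSeen(int threads) {
        Set<LazyRegistry> seen = ConcurrentHashMap.newKeySet();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                seen.add(get());
            });
            workers[t].start();
        }
        start.countDown();
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct=" + distinctInstancesSeen(16)
                + " constructions=" + constructions.get());
    }
}
```

Dropping volatile from the field brings back the bug from the list: the code still compiles and usually passes tests, but a reader may see a non-null reference to an object whose fields are not yet visible.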
Heisenbugs. Races are timing-dependent. A bug may appear only on certain hardware, after a deploy that changes GC pauses, or at peak QPS. “Cannot reproduce locally” often means insufficient concurrency in the test, not absence of the bug.
What — find shared mutable state and prove the race
- Define the invariant that broke — e.g. “account balance never negative,” “event id unique,” “inventory never below zero.” That tells you which variables must be atomic as a unit.
- Correlate with load — errors spike with traffic? Single region/pod? After feature flag? Points to per-instance mutable cache vs DB issue.
- Audit code paths for shared mutation. Search for:
  - static non-final fields, especially collections and counters
  - Singleton beans holding mutable maps/lists
  - HashMap / ArrayList on objects shared across requests
  - an if ( check followed by put / add without a lock or computeIfAbsent
- Thread dumps are secondary for races — dumps show where threads are, not that a counter was lost. Use dumps to rule out deadlock and massive blocking.
- Reproduce under stress
  // JUnit: many threads, CountDownLatch start gate, CyclicBarrier
  ExecutorService ex = Executors.newFixedThreadPool(32);
  // 10_000 iterations: same code path as production
  // Assert invariant (sum, size, no duplicates)
  Tools: jcstress, multithreaded stress IT in CI, Gatling/k6 at 2× peak RPS in staging.
- Targeted logging (temporary): log thread id + before/after values on suspect updates; compare sum of parts vs global counter in metrics.
- Business reconciliation — DB totals vs in-memory cache; payment ledger vs API counter. Mismatch localizes which subsystem races.
- Rule out DB race: a lost update at the DB layer needs transaction isolation or optimistic locking (UPDATE … WHERE version = ?), not only Java fixes.
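The "reproduce under stress" step in the list above can be sketched as a small reusable harness: a fixed pool, a latch that releases all workers at once, and an invariant check at the end. This is one possible shape, not a prescribed one. StressHarness, hammer, and the id-uniqueness example are assumptions made for illustration; in a real test the Runnable would call the production code path and the assertion would encode the invariant that broke.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class StressHarness {
    /** Runs `action` `iterations` times on each of `threads` workers, all released together. */
    static void hammer(int threads, int iterations, Runnable action) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch ready = new CountDownLatch(threads);
        CountDownLatch go = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        try {
            for (int t = 0; t < threads; t++) {
                pool.execute(() -> {
                    ready.countDown();                        // signal: this worker is parked at the gate
                    try {
                        go.await();
                        for (int i = 0; i < iterations; i++) {
                            action.run();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        done.countDown();
                    }
                });
            }
            ready.await();                                    // wait until every worker is waiting
            go.countDown();                                   // open the gate for maximum overlap
            if (!done.await(60, TimeUnit.SECONDS)) {
                throw new IllegalStateException("stress run timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    /** Example invariant: every generated id is unique. Returns the number of distinct ids. */
    static int distinctIds(int threads, int iterations) {
        AtomicLong nextId = new AtomicLong();
        Set<Long> seen = ConcurrentHashMap.newKeySet();
        hammer(threads, iterations, () -> seen.add(nextId.incrementAndGet()));
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct ids: " + distinctIds(32, 10_000) + " of 320000");
    }
}
```

Swapping the AtomicLong for a plain long field is a quick way to confirm the harness can catch a known race before pointing it at the suspect code. An exception thrown by the action only ends that worker, so the final invariant check is what fails the test.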
Smoking-gun code smells
// Race: two threads can both create
if (!cache.containsKey(id)) {
    cache.put(id, loadExpensive(id));
}
// Race: lost increments
metrics.successCount++;
// Race: lost elements, ArrayIndexOutOfBoundsException, or ConcurrentModificationException in a concurrent iterator
sharedList.add(item); // ArrayList from many threads
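For the shared-list smell, two JDK replacements are sketched below: a synchronized wrapper around ArrayList and a ConcurrentLinkedQueue. The class name and the helper are illustrative assumptions, and which one fits depends on whether callers need indexed access.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class SharedAppendDemo {
    /** Appends from many threads and returns { synchronized list size, concurrent queue size }. */
    static int[] appendConcurrently(int threads, int perThread) {
        List<Integer> syncList = Collections.synchronizedList(new ArrayList<>());
        Queue<Integer> queue = new ConcurrentLinkedQueue<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    syncList.add(i);   // each add holds the wrapper's lock
                    queue.add(i);      // lock-free append
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return new int[] { syncList.size(), queue.size() };
    }

    public static void main(String[] args) {
        int[] sizes = appendConcurrently(8, 10_000);
        System.out.println("synchronizedList=" + sizes[0] + " queue=" + sizes[1]);
    }
}
```

Both collections end with exactly threads × perThread elements. Iterating over the synchronized list still requires holding its monitor for the whole traversal, so it only fixes single-call operations such as add.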
How — fix safely without over-locking
Fix hierarchy (prefer top to bottom)
| Approach | Use when |
|---|---|
| Immutability | Replace map with new immutable copy on change; readers see stable snapshot |
| Thread confinement | Object never leaves creating thread (per-request locals only) |
| Concurrent collections / atomics | Shared cache, counters — ConcurrentHashMap, LongAdder, AtomicReference |
| Single-writer queue | All mutations on one actor thread; API posts events |
| Synchronized / Lock | Compound invariant spanning multiple fields |
| Database as source of truth | Unique constraint + transactional update; cache is read-through only |
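The single-writer row in the table is the least familiar of these, so here is a minimal sketch of it, assuming a single-thread executor plays the role of the actor thread. InventoryActor, its methods, and the SKU name are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class InventoryActor implements AutoCloseable {
    // A plain HashMap is safe here only because the writer thread is its sole accessor.
    private final Map<String, Integer> stock = new HashMap<>();
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    Future<Integer> restock(String sku, int qty) {
        return writer.submit(() -> stock.merge(sku, qty, Integer::sum));
    }

    /** Check and decrement run as one task, so no other mutation can interleave. */
    Future<Boolean> reserve(String sku, int qty) {
        return writer.submit(() -> {
            int available = stock.getOrDefault(sku, 0);
            if (available < qty) {
                return false;
            }
            stock.put(sku, available - qty);
            return true;
        });
    }

    Future<Integer> available(String sku) {
        return writer.submit(() -> stock.getOrDefault(sku, 0));
    }

    @Override
    public void close() {
        writer.shutdown();
    }

    /** Many callers compete for limited stock; returns { successful reservations, units left }. */
    static int[] demo(int initialStock, int callers) {
        AtomicInteger successes = new AtomicInteger();
        try (InventoryActor actor = new InventoryActor()) {
            actor.restock("sku-1", initialStock).get();
            Thread[] threads = new Thread[callers];
            for (int i = 0; i < callers; i++) {
                threads[i] = new Thread(() -> {
                    try {
                        if (actor.reserve("sku-1", 1).get()) {
                            successes.incrementAndGet();
                        }
                    } catch (InterruptedException | ExecutionException e) {
                        throw new IllegalStateException(e);
                    }
                });
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join();
            }
            return new int[] { successes.get(), actor.available("sku-1").get() };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = demo(100, 300);
        System.out.println("reserved=" + r[0] + " left=" + r[1]);
    }
}
```

The trade-off is throughput: every mutation queues behind one thread, so this suits small, hot pieces of state rather than work that blocks on I/O.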
Before / after examples
// Safe lazy init per key
cache.computeIfAbsent(id, this::loadExpensive);
// Safe counter
private final LongAdder successCount = new LongAdder();
successCount.increment();
// Safe publish of immutable snapshot
private volatile Map<String, Config> configRef = Map.of();
void reload(Map<String, Config> next) {
    configRef = Map.copyOf(next); // readers see old or new, never torn
}
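The computeIfAbsent fix can be checked directly: ConcurrentHashMap applies the mapping function at most once per key, so a load counter should read 1 however many threads ask for the same id. The sketch below assumes a stand-in loadExpensive; the class name and key are made up.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class OncePerKeyCache {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    final AtomicInteger loads = new AtomicInteger();

    // Hypothetical stand-in for the expensive load in the snippets above.
    private String loadExpensive(String id) {
        loads.incrementAndGet();
        return "value-for-" + id;
    }

    String get(String id) {
        // Check and insert happen atomically for this key.
        return cache.computeIfAbsent(id, this::loadExpensive);
    }

    /** All threads request the same key at once; returns how many times the loader ran. */
    static int loadsUnderContention(int threads) {
        OncePerKeyCache c = new OncePerKeyCache();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                c.get("42");
            });
            workers[t].start();
        }
        start.countDown();
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return c.loads.get();
    }

    public static void main(String[] args) {
        System.out.println("loader ran " + loadsUnderContention(16) + " time(s)");
    }
}
```

Other threads asking for the same key block while the function runs, and the function must not modify the same map, so a slow or recursive loader needs a different design (for example, caching a future per key).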
Financial / inventory style updates
// Bad: read-modify-write on shared object
balance = balance - amount;
// Good: DB with row lock or optimistic version
UPDATE account SET balance = balance - ?, version = version + 1
WHERE id = ? AND version = ? AND balance >= ?
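The optimistic UPDATE has an in-memory analogue that makes the retry loop visible: read a snapshot, compute the next one, and commit only if nobody changed it in between. The sketch below uses AtomicReference.compareAndSet purely to illustrate that shape. VersionedAccount is a hypothetical class, and for real money the database statement above remains the authority.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

class VersionedAccount {
    /** Immutable snapshot: balance and version always change together. */
    static final class State {
        final long balance;
        final long version;
        State(long balance, long version) {
            this.balance = balance;
            this.version = version;
        }
    }

    private final AtomicReference<State> state;

    VersionedAccount(long openingBalance) {
        this.state = new AtomicReference<>(new State(openingBalance, 0));
    }

    State snapshot() { return state.get(); }

    boolean withdraw(long amount) {
        while (true) {
            State current = state.get();
            if (current.balance < amount) {
                return false;                              // plays the role of "AND balance >= ?"
            }
            State next = new State(current.balance - amount, current.version + 1);
            if (state.compareAndSet(current, next)) {      // plays the role of "AND version = ?"
                return true;
            }
            // Another thread committed first: loop, re-read, and try again.
        }
    }

    /** Many threads withdraw the same amount; returns the number of successful withdrawals. */
    static int concurrentWithdrawals(VersionedAccount account, int threads, long amount) {
        AtomicInteger ok = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (account.withdraw(amount)) {
                    ok.incrementAndGet();
                }
            });
            workers[t].start();
        }
        start.countDown();
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return ok.get();
    }

    public static void main(String[] args) {
        VersionedAccount account = new VersionedAccount(500);
        int ok = concurrentWithdrawals(account, 100, 10);
        System.out.println("succeeded=" + ok + " balance=" + account.snapshot().balance);
    }
}
```

With a balance of 500 and 100 threads each withdrawing 10, exactly 50 withdrawals succeed and the balance ends at 0. In the SQL version, an update count of 0 is the signal to re-read the row and retry or reject.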
What not to do
- Wrap entire service methods in synchronized: fixes races by serializing everything; creates contention and pool exhaustion.
- Rely on volatile for count++: volatile does not make compound ops atomic.
- Use Collections.synchronizedMap for compound check-then-act without synchronizing on the map monitor for the whole operation.
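If a Collections.synchronizedMap has to stay, the last point above means holding the map's own monitor across both the check and the put. A minimal sketch, with a hypothetical class and loader:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class SynchronizedMapCompound {
    private final Map<String, String> cache = Collections.synchronizedMap(new HashMap<>());
    final AtomicInteger loads = new AtomicInteger();

    // Hypothetical loader; counts invocations so duplicates are detectable.
    private String load(String id) {
        loads.incrementAndGet();
        return "value-for-" + id;
    }

    String getOrLoad(String id) {
        // The wrapper locks on itself for each call, so locking the same object
        // here makes get + put one atomic unit.
        synchronized (cache) {
            String value = cache.get(id);
            if (value == null) {
                value = load(id);
                cache.put(id, value);
            }
            return value;
        }
    }

    /** All threads request the same key at once; returns how many times the loader ran. */
    static int loadsUnderContention(int threads) {
        SynchronizedMapCompound c = new SynchronizedMapCompound();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                c.getOrLoad("42");
            });
            workers[t].start();
        }
        start.countDown();
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return c.loads.get();
    }

    public static void main(String[] args) {
        System.out.println("loader ran " + loadsUnderContention(16) + " time(s)");
    }
}
```

This is correct but serializes every lookup behind one lock, including lookups for unrelated keys.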
Verify the fix
- Stress test that failed before: same thread count, duration ≥ 10 min.
- Reconciliation job: invariants hold over 24h in staging/production canary.
- Code review: no new shared mutable statics without documented thread-safety.
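- Optional: run the stress test with HotSpot's C2 scheduling stress flags, for example -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM, only in a dedicated test JVM (not prod). Availability of these flags depends on the JDK version and build.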
Prevention
- Default new services to immutable DTOs and stateless handlers.
- Checklist: any static mutable field needs an explicit concurrency story.
- CI: multithreaded tests for cache, idempotency, and counter modules.
- Prefer ConcurrentHashMap over synchronized HashMap when the map is truly shared.
Interview one-liner
“I define the invariant that broke, find shared mutable state updated without synchronization, reproduce with a concurrent stress test, then fix with confinement, atomics, or computeIfAbsent—and use the database for authoritative state when money or inventory is involved.”