Memory leak in production
Scenario
You suspect a memory leak: heap usage climbs over hours, GC runs more often but reclaims less, and eventually you hit OutOfMemoryError or the pod is OOMKilled. Traffic is flat. How do you prove it is a leak, and how do you find what is holding the memory?
After reading, you should be able to:
- Distinguish a leak from legitimate growth (cache warmup, higher load).
- Confirm with metrics and two heap dumps separated in time.
- Find the retaining path with MAT dominators and GC roots.
- Fix common patterns: static maps, listeners, ThreadLocal, classloaders.
Related: OutOfMemoryError in production (when the JVM finally fails).
Why — what a “memory leak” is in Java
Java does not leak memory the way C does with forgotten free(). Objects become eligible for GC when
no GC root can reach them. A “leak” means your code still holds references—directly or indirectly—so
objects that should be dead stay alive forever (or until redeploy).
Leak vs not a leak
| Pattern | Heap after GC | Usually |
|---|---|---|
| Leak | Old gen baseline keeps rising | Unbounded collection, missing cleanup |
| Legitimate growth | Rises then plateaus | Cache fills to configured max; new feature holds more per request |
| Traffic-correlated | Tracks QPS, drops when quiet | Session map or request-scoped data; not a leak if it is cleared when traffic drops |
| Metaspace leak | Heap OK; metaspace climbs | Classloader not collected (hot redeploy, OSGi-style plugins) |
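The first two rows of the table can be told apart mechanically once you have a series of post-GC heap samples. The sketch below is one possible heuristic, not a standard algorithm: it fits a least-squares line to the second half of the window (so warmup is ignored) and flags the series as still rising if the fitted growth exceeds a fraction of the mean. The class name `HeapTrend`, the 5% threshold, and the sample values are illustrative assumptions.

```java
/** Hypothetical helper: classifies a series of post-GC heap samples. */
final class HeapTrend {
    enum Verdict { PLATEAU, STILL_RISING }

    /** Least-squares slope (bytes per sample) over samples[from..to). */
    static double slope(long[] samples, int from, int to) {
        int m = to - from;
        double xMean = (m - 1) / 2.0;
        double yMean = 0;
        for (int i = from; i < to; i++) yMean += samples[i];
        yMean /= m;
        double num = 0, den = 0;
        for (int i = from; i < to; i++) {
            double dx = (i - from) - xMean;
            num += dx * (samples[i] - yMean);
            den += dx * dx;
        }
        return num / den;
    }

    /** Looks only at the second half of the window so that warmup is excluded. */
    static Verdict classify(long[] postGcBytes, double minGrowthFraction) {
        int n = postGcBytes.length;
        if (n < 4) throw new IllegalArgumentException("need at least 4 samples");
        int mid = n / 2;
        // Growth predicted by the fitted line across the second half.
        double fittedGrowth = slope(postGcBytes, mid, n) * (n - mid - 1);
        double mean = 0;
        for (int i = mid; i < n; i++) mean += postGcBytes[i];
        mean /= (n - mid);
        return fittedGrowth > minGrowthFraction * mean ? Verdict.STILL_RISING : Verdict.PLATEAU;
    }

    public static void main(String[] args) {
        long[] warmup = {100, 200, 300, 400, 400, 400, 400, 400}; // made-up samples
        long[] leak   = {100, 200, 300, 400, 500, 600, 700, 800}; // made-up samples
        System.out.println(classify(warmup, 0.05)); // PLATEAU
        System.out.println(classify(leak, 0.05));   // STILL_RISING
    }
}
```

A real check would also need traffic data alongside the heap series, since a rising line during a traffic ramp is not evidence of a leak.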
Common retaining causes
- Static or singleton collections: private static Map<String, User> cache = new HashMap<>() with no eviction; every user ever seen stays forever.
- Listeners / callbacks not removed: observer registered on a long-lived bus; the short-lived object is never unregistered.
- ThreadLocal without remove(): thread pools reuse threads; the value survives across requests (common in Tomcat/Spring apps).
- Incorrect cache key: key includes a timestamp or request id → infinite distinct entries.
- HTTP session or custom “context” maps: objects stored per session that never time out.
- Classloader leak: undeployed WAR’s loader retained by a stray reference; all its classes and statics stay loaded.
- Off-heap / direct buffers: not a classic heap leak, but RSS grows; see heap vs RSS.
Mental model: A leak is a reference management bug, not “GC is broken.” Fix who points at the object, not only heap size.
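To make the "incorrect cache key" cause concrete, here is a minimal sketch of how a request id in the key turns a cache into an unbounded map. The class `CacheKeyDemo`, the method names, and the `"profile-of-"` values are hypothetical and exist only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical demo: the same lookup with a bad key and with a stable key. */
final class CacheKeyDemo {
    static final Map<String, String> LEAKY = new ConcurrentHashMap<>();
    static final Map<String, String> STABLE = new ConcurrentHashMap<>();

    /** Bad: the request id makes every key unique, so nothing is ever reused. */
    static String loadLeaky(String userId, long requestId) {
        return LEAKY.computeIfAbsent(userId + ":" + requestId, k -> "profile-of-" + userId);
    }

    /** Better: the key depends only on what identifies the cached value. */
    static String loadStable(String userId, long requestId) {
        return STABLE.computeIfAbsent(userId, k -> "profile-of-" + userId);
    }

    public static void main(String[] args) {
        for (long requestId = 0; requestId < 1_000; requestId++) {
            loadLeaky("alice", requestId);
            loadStable("alice", requestId);
        }
        System.out.println("leaky entries:  " + LEAKY.size());  // 1000
        System.out.println("stable entries: " + STABLE.size()); // 1
    }
}
```

The stable map is still unbounded in the number of users, so it would still need a size limit or expiry in a real service.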
What — confirm the leak, then find the holder
Phase A — Is it really a leak?
- Plot heap used after full GC over 24h. Stair-step up with no plateau at steady traffic → suspect leak. Use JMX/Micrometer jvm.memory.used (old gen) or GC log “heap after GC” lines.
- Compare to traffic and deploys. Step change at deploy → config or code change. Linear growth with flat QPS → leak. Growth only during peak → may be request-scoped accumulation.
- Check metaspace separately. If only metaspace rises → classloader leak, not heap object leak. Different tools and fix.
- Rule out container limit creep. RSS at limit but heap moderate → native/direct/thread stacks; not solved by MAT heap analysis alone.
# Quick live histogram (no full dump): run twice 30 min apart
jcmd <pid> GC.class_histogram | head -40
# If the instance count of one class (e.g. com.app.SessionData) doubled → strong signal
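Comparing two histograms by eye gets tedious, so the comparison can be scripted. The sketch below assumes the usual `num: #instances #bytes class name` row layout of the histogram output. The class `HistogramDiff`, the growth factor, and the sample rows in `main` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds classes whose instance count grew between two histograms. */
final class HistogramDiff {
    // Matches rows such as "   1:       1000       64000  com.app.SessionData"
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    static Map<String, Long> parseInstances(String histogram) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : histogram.split("\\R")) {
            Matcher m = ROW.matcher(line);
            if (m.find()) counts.merge(m.group(3), Long.parseLong(m.group(1)), Long::sum);
        }
        return counts;
    }

    /** Classes present in both snapshots whose count grew by at least the given factor. */
    static List<String> grown(String before, String after, double factor) {
        Map<String, Long> a = parseInstances(before);
        Map<String, Long> b = parseInstances(after);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : b.entrySet()) {
            Long old = a.get(e.getKey());
            if (old != null && old > 0 && e.getValue() >= factor * old) out.add(e.getKey());
        }
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not real output.
        String before = "   1:       1000       64000  com.app.SessionData\n"
                      + "   2:        500       12000  java.lang.String (java.base@17)\n";
        String after  = "   1:       2100      134400  com.app.SessionData\n"
                      + "   2:        520       12480  java.lang.String (java.base@17)\n";
        System.out.println(grown(before, after, 2.0)); // [com.app.SessionData]
    }
}
```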
Phase B — Capture evidence
- Heap dump #1, after the service is warm, at baseline traffic: jcmd <pid> GC.heap_dump /dumps/heap-baseline.hprof
- Wait 30–60 minutes (or until old gen grows noticeably) at stable load.
- Heap dump #2: heap-growth.hprof. Optionally trigger a full GC before the dump: jcmd <pid> GC.run, then dump.
- Thread dump: same timestamps; count threads, spot custom pool growth.
- GC logs: confirm old gen post-GC size trend between dumps.
In Kubernetes, schedule dumps to a mounted volume or use kubectl exec before pod restart. Automate with cron sidecar only if dumps are rotated—.hprof files are large.
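When shelling into the pod is inconvenient, a dump can also be triggered from inside the JVM, for example behind an admin-only endpoint. The sketch below uses the HotSpot-specific `com.sun.management.HotSpotDiagnosticMXBean`, so it assumes a HotSpot-based JDK. The class `HeapDumper`, the file naming scheme, and the default directory are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: writes an .hprof file from inside the running JVM. */
final class HeapDumper {
    /**
     * @param liveOnly if true, only reachable objects are dumped
     * @return path of the written dump
     */
    static Path dump(Path dir, String label, boolean liveOnly) throws IOException {
        Files.createDirectories(dir);
        // dumpHeap refuses to overwrite an existing file, so the name includes a timestamp.
        Path file = dir.resolve("heap-" + label + "-" + System.currentTimeMillis() + ".hprof");
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(file.toString(), liveOnly);
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        Path written = dump(dir, "baseline", true);
        System.out.println("wrote " + written + " (" + Files.size(written) + " bytes)");
    }
}
```

The dump pauses the application while it is written and the file can be as large as the heap, so the same rotation caveat applies.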
Phase C — Analyze in Eclipse MAT
- Open both dumps → Compare Basket (or “Compare to another heap dump”).
- Sort by delta retained heap — which types grew most between dumps?
- On the growth dump, open Dominator tree for the top class.
- Path to GC roots → exclude weak/soft refs first; find strongest path (often static field, thread local, or cache singleton).
- Run Leak Suspects Report on the second dump for automated hints.
- Inspect Duplicate strings / char[] if leak is many similar keys (bad cache keys).
What the retaining path often looks like
java.lang.Thread
└─ ThreadLocalMap
└─ ThreadLocal$Entry
└─ YourContext.userCache
└─ HashMap$Node[]
└─ millions of SessionDto (LEAK)
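A path like this arises because a pooled worker thread outlives the request that set the value. The following sketch reproduces the effect with a single-thread pool: a value set by one task is still visible to the next task unless it is removed. The class `StaleThreadLocalDemo` and the `"user-42"` value are illustrative assumptions.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical demo: a ThreadLocal value surviving across tasks on a pooled thread. */
final class StaleThreadLocalDemo {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    /** Runs two tasks on the same worker; returns what the second task sees. */
    static String run(boolean cleanUp) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Runnable firstRequest = () -> {
                try {
                    CONTEXT.set("user-42");
                } finally {
                    if (cleanUp) CONTEXT.remove();
                }
            };
            pool.submit(firstRequest).get();

            Callable<String> secondRequest = CONTEXT::get;
            return pool.submit(secondRequest).get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("without remove(): " + run(false)); // user-42
        System.out.println("with remove():    " + run(true));  // null
    }
}
```

In the leak case the stale value is a large object graph rather than a short string, and it stays reachable for as long as the worker thread lives.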
Lightweight profiling (without full dump)
- async-profiler: alloc event in staging reproduces hot allocation sites (where objects are born; leak is where they are kept).
- JFR (Java Flight Recorder): allocation profiling with low overhead in JDK 11+.
- Prometheus + Grafana: alert on old gen % rising for 2h while QPS flat.
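If no metrics agent is in place yet, the post-GC figure behind such an alert can be read in-process from the standard `java.lang.management` API. The sketch below sums `getCollectionUsage()` over heap pools whose name contains "Old" or "Tenured". That name match is an assumption that fits common HotSpot collectors and may need adjusting for others. The class name `PostGcOldGen` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: old-gen bytes still in use after the most recent collection. */
final class PostGcOldGen {
    static long usedAfterLastGc() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Assumption: old-gen pools are named like "G1 Old Gen" or "Tenured Gen".
            boolean oldGen = name.contains("Old") || name.contains("Tenured");
            if (pool.getType() != MemoryType.HEAP || !oldGen) continue;
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not support it
            if (afterGc != null) total += afterGc.getUsed();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("old gen after last GC: " + usedAfterLastGc() + " bytes");
    }
}
```

Sampling this value once a minute and exporting it as a gauge gives the series that the alert needs.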
How — fix, verify, prevent
Fix patterns by root cause
| Root cause | Fix |
|---|---|
| Unbounded map/cache | Caffeine/Guava with maximumSize, expireAfterWrite; or periodic cleanup job |
| ThreadLocal | try { ... } finally { threadLocal.remove(); } in filters/interceptors |
| Event listener | unregister in @PreDestroy / servlet destroy; weak listeners only if semantics allow |
| Session attributes | Shorter timeout; don’t store large graphs in session; use server-side store with TTL |
| Classloader leak | Remove reference from singleton; avoid JDBC driver registered from wrong loader; restart on redeploy in dev |
| Generated classes | Limit dynamic codegen; pool script engines; upgrade libs with known metaspace bugs |
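For the event listener row, one way to make unregistration hard to forget is to have registration return a handle that removes the listener when closed. This is a minimal sketch under that assumption. `EventBus` and `Subscription` are hypothetical types, not an existing library API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical long-lived bus whose subscriptions can be closed. */
final class EventBus {
    /** Handle returned by subscribe(); closing it removes the listener. */
    interface Subscription extends AutoCloseable {
        @Override
        void close(); // narrowed so callers need not handle a checked exception
    }

    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

    Subscription subscribe(Consumer<String> listener) {
        listeners.add(listener);
        return () -> listeners.remove(listener);
    }

    void publish(String event) {
        for (Consumer<String> listener : listeners) listener.accept(event);
    }

    int listenerCount() {
        return listeners.size();
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        try (Subscription s = bus.subscribe(e -> System.out.println("got " + e))) {
            bus.publish("order-created"); // prints "got order-created"
        }
        System.out.println("listeners after close: " + bus.listenerCount()); // 0
    }
}
```

In a managed component, the `close()` call would go in the `@PreDestroy` or `destroy` method named in the table.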
Example: bounded cache replacement
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

// Before: leak under load
private static final Map<String, byte[]> REPORT_CACHE = new ConcurrentHashMap<>();
// After: bounded + TTL
private static final Cache<String, byte[]> REPORT_CACHE = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofMinutes(30))
        .build();
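Where adding a cache library is not an option, a size bound alone can be had from the JDK. The sketch below is a different technique from the Caffeine example: an access-ordered `LinkedHashMap` with `removeEldestEntry`, wrapped in `Collections.synchronizedMap`. It has no TTL and coarser locking. The class name `BoundedLruCache` is hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stdlib-only LRU cache: bounded by entry count, no expiry. */
final class BoundedLruCache {
    static <K, V> Map<K, V> create(int maxEntries) {
        // accessOrder = true keeps the least recently used entry first.
        return Collections.synchronizedMap(new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // evict once the bound is exceeded
            }
        });
    }

    public static void main(String[] args) {
        Map<String, Integer> cache = create(3);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.put("c", 3);
        cache.get("a");    // touch "a" so that "b" becomes the eldest entry
        cache.put("d", 4); // exceeds the bound, evicts "b"
        System.out.println(cache.keySet()); // [c, a, d]
    }
}
```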
Example: ThreadLocal cleanup in a servlet filter
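public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException {
    try {
        RequestContext.set(userId); // userId: resolved from the request earlier (not shown)
        chain.doFilter(request, response);
    } finally {
        RequestContext.clear(); // must call ThreadLocal.remove() inside
    }
}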
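The filter relies on a holder class that is not shown. A minimal sketch of what such a holder might look like follows. The `RequestContext` here is an assumed shape with a single `String` user id, not the original application's class.

```java
/** Hypothetical request-scoped holder backed by a ThreadLocal. */
final class RequestContext {
    private static final ThreadLocal<String> USER_ID = new ThreadLocal<>();

    private RequestContext() {}

    static void set(String userId) {
        USER_ID.set(userId);
    }

    static String get() {
        return USER_ID.get();
    }

    /** Uses remove(), not set(null), so the entry itself leaves the thread's map. */
    static void clear() {
        USER_ID.remove();
    }

    public static void main(String[] args) {
        RequestContext.set("user-42");
        System.out.println(RequestContext.get()); // user-42
        RequestContext.clear();
        System.out.println(RequestContext.get()); // null
    }
}
```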
Verify the fix
- Deploy to staging with same soak duration (4–8 hours) and flat load generator.
- Old gen after GC should plateau, not climb linearly.
- Repeat compare-dump test: delta retained heap near zero after plateau.
- Optional (advanced): add a soak test to CI that fails the build if old gen after GC exceeds a threshold after a 1h soak.
Prevention
- Code review checklist: any static Map/List must have documented bounds.
- Always clear ThreadLocal in pooled-thread environments.
- Dashboard: heap after GC, metaspace, allocation rate, pod restart count.
- Alert: old gen > 80% for 30 min while RPS within ±10% of 24h median.
- Document max in-memory rows for batch jobs; stream DB results instead of findAll().
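The alert condition in the list above is normally written in the monitoring system's own rule language. As a language-neutral illustration of the logic, the sketch below evaluates it over arrays of samples that the caller has already limited to the last 30 minutes. The class `LeakAlertRule` and its parameters are hypothetical.

```java
/** Hypothetical evaluation of "old gen > 80% for the whole window while traffic is flat". */
final class LeakAlertRule {
    /**
     * @param oldGenPct    old-gen occupancy samples in percent, covering the alert window
     * @param rps          requests per second at the same sample times
     * @param medianRps24h median RPS over the previous 24 hours
     */
    static boolean shouldAlert(double[] oldGenPct, double[] rps, double medianRps24h) {
        if (oldGenPct.length == 0 || oldGenPct.length != rps.length) {
            throw new IllegalArgumentException("need equally sized, non-empty sample arrays");
        }
        for (int i = 0; i < oldGenPct.length; i++) {
            boolean heapHigh = oldGenPct[i] > 80.0;
            boolean trafficFlat = Math.abs(rps[i] - medianRps24h) <= 0.10 * medianRps24h;
            if (!heapHigh || !trafficFlat) return false; // condition must hold for every sample
        }
        return true;
    }

    public static void main(String[] args) {
        double[] heap = {82, 84, 85, 87};       // made-up samples
        double[] flat = {100, 104, 97, 101};    // made-up samples
        double[] spike = {100, 150, 97, 101};   // made-up samples
        System.out.println(shouldAlert(heap, flat, 100));  // true
        System.out.println(shouldAlert(heap, spike, 100)); // false
    }
}
```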
When not to chase a “leak”
- Heap grew because you legitimately raised the cache size in config: verify the config diff.
- A one-off batch job loaded 2M rows: fix the job, not the JVM; heap drops after the job ends.
- The only fix is more heap for new product scale: that is capacity planning, not a leak hunt.
Interview one-liner
“I confirm a leak with post-GC heap trending up at flat traffic, take two heap dumps spaced apart, compare retained size in MAT, follow dominators to GC roots, then fix the retaining reference—usually an unbounded cache or ThreadLocal—not just increase -Xmx.”
Related scenarios
- OutOfMemoryError — when the JVM runs out of room.
- Slow after a few hours — often the first visible symptom of a leak.
- Frequent GC pauses — when old gen is full of live objects.
- Metaspace growth — classloader leaks.