Memory leak in production

Scenario

You suspect a memory leak: heap usage climbs over hours, GC runs more often but reclaims less, and eventually you hit OutOfMemoryError or the pod is OOMKilled. Traffic is flat. How do you prove it is a leak, and how do you find what is holding the memory?

After reading, you should be able to:

Distinguish a leak from legitimate growth (cache warmup, higher load).
Confirm with metrics and two heap dumps separated in time.
Find the retaining path with MAT dominators and GC roots.
Fix common patterns: static maps, listeners, ThreadLocal, classloaders.

Related: OutOfMemoryError in production (when the JVM finally fails).

Why — what a “memory leak” is in Java

Java does not leak memory the way C does with forgotten free(). Objects become eligible for GC when no GC root can reach them. A “leak” means your code still holds references—directly or indirectly—so objects that should be dead stay alive forever (or until redeploy).

Leak vs not a leak

Pattern	Heap after GC	Typical cause
Leak	Old gen baseline keeps rising	Unbounded collection, missing cleanup
Legitimate growth	Rises then plateaus	Cache fills to configured max; new feature holds more per request
Traffic-correlated	Tracks QPS, drops when quiet	Session map or request-scoped data; not a leak if it is cleared
Metaspace leak	Heap OK; metaspace climbs	Classloader not collected (hot redeploy, OSGi-style plugins)

Common retaining causes

Static or singleton collections — private static Map<String, User> cache = new HashMap<>() with no eviction; every user ever seen stays forever.
Listeners / callbacks not removed — observer registered on long-lived bus; short-lived object never unregistered.
ThreadLocal without remove() — thread pools reuse threads; value survives across requests (common in Tomcat/Spring apps).
Incorrect cache key — key includes timestamp or request id → infinite distinct entries.
HTTP session or custom “context” maps: objects stored per session that never time out.
Classloader leak — undeployed WAR’s loader retained by a stray reference; all its classes and statics stay loaded.
Off-heap / direct buffers — not a classic heap leak but RSS grows; see heap vs RSS.

Mental model: A leak is a reference management bug, not “GC is broken.” Fix who points at the object, not only heap size.
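
The "incorrect cache key" cause is easy to reproduce in isolation. The following is a minimal sketch, not code from any real service: the class name, the "user-42" key, and the report strings are all invented for illustration. It shows how adding a request id to the key turns a one-entry cache into one that grows with every request.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo: all names are illustrative.
class CacheKeyDemo {

    // Simulates `requests` lookups for a single user and returns the final cache size.
    static int entriesAfter(int requests, boolean keyIncludesRequestId) {
        Map<String, String> cache = new HashMap<>();
        for (int requestId = 0; requestId < requests; requestId++) {
            String key = keyIncludesRequestId
                    ? "user-42:" + requestId   // bad: every request produces a new key
                    : "user-42";               // good: stable key, the entry is reused
            cache.computeIfAbsent(key, k -> "report for " + k);
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println("stable key:     " + entriesAfter(1_000, false));
        System.out.println("request-id key: " + entriesAfter(1_000, true));
    }
}
```

With the stable key the map holds one entry after 1,000 requests; with the request id in the key it holds 1,000. In a heap dump this shows up as many near-identical key strings.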

What — confirm the leak, then find the holder

Phase A — Is it really a leak?

Plot heap used after full GC over 24h: stair-step up with no plateau at steady traffic → suspect leak. Use JMX/Micrometer jvm.memory.used filtered to the old gen pool (Micrometer's jvm.gc.live.data.size gauge reports the old gen size after a collection) or GC log “heap after GC” lines.
Compare to traffic and deploys: step change at deploy → config or code change. Linear growth with flat QPS → leak. Growth only during peak → may be request-scoped accumulation.
Check metaspace separately: if only metaspace rises → classloader leak, not heap object leak. Different tools and fix.
Rule out container limit creep: RSS at limit but heap moderate → native memory, direct buffers, or thread stacks; MAT heap analysis alone will not solve it.
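
One way to make "stair-step up with no plateau" concrete is to fit a line through the post-GC samples. This is a rough sketch under stated assumptions: you already export (hour, MB) samples from metrics or GC logs, and the class name, sample values, and threshold are made up for illustration rather than recommended settings.

```java
import java.util.Arrays;

// Hypothetical helper: names, data, and thresholds are illustrative.
class PostGcTrend {

    // Ordinary least-squares slope of y over x (here: MB of old gen after GC per hour).
    static double slope(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }

    // Fits only the second half of the window so that cache warm-up is not counted as a leak.
    static boolean looksLikeLeak(double[] hours, double[] postGcMb, double mbPerHourThreshold) {
        int from = hours.length / 2;
        double[] x = Arrays.copyOfRange(hours, from, hours.length);
        double[] y = Arrays.copyOfRange(postGcMb, from, postGcMb.length);
        return slope(x, y) > mbPerHourThreshold;
    }

    public static void main(String[] args) {
        double[] hours = {0, 1, 2, 3, 4, 5, 6, 7};
        double[] leaking = {400, 450, 500, 550, 600, 650, 700, 750}; // keeps climbing
        double[] warmup = {400, 600, 700, 740, 750, 750, 750, 750};  // rises, then plateaus
        System.out.println("leaking: " + looksLikeLeak(hours, leaking, 1.0));
        System.out.println("warmup:  " + looksLikeLeak(hours, warmup, 1.0));
    }
}
```

The steadily climbing series is flagged and the warm-up series is not, because only the later samples are fitted. The check is only meaningful while traffic is flat over the same window.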

# Quick live histogram (no full dump) — run twice 30 min apart
jcmd <pid> GC.class_histogram | head -40

# If the instance count of the same class (e.g. com.app.SessionData) doubled → strong signal
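
Comparing two histograms by eye gets tedious, so the comparison can be scripted. The sketch below assumes the usual row layout of GC.class_histogram output (rank, instance count, bytes, class name); the class name HistogramDiff and the sample rows are invented for illustration and are not real measurements.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses two saved histogram outputs and reports classes that grew.
class HistogramDiff {

    // Matches data rows such as "   1:   27284   3062920  [B (java.base@17)".
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    // Maps class name to instance count; header, separator, and total lines are skipped.
    static Map<String, Long> instanceCounts(String histogram) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : histogram.split("\\R")) {
            Matcher m = ROW.matcher(line);
            if (m.find()) {
                counts.put(m.group(3), Long.parseLong(m.group(1)));
            }
        }
        return counts;
    }

    // Classes whose instance count grew by at least `factor`, mapped to the number of added instances.
    static Map<String, Long> grown(String before, String after, double factor) {
        Map<String, Long> first = instanceCounts(before);
        Map<String, Long> result = new LinkedHashMap<>();
        instanceCounts(after).forEach((cls, count) -> {
            long old = first.getOrDefault(cls, 0L);
            if (old > 0 && count >= old * factor) {
                result.put(cls, count - old);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the histogram layout.
        String before = "   1:   1000   64000  com.app.SessionData\n"
                      + "   2:    500   12000  java.lang.String (java.base@17)\n";
        String after  = "   1:   2100  134400  com.app.SessionData\n"
                      + "   2:    510   12240  java.lang.String (java.base@17)\n";
        System.out.println(grown(before, after, 2.0));
    }
}
```

Classes that first appear in the second snapshot are ignored in this sketch; a real script would probably report those as well.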

Phase B — Capture evidence

Heap dump #1 (service warmed up, baseline traffic): jcmd <pid> GC.heap_dump /dumps/heap-baseline.hprof
Wait 30–60 minutes (or until old gen grows noticeably) at stable load.
Heap dump #2 (heap-growth.hprof): same command with the new file name. GC.heap_dump writes only live objects unless you pass -all, so running jcmd <pid> GC.run first is optional.
Thread dump: take one at each of the same timestamps (jcmd <pid> Thread.print); count threads and look for custom pools that keep growing.
GC logs: confirm the old gen post-GC size trend between the two dumps.

In Kubernetes, schedule dumps to a mounted volume or use kubectl exec before pod restart. Automate with cron sidecar only if dumps are rotated—.hprof files are large.
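
When shelling into the pod is awkward, a dump can also be triggered from inside the JVM, for example behind an admin-only endpoint. This is a minimal sketch that assumes a HotSpot-based JVM, where com.sun.management.HotSpotDiagnosticMXBean is available; the HeapDumper class and the file naming are illustrative choices.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: triggers a heap dump from inside the running JVM.
class HeapDumper {

    // Writes heap-<label>.hprof into `directory`. live=true dumps only reachable objects.
    // dumpHeap fails with an IOException if the target file already exists.
    static Path dump(Path directory, String label) throws IOException {
        Path target = directory.resolve("heap-" + label + ".hprof");
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(target.toString(), true);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // A temp directory keeps the sketch self-contained; in a pod this would be the mounted volume.
        Path dir = Files.createTempDirectory("dumps");
        Path file = dump(dir, "baseline");
        System.out.println(file + " (" + Files.size(file) + " bytes)");
    }
}
```

Dumping pauses the application while the file is written, so anything that exposes this should be protected and rate-limited.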

Phase C — Analyze in Eclipse MAT

Open both dumps → Compare Basket (or “Compare to another heap dump”).
Sort by delta retained heap — which types grew most between dumps?
On the growth dump, open Dominator tree for the top class.
Path to GC roots → exclude weak/soft refs first; find strongest path (often static field, thread local, or cache singleton).
Run Leak Suspects Report on the second dump for automated hints.
Inspect Duplicate strings / char[] if leak is many similar keys (bad cache keys).

What the retaining path often looks like

java.lang.Thread (field threadLocals)
  └─ ThreadLocal$ThreadLocalMap
       └─ ThreadLocal$ThreadLocalMap$Entry (field value)
            └─ YourContext.userCache
                 └─ HashMap$Node[]
                      └─ millions of SessionDto (LEAK)
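
A path rooted in a Thread like this usually means a ThreadLocal value was set on a pooled thread and never removed. The sketch below reproduces the effect with a single-thread executor standing in for a request thread pool; the class name and the "user-42" value are invented for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo: a one-thread pool stands in for a servlet container's worker pool.
class ThreadLocalReuseDemo {

    private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    // Runs two "requests" on the same pooled thread and returns what the second one sees.
    static String valueSeenBySecondRequest(boolean cleanUp) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Runnable firstRequest = () -> {
                try {
                    CONTEXT.set("user-42");
                } finally {
                    if (cleanUp) {
                        CONTEXT.remove();
                    }
                }
            };
            Callable<String> secondRequest = CONTEXT::get;
            pool.submit(firstRequest).get();          // wait until the first request is done
            return pool.submit(secondRequest).get();  // same worker thread is reused
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("without remove(): " + valueSeenBySecondRequest(false));
        System.out.println("with remove():    " + valueSeenBySecondRequest(true));
    }
}
```

Without remove() the second request still sees "user-42"; with remove() it sees null. The stale value is both a memory problem and a data-isolation problem, since it belongs to a different request.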

Lightweight profiling (without full dump)

async-profiler: record the alloc event in staging to find hot allocation sites. That shows where objects are born; the leak is where they are kept.
JFR (Java Flight Recorder): low-overhead allocation profiling, available in JDK 11+.
Prometheus + Grafana: alert on old gen % rising for 2h while QPS is flat.

How — fix, verify, prevent

Fix patterns by root cause

Root cause	Fix
Unbounded map/cache	Caffeine/Guava with `maximumSize`, `expireAfterWrite`; or periodic cleanup job
ThreadLocal	`try { ... } finally { threadLocal.remove(); }` in filters/interceptors
Event listener	`unregister` in `@PreDestroy` / servlet destroy; weak listeners only if semantics allow
Session attributes	Shorter timeout; don’t store large graphs in session; use server-side store with TTL
Classloader leak	Remove reference from singleton; avoid JDBC driver registered from wrong loader; restart on redeploy in dev
Generated classes	Limit dynamic codegen; pool script engines; upgrade libs with known metaspace bugs

Example: bounded cache replacement

// Before: unbounded, leaks under load (every distinct key stays forever)
private static final Map<String, byte[]> REPORT_CACHE = new ConcurrentHashMap<>();

// After: bounded + TTL (Caffeine; maximumSize counts entries, not bytes)
private static final Cache<String, byte[]> REPORT_CACHE = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(Duration.ofMinutes(30))
    .build();

Example: ThreadLocal cleanup in a servlet filter

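// Needs jakarta.servlet.* (javax.servlet.* on older containers) and java.io.IOException.
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
    throws IOException, ServletException {
  try {
    String userId = resolveUserId(request); // hypothetical helper, not shown
    RequestContext.set(userId);
    chain.doFilter(request, response);
  } finally {
    RequestContext.clear(); // must call ThreadLocal.remove() inside
  }
}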
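
Example: unregistering a listener (sketch)

The event-listener fix follows the same shape: whatever registers must also unregister. The sketch below uses a toy in-memory bus and AutoCloseable as a stand-in for @PreDestroy or servlet destroy; EventBus, Widget, and every method name here are hypothetical and do not correspond to a specific library.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical demo: a long-lived bus and short-lived widgets that subscribe to it.
class ListenerLeakDemo {

    static class EventBus {
        private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

        // Returns a handle that removes the listener again; the subscriber keeps it for shutdown.
        Runnable register(Consumer<String> listener) {
            listeners.add(listener);
            return () -> listeners.remove(listener);
        }

        int listenerCount() {
            return listeners.size();
        }
    }

    static class Widget implements AutoCloseable {
        private final Runnable unregister;

        Widget(EventBus bus) {
            // The bus now holds a reference to this widget through the method reference.
            this.unregister = bus.register(this::handle);
        }

        void handle(String event) {
            // No-op; a real listener would react to the event here.
        }

        @Override
        public void close() {
            unregister.run(); // drops the bus's reference so the widget can be collected
        }
    }

    // Creates `widgets` short-lived widgets and returns how many listeners the bus still holds.
    static int listenersLeft(int widgets, boolean closeWidgets) {
        EventBus bus = new EventBus();
        for (int i = 0; i < widgets; i++) {
            Widget widget = new Widget(bus);
            if (closeWidgets) {
                widget.close();
            }
        }
        return bus.listenerCount();
    }

    public static void main(String[] args) {
        System.out.println("never closed: " + listenersLeft(100, false));
        System.out.println("closed:       " + listenersLeft(100, true));
    }
}
```

Without close() the bus still references all 100 widgets after they go out of scope; with close() it references none.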

Verify the fix

Deploy to staging with same soak duration (4–8 hours) and flat load generator.
Old gen after GC should plateau, not climb linearly.
Repeat compare-dump test: delta retained heap near zero after plateau.
Optional (advanced): add a soak test to CI that fails the build if old gen after GC exceeds a threshold after a 1h run.
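
For an automated soak check, the post-GC old gen figure can be read in-process through the standard java.lang.management API. This is a sketch under assumptions: the pool-name match ("Old" or "Tenured") is a heuristic that fits common HotSpot collectors but not every one, and the class name and budget check are illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.OptionalLong;

// Hypothetical helper for a soak test: reads old gen usage as of the last collection.
class OldGenAfterGc {

    // Bytes used in the old generation after the most recent collection of that pool,
    // or empty if no matching pool reports collection usage on this JVM.
    static OptionalLong usedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Heuristic (assumption): matches e.g. "G1 Old Gen", "PS Old Gen", "Tenured Gen".
            boolean oldGen = pool.getType() == MemoryType.HEAP
                    && (name.contains("Old") || name.contains("Tenured"));
            MemoryUsage afterGc = oldGen ? pool.getCollectionUsage() : null;
            if (afterGc != null) {
                return OptionalLong.of(afterGc.getUsed());
            }
        }
        return OptionalLong.empty();
    }

    // The pass/fail rule a soak test would apply at the end of the run.
    static boolean withinBudget(long usedBytes, long maxBytes) {
        return usedBytes <= maxBytes;
    }

    public static void main(String[] args) {
        OptionalLong used = usedBytes();
        long budget = 512L * 1024 * 1024; // illustrative budget, not a recommendation
        if (used.isPresent()) {
            System.out.println("old gen after GC: " + used.getAsLong()
                    + " bytes, within budget: " + withinBudget(used.getAsLong(), budget));
        } else {
            System.out.println("no old gen pool with collection usage found");
        }
    }
}
```

A single reading is only a snapshot; the plateau test still needs several readings across the soak, compared against each other.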

Prevention

Code review checklist: any static Map/List must have documented bounds.
Always clear ThreadLocal in pooled-thread environments.
Dashboard: heap after GC, metaspace, allocation rate, pod restart count.
Alert: old gen > 80% for 30 min while QPS stays within ±10% of its 24h median.
Document max in-memory rows for batch jobs; stream DB results instead of findAll().

When not to chase a “leak”

Heap grew because you legitimately raised cache size in config—verify config diff.
One-off batch job loaded 2M rows—fix job, not JVM; heap drops after job ends.
The only fix is more heap because the product has genuinely scaled: that is capacity planning, not a leak hunt.

Interview one-liner

“I confirm a leak with post-GC heap trending up at flat traffic, take two heap dumps spaced apart, compare retained size in MAT, follow dominators to GC roots, then fix the retaining reference (usually an unbounded cache or ThreadLocal) rather than just increasing -Xmx.”

Related scenarios

OutOfMemoryError — when the JVM runs out of room.
Slow after a few hours — often the first visible symptom of a leak.
Frequent GC pauses — when old gen is full of live objects.
Metaspace growth — classloader leaks.