Memory leak in production

Scenario

You suspect a memory leak: heap usage climbs over hours, GC runs more often but reclaims less, and eventually you hit OutOfMemoryError or the pod is OOMKilled. Traffic is flat. How do you prove it is a leak, and how do you find what is holding the memory?

After reading, you should be able to:

Related: OutOfMemoryError in production (when the JVM finally fails).

Why — what a “memory leak” is in Java

Java does not leak memory the way C does with forgotten free(). Objects become eligible for GC when no GC root can reach them. A “leak” means your code still holds references—directly or indirectly—so objects that should be dead stay alive forever (or until redeploy).
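
To make the reachability rule concrete, here is a minimal sketch. The class and field names are hypothetical and not from any real codebase. It parks a buffer in a static list and keeps only a weak handle to it: while the static list holds the buffer, no GC can reclaim it.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; names are illustrative only.
class ReachabilityDemo {
    // A static field stays reachable from a GC root for as long as its class is loaded.
    static final List<byte[]> RETAINED = new ArrayList<>();

    // Allocates a buffer, stores it in the static list, and returns only a weak handle.
    static WeakReference<byte[]> allocateAndRetain() {
        byte[] buffer = new byte[1024];
        RETAINED.add(buffer);
        return new WeakReference<>(buffer);
    }

    public static void main(String[] args) {
        WeakReference<byte[]> handle = allocateAndRetain();
        System.gc(); // only a hint, but a strongly reachable object is never collected
        System.out.println("alive while retained: " + (handle.get() != null));

        RETAINED.clear(); // dropping the reference is what makes the buffer collectable
        System.gc();
        // Not guaranteed either way: System.gc() is a request, not a command.
        System.out.println("alive after clear (may vary): " + (handle.get() != null));
    }
}
```

The second printout is deliberately left unasserted: once the reference is dropped the buffer is eligible for collection, but when the collector actually reclaims it is up to the JVM.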

Leak vs not a leak

Pattern | Heap after GC | Usually
Leak | Old gen baseline keeps rising | Unbounded collection, missing cleanup
Legitimate growth | Rises, then plateaus | Cache fills to configured max; new feature holds more per request
Traffic-correlated | Tracks QPS, drops when quiet | Session map, request-scoped data; not a leak if cleared
Metaspace leak | Heap OK; metaspace climbs | Classloader not collected (hot redeploy, OSGi-style plugins)

Common retaining causes

Mental model: A leak is a reference management bug, not “GC is broken.” Fix who points at the object, not only heap size.

What — confirm the leak, then find the holder

Phase A — Is it really a leak?

  1. Plot heap used after full GC over 24h. Stair-step up with no plateau at steady traffic → suspect leak. Use JMX/Micrometer jvm.memory.used (old gen) or GC log “heap after GC” lines.
  2. Compare to traffic and deploys. Step change at deploy → config or code change. Linear growth with flat QPS → leak. Growth only during peak → may be request-scoped accumulation.
  3. Check metaspace separately. If only metaspace rises → classloader leak, not heap object leak. Different tools and fix.
  4. Rule out container limit creep. RSS at limit but heap moderate → native/direct/thread stacks; not solved by MAT heap analysis alone.
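
For steps 1 and 3, the post-GC numbers can also be read in-process through the standard java.lang.management API. The sketch below is one possible way to sample them; the class name is hypothetical, and the pool names it sees depend on the collector (for example "G1 Old Gen" or "PS Old Gen"). The "Metaspace" pool name is a HotSpot convention and is an assumption here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of any library.
class PostGcSnapshot {
    // Bytes used in each heap pool as measured after the most recent collection of that pool.
    static Map<String, Long> heapAfterLastGc() {
        Map<String, Long> result = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) continue;
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported by this pool
            if (afterGc != null) result.put(pool.getName(), afterGc.getUsed());
        }
        return result;
    }

    // Current usage of the pool named "Metaspace" (HotSpot naming, assumed); 0 if absent.
    static long metaspaceUsed() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.NON_HEAP && "Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage(); // may be null if the pool is no longer valid
                return usage == null ? 0 : usage.getUsed();
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("heap after last GC: " + heapAfterLastGc());
        System.out.println("metaspace used:     " + metaspaceUsed());
    }
}
```

Logging these values every few minutes gives the same "heap after GC" series as the GC log, and lets you watch metaspace separately from heap.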
# Quick live histogram (no full dump) — run twice 30 min apart
jcmd <pid> GC.class_histogram | head -40

# If the instance count for one class (e.g. com.app.SessionData) has doubled → strong signal
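
Comparing two histograms by eye is error-prone, so here is a minimal sketch of automating the comparison. It assumes the usual jcmd row layout of "rank: instances bytes class-name"; the class and method names are hypothetical. It reports classes whose instance count grew by at least a given factor.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for diffing two saved GC.class_histogram outputs.
class HistogramDiff {
    // Matches rows like "   1:   2100   134400  com.app.SessionData"; header and total lines do not match.
    private static final Pattern ROW =
        Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    // Maps class name to instance count for every data row in the histogram text.
    static Map<String, Long> parseInstanceCounts(String histogram) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : histogram.split("\\R")) {
            Matcher m = ROW.matcher(line);
            if (m.find()) counts.put(m.group(3), Long.parseLong(m.group(1)));
        }
        return counts;
    }

    // Classes present in both histograms whose count grew by at least `factor`,
    // ordered by absolute growth, largest first.
    static List<String> grownBy(String before, String after, double factor) {
        Map<String, Long> b = parseInstanceCounts(before);
        Map<String, Long> a = parseInstanceCounts(after);
        List<String> grown = new ArrayList<>();
        for (Map.Entry<String, Long> e : a.entrySet()) {
            long old = b.getOrDefault(e.getKey(), 0L);
            if (old > 0 && e.getValue() >= old * factor) grown.add(e.getKey());
        }
        grown.sort(Comparator.comparingLong((String c) -> a.get(c) - b.get(c)).reversed());
        return grown;
    }
}
```

A class that appears in this list across several consecutive samples at flat traffic is a good first candidate to inspect in a heap dump.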

Phase B — Capture evidence

  1. Heap dump #1 — after service warm, baseline traffic: jcmd <pid> GC.heap_dump /dumps/heap-baseline.hprof
  2. Wait 30–60 minutes (or until old gen grows noticeably) at stable load
  3. Heap dump #2: heap-growth.hprof. Optionally trigger a full GC before the dump: jcmd <pid> GC.run, then dump
  4. Thread dump — same timestamps; count threads, spot custom pool growth
  5. GC logs — confirm old gen post-GC size trend between dumps

In Kubernetes, schedule dumps to a mounted volume or use kubectl exec before pod restart. Automate with cron sidecar only if dumps are rotated—.hprof files are large.
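
If shelling into the pod is awkward, a dump can also be triggered from inside the application, for example behind an admin-only endpoint. The following is a sketch under the assumption of a HotSpot-based JVM, where com.sun.management.HotSpotDiagnosticMXBean is available. The class name, file naming, and directory are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; assumes a HotSpot JVM.
class DumpCapture {
    // Writes heap-<label>.hprof into `directory` and returns its path.
    // live=true dumps only reachable objects, which is what a leak comparison needs.
    // The call fails if the target file already exists, so use a fresh label per dump.
    static Path dumpHeap(Path directory, String label) {
        Path target = directory.resolve("heap-" + label + ".hprof");
        try {
            Files.createDirectories(directory);
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(target.toString(), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    // Number of live threads right now; record it next to each dump to spot pool growth.
    static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }
}
```

Dumping pauses the application while the file is written, so the same caution applies as with jcmd: take dumps deliberately, and make sure the target volume has room for a file roughly the size of the live heap.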

Phase C — Analyze in Eclipse MAT

  1. Open both dumps → Compare Basket (or “Compare to another heap dump”).
  2. Sort by delta retained heap — which types grew most between dumps?
  3. On the growth dump, open Dominator tree for the top class.
  4. Path to GC roots → exclude weak/soft refs first; find strongest path (often static field, thread local, or cache singleton).
  5. Run Leak Suspects Report on the second dump for automated hints.
  6. Inspect Duplicate strings / char[] if leak is many similar keys (bad cache keys).

What the retaining path often looks like

java.lang.Thread
  └─ ThreadLocalMap
       └─ ThreadLocal$Entry
            └─ YourContext.userCache
                 └─ HashMap$Node[] 
                      └─ millions of SessionDto (LEAK)
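
A path like this typically arises when a pooled thread outlives the requests it serves. The sketch below reproduces the pattern in miniature with hypothetical names: a per-thread map that is never removed keeps growing across tasks on the same worker thread, while the same code with remove() in a finally block does not.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical reproduction; USER_CACHE stands in for YourContext.userCache in the path above.
class ThreadLocalLeakDemo {
    static final ThreadLocal<Map<Integer, byte[]>> USER_CACHE =
        ThreadLocal.withInitial(HashMap::new);

    // Runs `requests` tasks on a single pooled thread and returns the cache size
    // observed by the last task.
    static int cacheSizeAfter(int requests, boolean cleanUp) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            int last = 0;
            for (int i = 0; i < requests; i++) {
                final int id = i;
                last = pool.submit(() -> {
                    try {
                        Map<Integer, byte[]> cache = USER_CACHE.get();
                        cache.put(id, new byte[128]); // simulated per-request data
                        return cache.size();
                    } finally {
                        if (cleanUp) USER_CACHE.remove(); // detaches the map from the worker thread
                    }
                }).get();
            }
            return last;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without remove(): " + cacheSizeAfter(100, false)); // prints 100
        System.out.println("with remove():    " + cacheSizeAfter(100, true));  // prints 1
    }
}
```

In a real server the worker threads live as long as the process, so the map in the first case only ever grows.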

Lightweight profiling (without full dump)

How — fix, verify, prevent

Fix patterns by root cause

Root cause | Fix
Unbounded map/cache | Caffeine/Guava with maximumSize, expireAfterWrite; or periodic cleanup job
ThreadLocal | try { ... } finally { threadLocal.remove(); } in filters/interceptors
Event listener | Unregister in @PreDestroy / servlet destroy; weak listeners only if semantics allow
Session attributes | Shorter timeout; don’t store large graphs in session; use server-side store with TTL
Classloader leak | Remove reference from singleton; avoid JDBC driver registered from wrong loader; restart on redeploy in dev
Generated classes | Limit dynamic codegen; pool script engines; upgrade libs with known metaspace bugs
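
For the event listener row, one way to make unregistering hard to forget is to have registration return a handle that removes the listener when closed. This is a minimal sketch with a hypothetical EventBus, not the API of any particular framework.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical event bus; illustrates the unregister-on-close pattern only.
class EventBus {
    // Handle returned by register(); close() is narrowed so it throws no checked exception.
    interface Registration extends AutoCloseable {
        @Override
        void close();
    }

    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

    // Adds the listener and returns a handle whose close() removes it again,
    // so the bus no longer retains the listener or anything it captures.
    Registration register(Consumer<String> listener) {
        listeners.add(listener);
        return () -> listeners.remove(listener);
    }

    // Delivers the event to every currently registered listener.
    void publish(String event) {
        for (Consumer<String> listener : listeners) listener.accept(event);
    }

    int listenerCount() {
        return listeners.size();
    }
}
```

A component can store the Registration in a field and close it from its @PreDestroy or destroy method; short-lived subscribers can use try-with-resources.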

Example: bounded cache replacement

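// Before: unbounded. Entries are never evicted, so the map grows with every new key.
private static final Map<String, byte[]> REPORT_CACHE = new ConcurrentHashMap<>();

// After: bounded + TTL (replaces the declaration above)
// Requires: com.github.benmanes.caffeine.cache.Cache,
//           com.github.benmanes.caffeine.cache.Caffeine, java.time.Duration
private static final Cache<String, byte[]> REPORT_CACHE = Caffeine.newBuilder()
    .maximumSize(10_000)                       // evict once more than 10,000 entries are held
    .expireAfterWrite(Duration.ofMinutes(30))  // drop entries 30 minutes after they were written
    .build();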
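
Where adding a caching library is not an option, a size bound can be sketched with the JDK alone. Note that this is a different technique from the Caffeine example: a LinkedHashMap in access order evicting its least recently used entry. It has no TTL, and the class name is hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical JDK-only alternative: size-bounded LRU map, no time-based expiry.
class BoundedCache {
    // Returns a thread-safe map that holds at most `maxEntries` entries.
    static <K, V> Map<K, V> lru(int maxEntries) {
        // accessOrder=true makes get() move an entry to the "most recent" end.
        return Collections.synchronizedMap(new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after each put; returning true evicts the least recently used entry.
                return size() > maxEntries;
            }
        });
    }

    public static void main(String[] args) {
        Map<String, byte[]> cache = lru(2);
        cache.put("a", new byte[1]);
        cache.put("b", new byte[1]);
        cache.get("a");              // "a" is now the most recently used
        cache.put("c", new byte[1]); // evicts "b"
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

The synchronized wrapper serializes all access, so this suits low-contention caches; under heavy concurrent load a purpose-built cache such as the one above is the better fit.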

Example: ThreadLocal cleanup in a servlet filter

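// Requires the Servlet API types ServletRequest, ServletResponse, FilterChain, ServletException
// and java.io.IOException. userId is assumed to be resolved elsewhere in the filter.
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
    throws IOException, ServletException {
  try {
    RequestContext.set(userId);
    chain.doFilter(request, response);
  } finally {
    // Runs even if the chain throws, so the pooled thread never keeps the previous user's context.
    RequestContext.clear(); // must call ThreadLocal.remove() inside
  }
}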
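
The filter relies on RequestContext.clear() actually calling remove(). The article does not show that class, so the following is only a sketch of what such a holder might look like; the runAs helper is an additional assumption that wraps the set/clear pair for non-servlet code paths.

```java
// Hypothetical holder; the real RequestContext is not shown in the article.
class RequestContext {
    private static final ThreadLocal<String> USER_ID = new ThreadLocal<>();

    static void set(String userId) {
        USER_ID.set(userId);
    }

    static String get() {
        return USER_ID.get();
    }

    // remove() deletes the thread's entry entirely; set(null) would leave the entry in place.
    static void clear() {
        USER_ID.remove();
    }

    // Runs `body` with the given user bound to the current thread and always clears afterwards.
    static void runAs(String userId, Runnable body) {
        set(userId);
        try {
            body.run();
        } finally {
            clear();
        }
    }
}
```

Using runAs in background jobs and message consumers gives them the same guarantee the filter gives to HTTP requests.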

Verify the fix

  1. Deploy to staging with same soak duration (4–8 hours) and flat load generator.
  2. Old gen after GC should plateau, not climb linearly.
  3. Repeat compare-dump test: delta retained heap near zero after plateau.
  4. Optional (advanced): run a 1h soak test in CI and fail the build if old gen after GC exceeds a threshold.
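
The plateau check in steps 2 and 4 can be expressed as a small function over the post-GC samples collected during the soak. This is a sketch: the class name and the growth threshold are hypothetical and should be tuned to your service. It fits a least-squares line through the samples and treats the run as plateaued when the fitted growth over the whole window stays under a fraction of the first sample.

```java
// Hypothetical soak-test helper; samples are post-GC old gen sizes taken at equal intervals.
class SoakCheck {
    // Least-squares slope of the samples, in bytes per sample interval.
    static double slopePerSample(long[] usedAfterGc) {
        int n = usedAfterGc.length;
        if (n < 2) throw new IllegalArgumentException("need at least two samples");
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (long v : usedAfterGc) meanY += v;
        meanY /= n;
        double num = 0;
        double den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (usedAfterGc[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    // True if the fitted growth across the window is at most maxGrowthFraction of the first sample.
    static boolean hasPlateaued(long[] usedAfterGc, double maxGrowthFraction) {
        double fittedGrowth = slopePerSample(usedAfterGc) * (usedAfterGc.length - 1);
        return fittedGrowth <= maxGrowthFraction * usedAfterGc[0];
    }

    public static void main(String[] args) {
        long[] leaking = {100, 110, 120, 130, 140};
        long[] steady = {140, 141, 140, 141, 140};
        System.out.println(hasPlateaued(leaking, 0.05)); // prints false
        System.out.println(hasPlateaued(steady, 0.05));  // prints true
    }
}
```

Start sampling only after warm-up; otherwise the normal ramp at startup will look like growth.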

Prevention

When not to chase a “leak”

Interview one-liner

“I confirm a leak with post-GC heap trending up at flat traffic, take two heap dumps spaced apart, compare retained size in MAT, follow dominators to GC roots, then fix the retaining reference—usually an unbounded cache or ThreadLocal—not just increase -Xmx.”

Related scenarios