OutOfMemoryError in production

Scenario

Your application suddenly throws java.lang.OutOfMemoryError. Traffic may still be moderate. Pods restart in a loop or one node falls over while others look fine. How do you debug it without guessing?

After reading, you should be able to:

Classify OOM by error message (heap, metaspace, direct buffer, native thread).
Know what to capture before restart destroys evidence.
Distinguish JVM OOM from Kubernetes OOMKilled.
Apply fixes: sizing, leak removal, batch limits, GC tuning, and guardrails.

Why — what “OutOfMemoryError” actually means

The JVM is not one big malloc bucket. Memory is split into regions. An OOM means a specific region could not satisfy an allocation—not always “you need a bigger -Xmx.”

Error (typical)	Region	Common causes
Java heap space	Heap	Leak, unbounded cache, loading huge collections, spike in live objects
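GC overhead limit exceeded	Heap	Heap nearly full; the JVM spends more than 98% of its time in GC while recovering less than 2% of the heap (a check historically enforced by the Parallel collector via `-XX:+UseGCOverheadLimit`; G1 more often ends in Java heap space after repeated full GCs)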
Metaspace	Class metadata	Classloader leak, dynamic proxies, Groovy/bytecode frameworks, hot redeploy without restart
Direct buffer memory	Off-heap (NIO)	Netty, gRPC, large `ByteBuffer` use; `-XX:MaxDirectMemorySize` too low
unable to create new native thread	OS / process limits	Thread explosion, large per-thread stacks (`-Xss`) multiplied by a huge thread count, `ulimit -u` / cgroup pids limit
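
The table above can be turned into a small lookup for log processing or alert routing. The sketch below is a hypothetical helper (the class name `OomClassifier` and its enum are illustrative, not part of any library). It assumes the message substrings used by HotSpot and matches loosely, because exact wording varies between JDK versions.

```java
import java.util.Locale;

/** Hypothetical helper: maps an OutOfMemoryError message to the region named in the table. */
final class OomClassifier {

    enum Region { HEAP, GC_OVERHEAD, METASPACE, DIRECT_BUFFER, NATIVE_THREAD, UNKNOWN }

    /** Classifies by substring so that version-specific prefixes and suffixes still match. */
    static Region classify(String message) {
        if (message == null) {
            return Region.UNKNOWN;
        }
        String m = message.toLowerCase(Locale.ROOT);
        if (m.contains("gc overhead limit exceeded")) return Region.GC_OVERHEAD;
        if (m.contains("java heap space")) return Region.HEAP;
        if (m.contains("metaspace")) return Region.METASPACE;
        if (m.contains("direct buffer memory")) return Region.DIRECT_BUFFER;
        // Older JDKs say "unable to create new native thread"; newer ones drop "new".
        if (m.contains("unable to create") && m.contains("native thread")) return Region.NATIVE_THREAD;
        return Region.UNKNOWN;
    }

    /** Convenience overload for a caught error. */
    static Region classifyError(OutOfMemoryError error) {
        return classify(error.getMessage());
    }
}
```

A caller would typically feed it the message from a log line or from a caught error and branch on the returned region to decide which evidence to collect.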

Sudden vs gradual

Sudden — one huge allocation (multi-GB report export), traffic spike filling session cache, deployment that changed default heap.
Gradual — leak or unbounded growth; heap used % climbs over hours until GC cannot keep up.

Kubernetes OOMKilled is not the same thing

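If the container exceeds its memory limit, the kernel's cgroup OOM killer sends the process SIGKILL and the container exits with code 137 (128 + 9). You may see no Java stack trace, only OOMKilled in kubectl describe pod. The JVM can still report free heap at that moment, because the limit applies to the whole process: -Xmx plus metaspace, thread stacks, and native memory can together exceed the container limit while the heap itself is not full.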

Rule of thumb: container limit ≥ heap (-Xmx) + metaspace + direct memory + thread stacks + native libs + ~25% headroom. Setting -Xmx equal to the container limit without headroom often causes OOMKilled.
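
The rule of thumb is plain arithmetic, so it is easy to check in a sizing script. The sketch below assumes the "~25% headroom" is applied on top of the summed components; the class name and the example numbers in the usage note are hypothetical and should be replaced with measured values.

```java
/** Hypothetical sizing helper for the container-limit rule of thumb. */
final class ContainerMemoryBudget {

    static final long MIB = 1024L * 1024L;

    /** Sum of the components the rule of thumb lists, in bytes, before headroom. */
    static long componentBytes(long heap, long metaspace, long directMemory,
                               int threadCount, long stackBytesPerThread, long nativeLibs) {
        return heap + metaspace + directMemory
                + (long) threadCount * stackBytesPerThread
                + nativeLibs;
    }

    /**
     * Suggested container limit: components plus fractional headroom
     * (0.25 means 25%), rounded up to a whole MiB.
     */
    static long suggestedLimitBytes(long componentBytes, double headroomFraction) {
        long raw = (long) Math.ceil(componentBytes * (1.0 + headroomFraction));
        return ((raw + MIB - 1) / MIB) * MIB;
    }
}
```

For instance, with an assumed 1024 MiB heap, 256 MiB metaspace, 128 MiB direct memory, 100 threads at 1 MiB of stack each, and 64 MiB for native libraries, the components sum to 1572 MiB and the suggested limit is 1965 MiB, which fits inside a 2 GiB container.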

What — check first (in order)

Move fast but preserve evidence. A restart destroys the heap contents, so unless a dump is configured or captured first, the evidence is gone.

Read the exact OOM message and stack trace: classify heap vs metaspace vs direct vs threads, and note the allocating class in the top frames (e.g. byte[], HashMap, Netty buffer).
Check whether it is a JVM OOM or OOMKilled: the app log shows OutOfMemoryError; K8s shows Last State: Terminated, Reason: OOMKilled, exit 137; metrics show container memory at the limit.
Correlate the time with deploys, traffic, and batch jobs: New release? Cron at :00? Marketing email spike? One tenant import?
Grab JVM and GC metrics for the window: heap used vs max, GC pause time, allocation rate, metaspace used. Look for a sawtooth that stops reclaiming (leak) vs a cliff (spike).
Capture a heap dump (if heap OOM): if the process is still alive, jcmd <pid> GC.heap_dump /tmp/heap.hprof. If already configured: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps (path on a persistent volume in K8s).
Capture a thread dump: jcmd <pid> Thread.print or jstack <pid>. It rules out thread explosion and shows what was running during growth.
Enable or pull GC logs: on Java 11+, -Xlog:gc*:file=/logs/gc.log:time,uptime,level,tags. The log shows whether GC was thrashing before the OOM.
Record JVM flags and container limits: jcmd <pid> VM.flags and jcmd <pid> VM.system_properties; compare -Xmx / -Xms to the cgroup memory limit.
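
When shell access to the pod is restricted, the heap dump and thread dump steps above can also be triggered from inside the process, for example behind an admin-only endpoint. The following is a minimal sketch that assumes a HotSpot-based JVM (it uses `com.sun.management.HotSpotDiagnosticMXBean`); the class name `InProcessEvidence` and the file naming scheme are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that captures evidence from inside the running JVM. */
final class InProcessEvidence {

    /** Writes an .hprof dump into the directory; liveOnly=true dumps only reachable objects. */
    static Path dumpHeap(Path directory, boolean liveOnly) {
        try {
            Files.createDirectories(directory);
            // dumpHeap refuses to overwrite, so use a timestamped name.
            Path target = directory.resolve("heap-" + System.currentTimeMillis() + ".hprof");
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(target.toString(), liveOnly);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns name, state, and full stack for every live thread. */
    static String threadDump() {
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Writing a dump pauses the application and needs disk space on the order of the heap size, so the target directory should be on the same persistent path used for `-XX:HeapDumpPath`.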

What to log in your runbook ticket

Timestamp (UTC):
OOM message:
Pod/node:
Image tag / git SHA:
-Xmx -Xms MaxMetaspaceSize MaxDirectMemorySize:
Container memory request/limit:
Traffic QPS vs baseline:
Heap dump path (yes/no):
GC log snippet (last 50 lines):
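
Several fields of this ticket can be filled from inside the JVM through the standard `java.lang.management` beans. The sketch below is one possible way to do that; the class name `TicketSnapshot` and the output labels are hypothetical, and the "Metaspace" pool name and "direct" buffer pool name are assumptions that hold on HotSpot.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.time.Instant;

/** Hypothetical helper that renders JVM-side ticket fields as text. */
final class TicketSnapshot {

    static String capture() {
        StringBuilder sb = new StringBuilder();
        // Instant.toString() is ISO-8601 in UTC.
        sb.append("Timestamp (UTC): ").append(Instant.now()).append('\n');

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        sb.append("Heap used/max (bytes): ")
          .append(heap.getUsed()).append('/').append(heap.getMax()).append('\n');

        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if ("Metaspace".equals(pool.getName()) && usage != null) {
                // A max of -1 means no MaxMetaspaceSize was set.
                sb.append("Metaspace used/max (bytes): ")
                  .append(usage.getUsed()).append('/').append(usage.getMax()).append('\n');
            }
        }

        for (BufferPoolMXBean buffers : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(buffers.getName())) {
                sb.append("Direct buffers used (bytes): ")
                  .append(buffers.getMemoryUsed()).append('\n');
            }
        }

        sb.append("Live threads: ")
          .append(ManagementFactory.getThreadMXBean().getThreadCount()).append('\n');
        sb.append("JVM arguments: ")
          .append(ManagementFactory.getRuntimeMXBean().getInputArguments()).append('\n');
        return sb.toString();
    }
}
```

Logging this snapshot periodically, or from a shutdown hook, leaves at least a partial record when a pod is restarted before anyone can attach to it.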

Analyze the heap dump (heap OOM)

Open in Eclipse MAT or VisualVM.
Run Leak Suspects Report — dominator tree shows what retains the most bytes.
Sort retained heap by class — common culprits: byte[], char[], collections holding millions of entries, cached DTO maps.
Find GC roots path to the big object — who still references it?
Compare two dumps taken 30 minutes apart (if gradual) — growing dominator = leak.

How — fix now, fix for good, prevent

Immediate mitigation (stop the bleeding)

Situation	Action	Caution
Single bad pod	Restart pod; drain traffic if repeatable	Without dump, root cause may be lost
Traffic-driven spike	Scale replicas; rate-limit at gateway	Does not fix leak; spreads load
Known heavy job	Disable feature flag / pause batch	Communicate to stakeholders
OOMKilled mis-sizing	Raise container limit or lower `-Xmx` with headroom	Do not only raise limit without understanding native use
Emergency heap	Temporary `-Xmx` bump after dump captured	Buys time; may hide leak

Durable fixes by root cause

Memory leak: fix the retaining reference (static map, listener never removed, ThreadLocal in a pooled thread). Add a regression test; optionally capture and inspect a heap dump in CI for critical paths.
Unbounded cache: cap the size (Caffeine maximumSize), add a TTL, and weigh entries by bytes rather than by count.
Oversized payloads: stream files; paginate DB queries; reject uploads above a limit at the API.
Metaspace: fix the classloader lifecycle; increase -XX:MaxMetaspaceSize only after a leak is ruled out.
Direct memory: tune the Netty allocator; set -XX:MaxDirectMemorySize; ensure buffers are released.
Thread explosion: use bounded thread pools; go async with backpressure; fix runaway new Thread() calls.
GC thrashing: increase the heap if truly needed; tune G1 (MaxGCPauseMillis, region size); reduce the allocation rate in the hot path.
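
To illustrate the cache fix, here is a minimal sketch of a cache bounded by total bytes rather than entry count. It uses only the standard library so that it is self-contained; in production a library such as Caffeine (with a weigher) would be the more likely choice. The class `ByteBudgetCache` is hypothetical and not thread-safe.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache with a byte budget; not thread-safe. */
final class ByteBudgetCache<K> {

    private final long maxBytes;
    private long currentBytes;
    // accessOrder=true: iteration starts at the least recently used entry.
    private final LinkedHashMap<K, byte[]> entries = new LinkedHashMap<>(16, 0.75f, true);

    ByteBudgetCache(long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        this.maxBytes = maxBytes;
    }

    void put(K key, byte[] value) {
        if (value.length > maxBytes) {
            // Can never fit: drop any stale entry and do not cache.
            remove(key);
            return;
        }
        byte[] previous = entries.put(key, value);
        if (previous != null) {
            currentBytes -= previous.length;
        }
        currentBytes += value.length;
        evictUntilWithinBudget();
    }

    byte[] get(K key) {
        return entries.get(key);
    }

    void remove(K key) {
        byte[] removed = entries.remove(key);
        if (removed != null) {
            currentBytes -= removed.length;
        }
    }

    long currentBytes() {
        return currentBytes;
    }

    int size() {
        return entries.size();
    }

    /** Evicts least recently used entries until the total is back under the budget. */
    private void evictUntilWithinBudget() {
        Iterator<Map.Entry<K, byte[]>> it = entries.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<K, byte[]> eldest = it.next();
            currentBytes -= eldest.getValue().length;
            it.remove();
        }
    }
}
```

The point of weighing by bytes is that a count-based cap gives no bound on memory when a few entries are very large; with a byte budget, the worst-case retained size is known up front and can be included in heap sizing.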

JVM flags worth setting in production

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/dumps/heap.hprof
-XX:+ExitOnOutOfMemoryError
-Xlog:gc*,safepoint:file=/var/log/gc.log:time,uptime,level,tags:filecount=5,filesize=50M

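ExitOnOutOfMemoryError lets the orchestrator restart an unhealthy JVM instead of leaving it limping in GC hell. Mount /var/dumps on an emptyDir or PVC so dumps outlive the crashed container long enough to copy: an emptyDir survives container restarts within the same pod but is deleted with the pod, while a PVC persists across pod deletion and rescheduling.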
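
Flags set in a deployment manifest are easy to lose in a later refactor, so it can help to verify them at startup. The sketch below reads the effective values through `HotSpotDiagnosticMXBean.getVMOption`, which assumes a HotSpot-based JVM; the class name `OomFlagCheck` is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical startup check for boolean -XX flags. */
final class OomFlagCheck {

    /** Returns the given flag names that are unknown to, or not enabled on, this JVM. */
    static List<String> disabledFlags(String... flagNames) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        List<String> disabled = new ArrayList<>();
        for (String name : flagNames) {
            String value;
            try {
                value = bean.getVMOption(name).getValue();
            } catch (IllegalArgumentException unknownFlag) {
                // The JVM does not know this option at all.
                value = "false";
            }
            if (!"true".equals(value)) {
                disabled.add(name);
            }
        }
        return disabled;
    }
}
```

A service could call `OomFlagCheck.disabledFlags("HeapDumpOnOutOfMemoryError", "ExitOnOutOfMemoryError")` during startup and log a warning when the returned list is not empty.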

Kubernetes example (memory aligned)

resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2Gi"
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xms1g -Xmx1400m -XX:MaxMetaspaceSize=256m"

This leaves about 650 MiB (the 2048 MiB limit minus the 1400 MiB heap) for metaspace, thread stacks, direct buffers, and JVM overhead inside the 2 GiB limit.
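
The same alignment can be checked from inside the container at startup. The sketch below reads the cgroup memory limit and computes the fraction of it left outside the heap. The two file paths are the usual cgroup v2 and v1 locations and are an assumption about the container runtime; the class name `CgroupHeadroom` and the "treat very large values as unlimited" cutoff are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;

/** Hypothetical helper comparing the JVM max heap with the container memory limit. */
final class CgroupHeadroom {

    // Assumed standard mount points: cgroup v2 first, then cgroup v1.
    private static final Path[] LIMIT_FILES = {
        Path.of("/sys/fs/cgroup/memory.max"),
        Path.of("/sys/fs/cgroup/memory/memory.limit_in_bytes")
    };

    /** Parses a limit file's content; "max" or an implausibly large value means no limit. */
    static OptionalLong parseLimit(String raw) {
        String value = raw.trim();
        if (value.isEmpty() || value.equals("max")) {
            return OptionalLong.empty();
        }
        try {
            long bytes = Long.parseLong(value);
            // cgroup v1 reports "unlimited" as a number close to Long.MAX_VALUE.
            return bytes <= 0 || bytes >= Long.MAX_VALUE / 2
                    ? OptionalLong.empty()
                    : OptionalLong.of(bytes);
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** Returns the container memory limit, or empty when none is visible. */
    static OptionalLong containerLimitBytes() {
        for (Path file : LIMIT_FILES) {
            try {
                if (Files.isReadable(file)) {
                    OptionalLong limit = parseLimit(Files.readString(file));
                    if (limit.isPresent()) {
                        return limit;
                    }
                }
            } catch (IOException ignored) {
                // Try the next candidate path.
            }
        }
        return OptionalLong.empty();
    }

    /** Fraction of the container limit that is not claimed by the max heap. */
    static double headroomFraction(long limitBytes, long maxHeapBytes) {
        return (double) (limitBytes - maxHeapBytes) / limitBytes;
    }
}
```

In a service, `headroomFraction(limit, Runtime.getRuntime().maxMemory())` could be logged at startup; for the manifest above (2048 MiB limit, 1400 MiB heap) the fraction is about 0.32.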

Prevention and guardrails

Alert on heap used > 85% for 5 min, GC pause p99, metaspace growth, container memory > 90% of limit.
Load test with production-like heap (same -Xmx); soak test 4–8 hours to catch leaks.
Code review rules — no unbounded in-memory lists from DB; close streams; bounded caches documented.
Capacity dashboard — heap after GC vs max, allocation rate, pod restarts (OOMKilled count).
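
As an in-process complement to external alerting, the JVM can raise a notification when heap occupancy after a collection stays above a threshold. The sketch below uses the standard `MemoryPoolMXBean` collection usage threshold. It fires on each threshold crossing after a GC rather than implementing the "for 5 min" window, which would still belong in the monitoring system; the class name `HeapAfterGcAlarm` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

/** Hypothetical guardrail: notify when post-GC heap usage crosses a fraction of max. */
final class HeapAfterGcAlarm {

    /**
     * Arms every heap pool that supports a collection usage threshold and has a
     * defined max. Returns the number of pools armed (0 on collectors without support).
     */
    static int arm(double fraction, Runnable onExceeded) {
        if (fraction <= 0.0 || fraction >= 1.0) {
            throw new IllegalArgumentException("fraction must be between 0 and 1");
        }
        int armed = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null || usage.getMax() <= 0) {
                continue; // pool invalid or max undefined
            }
            if (pool.getType() == MemoryType.HEAP && pool.isCollectionUsageThresholdSupported()) {
                pool.setCollectionUsageThreshold((long) (usage.getMax() * fraction));
                armed++;
            }
        }
        NotificationListener listener = (notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_COLLECTION_THRESHOLD_EXCEEDED
                    .equals(notification.getType())) {
                onExceeded.run();
            }
        };
        // The platform MemoryMXBean is documented to be a NotificationEmitter.
        ((NotificationEmitter) ManagementFactory.getMemoryMXBean())
                .addNotificationListener(listener, null, null);
        return armed;
    }
}
```

The callback runs on a JVM notification thread, so it should do something cheap such as incrementing a metric or logging, and leave heavier reactions (dump capture, load shedding) to another thread.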

Interview one-liner

“I classify the OOM by message, confirm whether it’s JVM heap or cgroup OOMKilled, capture heap dump and GC logs before restart, analyze dominators in MAT, then fix the retaining path or sizing—not just bump -Xmx.”

Related scenarios

Suspected memory leak — confirm with two heap dumps and MAT dominators.
Frequent GC pauses — tuning after heap is understood.
Crash without clear errors — OOMKilled, exit 137, hs_err_pid.
Kubernetes debugging — pod events and resource limits.