OutOfMemoryError in production

Scenario

Your application suddenly throws java.lang.OutOfMemoryError. Traffic may still be moderate. Pods restart in a loop or one node falls over while others look fine. How do you debug it without guessing?

After reading, you should be able to:

Why — what “OutOfMemoryError” actually means

The JVM is not one big malloc bucket. Memory is split into regions. An OOM means a specific region could not satisfy an allocation—not always “you need a bigger -Xmx.”

| Error (typical) | Region | Common causes |
| --- | --- | --- |
| Java heap space | Heap | Leak, unbounded cache, loading huge collections, spike in live objects |
| GC overhead limit exceeded | Heap | Heap nearly full; GC spends more than 98% of its time while reclaiming less than 2% of the heap (historically a Parallel GC check; G1 usually reports Java heap space instead) |
| Metaspace | Class metadata | Classloader leak, dynamic proxies, Groovy/bytecode frameworks, hot redeploy without restart |
| Direct buffer memory | Off-heap (NIO) | Netty, gRPC, large ByteBuffer use; -XX:MaxDirectMemorySize too low |
| unable to create new native thread | OS / process limits | Thread explosion, small stack + huge thread count, ulimit / cgroup pids limit |
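
The mapping in the table above can be sketched as a small lookup, for example to tag alerts by region. This is a minimal illustration, not an exhaustive classifier: the class name OomClassifier and the Region enum are hypothetical, and the substrings are taken from the messages listed in the table.

```java
import java.util.Locale;

/** Maps an OutOfMemoryError message to the memory region it most likely points at. */
final class OomClassifier {

    /** Regions from the table above; UNKNOWN covers messages the table does not list. */
    enum Region { HEAP, METASPACE, DIRECT_MEMORY, NATIVE_THREADS, UNKNOWN }

    static Region classify(String message) {
        if (message == null) {
            return Region.UNKNOWN;
        }
        // Compare case-insensitively so log formatting differences do not matter.
        String m = message.toLowerCase(Locale.ROOT);
        if (m.contains("java heap space") || m.contains("gc overhead limit exceeded")) {
            return Region.HEAP;
        }
        if (m.contains("metaspace")) {
            return Region.METASPACE;
        }
        if (m.contains("direct buffer memory")) {
            return Region.DIRECT_MEMORY;
        }
        if (m.contains("native thread")) {
            return Region.NATIVE_THREADS;
        }
        return Region.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("java.lang.OutOfMemoryError: Java heap space"));
        System.out.println(classify("java.lang.OutOfMemoryError: unable to create new native thread"));
    }
}
```

Anything that comes back as UNKNOWN still needs a human to read the full message and stack trace.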

Sudden vs gradual

Kubernetes OOMKilled is not the same thing

If the container exceeds its memory limit, the Linux kernel's cgroup OOM killer sends SIGKILL to the process, which surfaces as exit code 137 (128 + 9). You may see no Java stack trace, only OOMKilled in kubectl describe pod. The JVM can still believe it has headroom, because -Xmx caps only the heap: heap plus metaspace, thread stacks, and native memory can exceed the container limit while the heap itself is not full.

Rule of thumb: container limit ≥ heap (-Xmx) + metaspace + direct memory + thread stacks + native libs + ~25% headroom. Setting -Xmx equal to the container limit without headroom often causes OOMKilled.
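
The rule of thumb can be worked through as plain arithmetic. The sketch below is illustrative only: the class name MemoryBudget, the component sizes, and the reading of "25% headroom" as 25% on top of the summed footprint are assumptions, not measurements from a real service.

```java
/** Worked example of the container-limit rule of thumb (all sizes are assumed). */
final class MemoryBudget {

    static final long MIB = 1024L * 1024L;

    /** Sums the main JVM memory consumers, in bytes. */
    static long estimatedFootprint(long heap, long metaspace, long direct,
                                   int threads, long stackPerThread, long nativeLibs) {
        return heap + metaspace + direct + (long) threads * stackPerThread + nativeLibs;
    }

    /** Footprint plus a percentage of headroom on top, in bytes. */
    static long recommendedLimit(long footprint, int headroomPercent) {
        return footprint + footprint * headroomPercent / 100;
    }

    public static void main(String[] args) {
        // Assumed values: 1024 MiB heap, 256 MiB metaspace, 128 MiB direct memory,
        // 100 threads with 1 MiB stacks, 92 MiB for native libraries.
        long footprint = estimatedFootprint(
                1024 * MIB, 256 * MIB, 128 * MIB, 100, 1 * MIB, 92 * MIB);
        long limit = recommendedLimit(footprint, 25);
        long containerLimit = 2048 * MIB; // a 2Gi container limit

        System.out.println("footprint MiB   = " + footprint / MIB);
        System.out.println("recommended MiB = " + limit / MIB);
        System.out.println("fits in 2Gi     = " + (limit <= containerLimit));
    }
}
```

With these assumed numbers the footprint is 1600 MiB and the recommended limit is 2000 MiB, which fits under a 2Gi limit. Replace the inputs with values measured from your own service before drawing conclusions.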

What — check first (in order)

Move fast but preserve evidence. A restart discards the heap contents, so the root cause is lost unless a dump was captured first or automatic dumps are configured.

  1. Read the exact OOM message and stack trace. Classify heap vs metaspace vs direct vs threads. Note the allocating class in the top frames (e.g. byte[], HashMap, Netty buffer).
  2. Check if it is JVM OOM or OOMKilled. App log: OutOfMemoryError. K8s: Last State: Terminated, Reason: OOMKilled, exit 137. Metrics: container memory at limit.
  3. Correlate time with deploy, traffic, and batch jobs. New release? Cron at :00? Marketing email spike? One tenant import?
  4. Grab JVM and GC metrics for the window. Heap used vs max, GC pause time, allocation rate, metaspace used. Look for a sawtooth that stops reclaiming (leak) vs a cliff (spike).
  5. Capture a heap dump (if heap OOM). If the process is still alive: jcmd <pid> GC.heap_dump /tmp/heap.hprof. If already configured: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps (path on a persistent volume in K8s).
  6. Capture a thread dump. jcmd <pid> Thread.print or jstack <pid>: rules out thread explosion and shows what was running during growth.
  7. Enable or pull GC logs. Java 11+: -Xlog:gc*:file=/logs/gc.log:time,uptime,level,tags. Shows whether GC was thrashing before the OOM.
  8. Record JVM flags and container limits. jcmd <pid> VM.flags and jcmd <pid> VM.system_properties; compare -Xmx / -Xms to the cgroup memory limit.
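
When jcmd is not at hand, much of the data in steps 4, 6, and 8 can also be read from inside the JVM through the standard java.lang.management beans. The following is a minimal sketch under assumptions: the class name JvmMemorySnapshot is hypothetical, the cgroup file paths are the usual v2 and v1 locations and may differ on your nodes, and the Metaspace pool is matched by name as HotSpot reports it.

```java
import java.io.IOException;
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.OptionalLong;

/** Collects heap, metaspace, direct buffer, thread, flag, and cgroup limit data. */
final class JvmMemorySnapshot {

    // Assumed locations: cgroup v2 first, then cgroup v1.
    private static final List<String> CGROUP_LIMIT_FILES = List.of(
            "/sys/fs/cgroup/memory.max",
            "/sys/fs/cgroup/memory/memory.limit_in_bytes");

    /** Parses a cgroup limit file; "max" (cgroup v2, unlimited) yields empty. */
    static OptionalLong parseLimit(String raw) {
        String s = raw.trim();
        if (s.isEmpty() || s.equals("max")) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(s));
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** Returns the container memory limit in bytes if a cgroup file is readable. */
    static OptionalLong containerLimitBytes() {
        for (String file : CGROUP_LIMIT_FILES) {
            Path p = Path.of(file);
            try {
                if (Files.isReadable(p)) {
                    OptionalLong limit = parseLimit(Files.readString(p));
                    if (limit.isPresent()) {
                        return limit;
                    }
                }
            } catch (IOException | RuntimeException e) {
                // Not readable on this platform; try the next candidate.
            }
        }
        return OptionalLong.empty();
    }

    static String describe() {
        StringBuilder sb = new StringBuilder();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        sb.append("heap used=").append(heap.getUsed())
          .append(" max=").append(heap.getMax()).append('\n');
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage != null && pool.getName().contains("Metaspace")) {
                sb.append("metaspace used=").append(usage.getUsed()).append('\n');
            }
        }
        for (BufferPoolMXBean buf : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            sb.append("buffer pool ").append(buf.getName())
              .append(" used=").append(buf.getMemoryUsed()).append('\n');
        }
        sb.append("live threads=")
          .append(ManagementFactory.getThreadMXBean().getThreadCount()).append('\n');
        sb.append("jvm args=")
          .append(ManagementFactory.getRuntimeMXBean().getInputArguments()).append('\n');
        OptionalLong limit = containerLimitBytes();
        sb.append("container limit=")
          .append(limit.isPresent() ? String.valueOf(limit.getAsLong()) : "unknown").append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Logging this snapshot periodically, or from a health endpoint, gives you the heap max and container limit side by side without shelling into the pod. It does not replace a heap dump or GC logs.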

What to log in your runbook ticket

Timestamp (UTC):
OOM message:
Pod/node:
Image tag / git SHA:
-Xmx -Xms MaxMetaspaceSize MaxDirectMemorySize:
Container memory request/limit:
Traffic QPS vs baseline:
Heap dump path (yes/no):
GC log snippet (last 50 lines):

Analyze the heap dump (heap OOM)

  1. Open in Eclipse MAT or VisualVM.
  2. Run the Leak Suspects report (MAT), then open the dominator tree to see which objects retain the most bytes.
  3. Sort retained heap by class — common culprits: byte[], char[], collections holding millions of entries, cached DTO maps.
  4. Find GC roots path to the big object — who still references it?
  5. Compare two dumps taken 30 minutes apart (if gradual) — growing dominator = leak.
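
The two-dump comparison in step 5 boils down to a per-class diff of retained bytes. The sketch below assumes you have already exported a class-to-bytes histogram from each dump into a map; the class name HistogramDiff, the Growth holder, and the sample figures are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Ranks classes by how much their retained size grew between two heap dumps. */
final class HistogramDiff {

    /** One class and the number of bytes it grew by between the two dumps. */
    static final class Growth {
        final String className;
        final long deltaBytes;

        Growth(String className, long deltaBytes) {
            this.className = className;
            this.deltaBytes = deltaBytes;
        }

        @Override
        public String toString() {
            return className + " +" + deltaBytes + " bytes";
        }
    }

    /** Returns up to 'limit' classes that grew, largest growth first. */
    static List<Growth> topGrowth(Map<String, Long> earlier, Map<String, Long> later, int limit) {
        List<Growth> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : later.entrySet()) {
            // A class missing from the earlier dump counts as starting at zero.
            long before = earlier.getOrDefault(e.getKey(), 0L);
            long delta = e.getValue() - before;
            if (delta > 0) {
                result.add(new Growth(e.getKey(), delta));
            }
        }
        result.sort(Comparator.comparingLong((Growth g) -> g.deltaBytes).reversed());
        return result.subList(0, Math.min(limit, result.size()));
    }

    public static void main(String[] args) {
        // Hypothetical histograms taken 30 minutes apart.
        Map<String, Long> first = Map.of("byte[]", 100_000_000L, "java.lang.String", 50_000_000L);
        Map<String, Long> second = Map.of("byte[]", 400_000_000L, "java.lang.String", 48_000_000L);
        topGrowth(first, second, 5).forEach(System.out::println);
    }
}
```

A class that keeps growing across several intervals is a leak candidate. Confirm it by following its path to GC roots in the dump rather than relying on the diff alone.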

How — fix now, fix for good, prevent

Immediate mitigation (stop the bleeding)

| Situation | Action | Caution |
| --- | --- | --- |
| Single bad pod | Restart pod; drain traffic if repeatable | Without dump, root cause may be lost |
| Traffic-driven spike | Scale replicas; rate-limit at gateway | Does not fix leak; spreads load |
| Known heavy job | Disable feature flag / pause batch | Communicate to stakeholders |
| OOMKilled mis-sizing | Raise container limit or lower -Xmx with headroom | Do not only raise limit without understanding native use |
| Emergency heap | Temporary -Xmx bump after dump captured | Buys time; may hide leak |

Durable fixes by root cause

JVM flags worth setting in production

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/dumps/heap.hprof
-XX:+ExitOnOutOfMemoryError
-Xlog:gc*,safepoint:file=/var/log/gc.log:time,uptime,level,tags:filecount=5,filesize=50M

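ExitOnOutOfMemoryError lets the orchestrator restart an unhealthy JVM instead of letting it limp along in back-to-back GC cycles. Mount /var/dumps on an emptyDir or a PVC so dumps outlive the crashed container long enough to copy: an emptyDir survives container restarts but is deleted with the pod, while a PVC also survives pod deletion.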

Kubernetes example (memory aligned)

resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2Gi"
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xms1g -Xmx1400m -XX:MaxMetaspaceSize=256m"

Roughly 650 MiB (2048 MiB limit minus the 1400 MiB heap) remains for metaspace, threads, direct buffers, and JVM overhead inside the 2 GiB limit.

Prevention and guardrails

Interview one-liner

“I classify the OOM by message, confirm whether it’s JVM heap or cgroup OOMKilled, capture heap dump and GC logs before restart, analyze dominators in MAT, then fix the retaining path or sizing—not just bump -Xmx.”

Related scenarios