OutOfMemoryError in production
Scenario
Your application suddenly throws java.lang.OutOfMemoryError. Traffic may still be moderate. Pods restart in a loop or one node falls over while others look fine. How do you debug it without guessing?
After reading, you should be able to:
- Classify OOM by error message (heap, metaspace, direct buffer, native thread).
- Know what to capture before restart destroys evidence.
- Distinguish JVM OOM from Kubernetes OOMKilled.
- Apply fixes: sizing, leak, batch limits, GC tuning, and guardrails.
Why — what “OutOfMemoryError” actually means
The JVM is not one big malloc bucket. Memory is split into regions. An OOM means a specific region could not satisfy an allocation—not always “you need a bigger -Xmx.”
| Error (typical) | Region | Common causes |
|---|---|---|
| Java heap space | Heap | Leak, unbounded cache, loading huge collections, spike in live objects |
| GC overhead limit exceeded | Heap | Heap nearly full; GC spends >98% of its time reclaiming <2% of the heap (typically seen with the Parallel collector; G1 has historically not implemented this check) |
| Metaspace | Class metadata | Classloader leak, dynamic proxies, Groovy/bytecode frameworks, hot redeploy without restart |
| Direct buffer memory | Off-heap (NIO) | Netty, gRPC, large ByteBuffer use; -XX:MaxDirectMemorySize too low |
| unable to create new native thread | OS / process limits | Thread explosion, small stack + huge thread count, ulimit / cgroup pids limit |
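The table can be turned into a small lookup for log triage. This is a minimal sketch: the class name OomClassifier and the Region enum are hypothetical, and the substrings are the typical messages from the table, so adjust them if your JDK words an error differently.

```java
import java.util.Locale;

/** Maps an OutOfMemoryError message to the memory region listed in the table. */
final class OomClassifier {

    enum Region { HEAP, METASPACE, DIRECT_MEMORY, NATIVE_THREAD, UNKNOWN }

    static Region classify(String message) {
        if (message == null) {
            return Region.UNKNOWN;
        }
        String m = message.toLowerCase(Locale.ROOT);
        if (m.contains("java heap space") || m.contains("gc overhead limit exceeded")) {
            return Region.HEAP;
        }
        if (m.contains("metaspace")) {
            return Region.METASPACE;
        }
        if (m.contains("direct buffer memory")) {
            return Region.DIRECT_MEMORY;
        }
        // Matches both "unable to create new native thread" and newer wordings.
        if (m.contains("unable to create") && m.contains("native thread")) {
            return Region.NATIVE_THREAD;
        }
        return Region.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("java.lang.OutOfMemoryError: Java heap space"));
        System.out.println(classify("java.lang.OutOfMemoryError: Metaspace"));
    }
}
```

An UNKNOWN result is a signal to read the full stack trace by hand rather than a diagnosis.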
Sudden vs gradual
- Sudden — one huge allocation (multi-GB report export), traffic spike filling session cache, deployment that changed default heap.
- Gradual — leak or unbounded growth; heap used % climbs over hours until GC cannot keep up.
Kubernetes OOMKilled is not the same thing
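If the container exceeds its memory limit, the Linux kernel's cgroup OOM killer sends SIGKILL and the container exits with code 137 (128 + 9). You may see no Java stack trace, only OOMKilled in kubectl describe pod.
The heap can still have headroom at that moment: the kill happens when the whole process footprint (heap up to -Xmx plus metaspace, thread stacks, direct buffers, and other native memory) exceeds the container limit.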
Rule of thumb: container limit ≥ heap (-Xmx) + metaspace + direct memory + thread stacks + native libs + ~25% headroom. Setting -Xmx equal to the container limit without headroom often causes OOMKilled.
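The rule of thumb can be written out as arithmetic. The sketch below assumes the ~25% headroom is applied to the sum of the components (the text does not say whether it is 25% of the sum or of the limit), and the class name ContainerSizing and all input numbers are illustrative only.

```java
/** Worked version of the sizing rule of thumb; all inputs are estimates you supply. */
final class ContainerSizing {

    /**
     * Returns the smallest container limit (MiB) that covers every component
     * plus a headroom fraction applied to their sum (assumption: 0.25 = 25%).
     */
    static long minLimitMiB(long heapMiB, long metaspaceMiB, long directMiB,
                            int threads, long stackKiBPerThread, long nativeMiB,
                            double headroomFraction) {
        // Thread stacks: threads * stack size, converted from KiB to MiB, rounded up.
        long stacksMiB = (threads * stackKiBPerThread + 1023) / 1024;
        long base = heapMiB + metaspaceMiB + directMiB + stacksMiB + nativeMiB;
        return (long) Math.ceil(base * (1.0 + headroomFraction));
    }

    public static void main(String[] args) {
        // 1024 heap + 256 metaspace + 64 direct + 100 stacks + 64 native = 1508 MiB,
        // and 1508 * 1.25 = 1885 MiB.
        System.out.println(minLimitMiB(1024, 256, 64, 100, 1024, 64, 0.25) + " MiB");
    }
}
```

The component estimates matter more than the formula: thread count and native library usage are usually the least well known, so measure them before trusting the result.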
What — check first (in order)
Move fast but preserve evidence. Restarting clears the heap unless dumps are configured.
- Read the exact OOM message and stack trace. Classify heap vs metaspace vs direct vs threads. Note the allocating class in the top frames (e.g. byte[], HashMap, Netty buffer).
- Check if it is JVM OOM or OOMKilled. App log: OutOfMemoryError. K8s: Last State: Terminated, Reason: OOMKilled, exit 137. Metrics: container memory at limit.
- Correlate time with deploy, traffic, and batch jobs. New release? Cron at :00? Marketing email spike? One tenant import?
- Grab JVM and GC metrics for the window. Heap used vs max, GC pause time, allocation rate, metaspace used. Look for sawtooth that stops reclaiming (leak) vs cliff (spike).
- Capture a heap dump (if heap OOM). If the process is still alive: jcmd <pid> GC.heap_dump /tmp/heap.hprof. If already configured: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps (path on persistent volume in K8s).
- Capture thread dump. jcmd <pid> Thread.print or jstack: rules out thread explosion and shows what was running during growth.
- Enable or pull GC logs. Java 11+: -Xlog:gc*:file=/logs/gc.log:time,uptime,level,tags. Shows if GC was thrashing before OOM.
- Record JVM flags and container limits. jcmd <pid> VM.flags, VM.system_properties; compare -Xmx/-Xms to cgroup memory limit.
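When shell access to the pod is slow to get, some of the same facts can be read from inside the process through the standard java.lang.management beans. The sketch below is one way to do that, for example behind an internal admin endpoint; the class name JvmMemorySnapshot is hypothetical, and it complements jcmd rather than replacing a heap dump.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Collects heap, memory pool, thread, and flag information from the running JVM. */
final class JvmMemorySnapshot {

    static String describe() {
        StringBuilder sb = new StringBuilder();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() returns -1 when the maximum is undefined.
        sb.append("heap used=").append(heap.getUsed())
          .append(" max=").append(heap.getMax()).append('\n');
        // Pools include the heap generations/regions and Metaspace on HotSpot.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            sb.append(pool.getName()).append(" used=").append(usage.getUsed())
              .append(" max=").append(usage.getMax()).append('\n');
        }
        sb.append("live threads=")
          .append(ManagementFactory.getThreadMXBean().getThreadCount()).append('\n');
        sb.append("jvm args=")
          .append(ManagementFactory.getRuntimeMXBean().getInputArguments()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Paste the output into the ticket next to the container limit so the two can be compared at a glance.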
What to log in your runbook ticket
- Timestamp (UTC):
- OOM message:
- Pod/node:
- Image tag / git SHA:
- -Xmx -Xms MaxMetaspaceSize MaxDirectMemorySize:
- Container memory request/limit:
- Traffic QPS vs baseline:
- Heap dump path (yes/no):
- GC log snippet (last 50 lines):
Analyze the heap dump (heap OOM)
- Open in Eclipse MAT or VisualVM.
- Run Leak Suspects Report — dominator tree shows what retains the most bytes.
- Sort retained heap by class. Common culprits: byte[], char[], collections holding millions of entries, cached DTO maps.
- Find GC roots path to the big object: who still references it?
- Compare two dumps taken 30 minutes apart (if gradual) — growing dominator = leak.
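The two-dump comparison can be sketched as a diff of retained sizes per class. This assumes you have copied class names and retained bytes out of the analysis tool into two maps by hand; the class name DumpDiff and the sample numbers are hypothetical, and nothing here parses an .hprof file.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Ranks classes by how much their retained size grew between two heap dumps. */
final class DumpDiff {

    /** Returns (class name, growth in bytes) pairs, largest growth first; shrinking classes are skipped. */
    static List<Map.Entry<String, Long>> growth(Map<String, Long> earlier, Map<String, Long> later) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : later.entrySet()) {
            long before = earlier.getOrDefault(e.getKey(), 0L);
            long delta = e.getValue() - before;
            if (delta > 0) {
                out.add(new AbstractMap.SimpleEntry<>(e.getKey(), Long.valueOf(delta)));
            }
        }
        out.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> first = new HashMap<>();
        first.put("byte[]", 100L);
        first.put("java.util.HashMap", 50L);
        Map<String, Long> second = new HashMap<>();
        second.put("byte[]", 400L);
        second.put("java.util.HashMap", 50L);
        System.out.println(growth(first, second));
    }
}
```

A class that tops this list in consecutive comparisons is the first place to look for the retaining reference.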
How — fix now, fix for good, prevent
Immediate mitigation (stop the bleeding)
| Situation | Action | Caution |
|---|---|---|
| Single bad pod | Restart pod; drain traffic if repeatable | Without dump, root cause may be lost |
| Traffic-driven spike | Scale replicas; rate-limit at gateway | Does not fix leak; spreads load |
| Known heavy job | Disable feature flag / pause batch | Communicate to stakeholders |
| OOMKilled mis-sizing | Raise container limit or lower -Xmx with headroom | Do not only raise limit without understanding native use |
| Emergency heap | Temporary -Xmx bump after dump captured | Buys time; may hide leak |
Durable fixes by root cause
- Memory leak: fix retaining reference (static map, listener not removed, ThreadLocal in pool). Add regression test; heap dump in CI for critical paths optional.
- Unbounded cache: cap size (Caffeine maximumSize), TTL, weight by bytes not entry count.
- Oversized payloads: stream files; paginate DB; reject uploads above limit at API.
- Metaspace: fix classloader lifecycle; increase -XX:MaxMetaspaceSize only after leak ruled out.
- Direct memory: tune Netty allocator; set -XX:MaxDirectMemorySize; ensure buffers released.
- Thread explosion: bounded thread pools; async with backpressure; fix runaway new Thread().
- GC thrashing: increase heap if truly needed; tune G1 (MaxGCPauseMillis, region size); reduce allocation rate in hot path.
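To illustrate the cache cap, here is a JDK-only sketch of an entry-count bound with least-recently-used eviction. It is a stand-in for the idea, not Caffeine's API: the class name BoundedLruCache is hypothetical, it has no TTL or byte weighting, and it is not thread-safe unless wrapped (for example with Collections.synchronizedMap).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** LinkedHashMap-based cache that evicts the least recently used entry beyond maxEntries. */
final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true: iteration order is least to most recently used
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; returning true drops the eldest entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedLruCache<String, Integer> cache = new BoundedLruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // touch "a" so "b" becomes the eldest
        cache.put("c", 3); // evicts "b"
        System.out.println(cache.keySet());
    }
}
```

Bounding by entry count only protects the heap when entries are of similar size; for large or variable values, a byte-weighted limit as noted in the list is the safer choice.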
JVM flags worth setting in production
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/dumps/heap.hprof
-XX:+ExitOnOutOfMemoryError
-Xlog:gc*,safepoint:file=/var/log/gc.log:time,uptime,level,tags:filecount=5,filesize=50M
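ExitOnOutOfMemoryError lets orchestration restart an unhealthy JVM instead of limping in GC hell. Mount /var/dumps on an emptyDir or PVC so dumps survive long enough to copy: an emptyDir outlives container restarts but is deleted with the pod, so use a PVC if dumps must survive pod deletion.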
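A startup check can warn when a deployment silently drops these flags. The sketch below compares the JVM's input arguments against a list of required prefixes; the class name OomFlagCheck and the choice of prefixes are assumptions, and whether flags supplied through JAVA_TOOL_OPTIONS appear in getInputArguments() is worth verifying on your JDK.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Reports which of the recommended OOM-related flags are absent from the JVM arguments. */
final class OomFlagCheck {

    static final List<String> REQUIRED_PREFIXES = Arrays.asList(
            "-XX:+HeapDumpOnOutOfMemoryError",
            "-XX:HeapDumpPath=",
            "-XX:+ExitOnOutOfMemoryError",
            "-Xlog:gc");

    /** Returns the required prefixes that no argument starts with, in declaration order. */
    static List<String> missing(List<String> jvmArgs) {
        List<String> missing = new ArrayList<>();
        for (String prefix : REQUIRED_PREFIXES) {
            boolean found = false;
            for (String arg : jvmArgs) {
                if (arg.startsWith(prefix)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(prefix);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Missing OOM flags: " + missing(actual));
    }
}
```

Logging the result at startup is usually enough; failing the boot on a missing flag is a stricter policy that depends on how much you trust the check.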
Kubernetes example (memory aligned)
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2Gi"
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xms1g -Xmx1400m -XX:MaxMetaspaceSize=256m"
This leaves about 650 MiB (2048 MiB limit minus 1400 MiB heap) for metaspace, thread stacks, direct buffers, and JVM overhead inside the 2 GiB limit.
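The headroom in this example can also be checked from inside the container. The sketch below assumes cgroup v2, where the limit is exposed at /sys/fs/cgroup/memory.max as a byte count or the word max; on cgroup v1 the file is /sys/fs/cgroup/memory/memory.limit_in_bytes instead. The class name HeadroomCheck is hypothetical, and Runtime.maxMemory() is only an approximation of -Xmx.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Compares the JVM's maximum heap with the container memory limit. */
final class HeadroomCheck {

    /** Parses cgroup v2 memory.max content: a byte count, or "max" for unlimited (returned as -1). */
    static long parseCgroupLimit(String content) {
        String s = content.trim();
        return s.equals("max") ? -1L : Long.parseLong(s);
    }

    /** Fraction of the container limit not claimed by the max heap; NaN when there is no limit. */
    static double headroomFraction(long limitBytes, long maxHeapBytes) {
        if (limitBytes <= 0) {
            return Double.NaN;
        }
        return (double) (limitBytes - maxHeapBytes) / limitBytes;
    }

    public static void main(String[] args) throws Exception {
        Path limitFile = Paths.get("/sys/fs/cgroup/memory.max"); // assumption: cgroup v2
        if (!Files.exists(limitFile)) {
            System.out.println("No cgroup v2 memory.max found; skipping check");
            return;
        }
        String content = new String(Files.readAllBytes(limitFile), StandardCharsets.UTF_8);
        long limit = parseCgroupLimit(content);
        double headroom = headroomFraction(limit, Runtime.getRuntime().maxMemory());
        System.out.println("Non-heap headroom fraction: " + headroom);
    }
}
```

For the values in the example, the fraction is 648 / 2048, roughly 32% of the limit left for everything that is not heap.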
Prevention and guardrails
- Alert on heap used > 85% for 5 min, GC pause p99, metaspace growth, container memory > 90% of limit.
- Load test with production-like heap (same -Xmx); soak test 4–8 hours to catch leaks.
- Code review rules: no unbounded in-memory lists from DB; close streams; bounded caches documented.
- Capacity dashboard — heap after GC vs max, allocation rate, pod restarts (OOMKilled count).
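The "above 85% for 5 minutes" style of alert is normally expressed in the monitoring system, but the logic is simple enough to sketch. The class below is hypothetical (SustainedThreshold is not from any library) and assumes samples arrive in time order; it fires only once the value has stayed above the threshold for the whole window, so a single spike does not page anyone.

```java
/** Fires when a metric stays above a threshold for at least a given duration. */
final class SustainedThreshold {
    private final double threshold;
    private final long windowSeconds;
    private long breachStart = -1; // time of the first sample in the current breach, -1 if none

    SustainedThreshold(double threshold, long windowSeconds) {
        this.threshold = threshold;
        this.windowSeconds = windowSeconds;
    }

    /** Feeds one sample; returns true once the breach has lasted windowSeconds or longer. */
    boolean record(long epochSeconds, double value) {
        if (value <= threshold) {
            breachStart = -1; // any sample at or below the threshold resets the breach
            return false;
        }
        if (breachStart < 0) {
            breachStart = epochSeconds;
        }
        return epochSeconds - breachStart >= windowSeconds;
    }

    public static void main(String[] args) {
        SustainedThreshold heapAlert = new SustainedThreshold(0.85, 300);
        System.out.println(heapAlert.record(0, 0.90));   // breach starts
        System.out.println(heapAlert.record(150, 0.92)); // still inside the window
        System.out.println(heapAlert.record(300, 0.91)); // sustained for 300 s
    }
}
```

Feeding it heap-after-GC rather than raw heap used tends to give fewer false alarms, since raw usage climbs toward the maximum between collections even in a healthy service.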
Interview one-liner
“I classify the OOM by message, confirm whether it’s JVM heap or cgroup OOMKilled, capture heap dump and GC logs before restart, analyze dominators in MAT, then fix the retaining path or sizing—not just bump -Xmx.”
Related scenarios
- Suspected memory leak — confirm with two heap dumps and MAT dominators.
- Frequent GC pauses — tuning after heap is understood.
- Crash without clear errors — OOMKilled, exit 137, hs_err_pid.
- Kubernetes debugging — pod events and resource limits.