RSS grows but the Java heap looks stable
Scenario
Dashboards show heap used flat at 60% of -Xmx, but container RSS (resident memory) climbs until the pod is OOMKilled with no OutOfMemoryError in logs. Or top shows the Java process much larger than heap charts suggest. What is eating memory outside the heap?
After reading, you should be able to:
- Map JVM memory regions beyond the Java heap.
- Use Native Memory Tracking (NMT) and container metrics to find the gap.
- Explain why `-Xmx` alone does not size a Kubernetes memory limit.
- Fix direct buffers, metaspace, threads, and native leaks.
Related: Crash without clear errors, OutOfMemoryError, Memory leak (heap-only leaks).
Why — the JVM uses more than the heap
Heap (-Xmx) holds your Java objects. RSS is physical RAM the OS assigns to the process—it includes heap plus everything else.
Kubernetes cgroup memory limits count the container's resident memory (plus page cache charged to the cgroup), not “heap used” in JMX.
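To make the heap-versus-RSS distinction concrete, here is a minimal sketch that prints both numbers from inside the process. It assumes Linux (it reads `VmRSS` from `/proc/self/status`); the class name `HeapVsRss` and its helper are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class HeapVsRss {
    /** Extracts the VmRSS value (in kB) from /proc/<pid>/status lines, or -1 if absent. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // Typical line: "VmRSS:\t  123456 kB"
                String[] parts = line.substring("VmRSS:".length()).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        Path status = Path.of("/proc/self/status"); // Linux-only assumption
        if (!Files.exists(status)) {
            System.out.println("No /proc/self/status here; this sketch only works on Linux.");
            return;
        }
        long heapUsedKb = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed() / 1024;
        long rssKb = parseVmRssKb(Files.readAllLines(status));
        System.out.printf("heap used: %d kB, RSS: %d kB, gap: %d kB%n",
                heapUsedKb, rssKb, rssKb - heapUsedKb);
    }
}
```

Even for a trivial program the gap is usually non-zero, because metaspace, thread stacks, and the code cache are resident before any application object is allocated.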
What lives outside the Java heap
| Region | Controlled by | Grows when |
|---|---|---|
| Metaspace | -XX:MaxMetaspaceSize | Classes loaded, classloader leaks, dynamic proxies |
| Direct / off-heap buffers | -XX:MaxDirectMemorySize (default ≈ -Xmx) | Netty, gRPC, NIO, Kafka clients, large ByteBuffer.allocateDirect |
| Thread stacks | -Xss × thread count | Thousands of threads (pool misconfig, leak) |
| Code cache | -XX:ReservedCodeCacheSize | JIT compilation, many hot methods |
| GC structures | Collector-specific | Large heap, G1 regions |
| JNI / native libraries | Library malloc | Embedded DB, crypto, compression, buggy .so |
| JVM internal / malloc arena | glibc allocator | Native allocations; may not return RAM to OS |
| Mapped files | — | MappedByteBuffer, memory-mapped indexes |
Rough mental model (4 GiB container):

    -Xmx2g                          → Java objects (what JMX "heap used" tracks)
    Metaspace                       → 150–300 MiB typical
    Direct memory                   → 200 MiB – 1 GiB+ (Netty-heavy services)
    Thread stacks                   → 500 threads × 1 MiB = 500 MiB
    Code cache + GC + JVM overhead  → 200–400 MiB
    ─────────────────────────────────────────
    RSS can exceed 3.5 GiB while "heap" shows 1.2 GiB used
Why RSS keeps climbing while heap is flat
- Native leak — C/C++ library or JNI not freeing; RSS up, heap flat.
- Direct buffer leak — buffers not released; counted in native, not heap.
- Metaspace leak — classloaders retained; metaspace grows unbounded until cap.
- Thread leak — new threads never stopped; stack memory adds up.
- Allocator does not unmap — glibc holds freed native pages; RSS high after spike (may not be leak).
- Wrong limit math: you sized the container for `-Xmx` only → OOMKilled.
Do not use heap charts alone for K8s resources.limits.memory. Size for total process RSS with headroom, or use NMT to measure components.
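The direct buffer case from the list above can be reproduced in a few lines. The sketch below deliberately retains direct buffers and reads the JVM's `direct` buffer pool through the standard `BufferPoolMXBean`; the class and method names are hypothetical, and the buffer sizes are arbitrary illustration values.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class DirectBufferGap {
    /** Bytes currently held by the "direct" buffer pool, or -1 if the pool is not exposed. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    /** Allocates direct buffers and keeps them reachable, imitating a leak. */
    static List<ByteBuffer> retainDirect(int count, int sizeBytes) {
        List<ByteBuffer> held = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            held.add(ByteBuffer.allocateDirect(sizeBytes));
        }
        return held;
    }

    public static void main(String[] args) {
        long heapBefore = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        long directBefore = directBytesUsed();

        List<ByteBuffer> held = retainDirect(16, 1 << 20); // 16 x 1 MiB, arbitrary sizes

        long heapAfter = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        long directAfter = directBytesUsed();
        System.out.printf("retained %d buffers%n", held.size());
        System.out.printf("direct pool grew by %d bytes%n", directAfter - directBefore);
        System.out.printf("heap used changed by %d bytes%n", heapAfter - heapBefore);
    }
}
```

The direct pool should grow by roughly the allocated amount while heap used barely moves, which is the pattern a heap-only dashboard cannot show.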
What — measure the gap (in order)
- Compare three numbers on the same graph
  JMX `heap used`, process `RSS` (`container_memory_working_set_bytes`), cgroup `memory.limit`. Gap widening over time = off-heap growth.
- Enable Native Memory Tracking (staging first)

      -XX:NativeMemoryTracking=summary   # or detail for deep dives (more overhead)
      jcmd <pid> VM.native_memory baseline
      jcmd <pid> VM.native_memory summary
      jcmd <pid> VM.native_memory summary.diff

  Run `baseline` first, then `summary.diff` after a load test to see what grew.
- Read NMT categories
  Look for `Internal`, `Thread`, `Code`, `GC`, `Class` (metaspace), `Other`. Largest delta = investigation target.
- Check direct buffer pools
  JMX: `java.nio:type=BufferPool,name=direct` (MemoryUsed, Count, TotalCapacity). Netty: enable the resource leak detector in staging (`-Dio.netty.leakDetection.level=paranoid`).
- Thread count over time
  Run `jcmd <pid> Thread.print` and count the thread entries, or read JMX `ThreadCount`. Climbing threads → stack RSS climb.
- Metaspace
  JMX `Metaspace Used`; NMT `Class` row. Growth after redeploys without a JVM restart → classloader leak suspect.
- pmap / smaps (Linux)

      pmap -x <pid> | tail -1        # total RSS
      cat /proc/<pid>/smaps_rollup

  Large anonymous regions outside the heap mapping → native or direct.
- Rule out “allocator hoarding”
  RSS high but stable after load drops: glibc may not be returning freed memory to the OS. Test with `MALLOC_ARENA_MAX=2`. Also confirm `-XX:+UseContainerSupport` (default on JDK 10+) is in effect, so the JVM sizes itself from the cgroup limit.
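The JMX checks in the steps above (buffer pools, thread count, metaspace) can be sampled together from inside the JVM. This is a minimal sketch using the standard `java.lang.management` beans; it assumes a HotSpot JVM, where the metaspace memory pool is named `Metaspace`, and the class name `OffHeapSnapshot` is hypothetical.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

final class OffHeapSnapshot {
    /** Used bytes of the "Metaspace" pool (HotSpot naming assumed), or -1 if not found. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1 : usage.getUsed();
            }
        }
        return -1;
    }

    /** Current number of live threads, daemon and non-daemon. */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    public static void main(String[] args) {
        // Buffer pools: typically "direct" and "mapped".
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("buffer pool %s: count=%d used=%d capacity=%d%n",
                    pool.getName(), pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
        System.out.printf("metaspace used: %d bytes%n", metaspaceUsedBytes());
        System.out.printf("live threads:   %d%n", liveThreads());
    }
}
```

Logging this snapshot on a timer, or exporting the same values through a metrics library, gives the trend lines the steps ask for without attaching an external JMX client.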
Example NMT summary (interpretation)
    Total: reserved=4194304KB, committed=3145728KB
    -  Java Heap (reserved=2097152KB, committed=1572864KB)
    -  Class (reserved=262144KB, committed=180224KB)      ← metaspace
    -  Thread (reserved=524288KB, committed=524288KB)     ← many threads
    -  Code (reserved=245760KB, committed=120000KB)
    -  GC (reserved=...)
If Java Heap committed is steady but Total committed rises, focus on non-heap rows in summary.diff.
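If you save two `summary` outputs, the comparison can be automated. The sketch below parses the `committed` value per category and reports the non-heap category that grew the most. It assumes the default KB units shown in the example above; the class name, the regular expression, and the "after" sample are hypothetical illustration values, not output from a real run.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NmtSummaryDiff {
    // Matches rows such as "-  Java Heap (reserved=2097152KB, committed=1572864KB)".
    private static final Pattern ROW = Pattern.compile(
            "^\\s*-\\s*([A-Za-z][A-Za-z ]*?)\\s*\\(reserved=(\\d+)KB, committed=(\\d+)KB");

    /** Maps each NMT category name to its committed size in KB; unparseable lines are skipped. */
    static Map<String, Long> committedKb(List<String> summaryLines) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : summaryLines) {
            Matcher m = ROW.matcher(line);
            if (m.find()) {
                out.put(m.group(1), Long.parseLong(m.group(3)));
            }
        }
        return out;
    }

    /** Name of the non-heap category with the largest committed growth, or null if none grew. */
    static String largestNonHeapGrowth(Map<String, Long> before, Map<String, Long> after) {
        String worst = null;
        long worstDelta = 0;
        for (Map.Entry<String, Long> e : after.entrySet()) {
            if (e.getKey().equals("Java Heap")) {
                continue;
            }
            long delta = e.getValue() - before.getOrDefault(e.getKey(), 0L);
            if (delta > worstDelta) {
                worstDelta = delta;
                worst = e.getKey();
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        List<String> before = List.of(
                "Total: reserved=4194304KB, committed=3145728KB",
                "-  Java Heap (reserved=2097152KB, committed=1572864KB)",
                "-  Class (reserved=262144KB, committed=180224KB)",
                "-  Thread (reserved=524288KB, committed=524288KB)",
                "-  Code (reserved=245760KB, committed=120000KB)");
        List<String> after = List.of( // hypothetical later sample
                "-  Java Heap (reserved=2097152KB, committed=1572864KB)",
                "-  Class (reserved=262144KB, committed=185000KB)",
                "-  Thread (reserved=786432KB, committed=786432KB)",
                "-  Code (reserved=245760KB, committed=121000KB)");
        System.out.println("Investigate first: "
                + largestNonHeapGrowth(committedKb(before), committedKb(after)));
    }
}
```

In practice `jcmd <pid> VM.native_memory summary.diff` already prints per-category deltas; a parser like this is mainly useful when samples are collected on a schedule and compared offline.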
Container vs JVM view
| Metric source | What it shows |
|---|---|
| JMX heap used / max | Java objects only |
| container_memory_working_set_bytes | What counts toward K8s limit (≈ RSS) |
| NMT Total committed | JVM’s view of committed native memory + heap |
| kubectl top pod | Current working set, useful as a quick check |
How — fix, size containers, monitor
Size Kubernetes memory limit
memory_limit ≥ Xmx
+ MaxMetaspaceSize (or observed metaspace peak)
+ expected direct memory peak
+ (thread_count × stack_size)
+ 300–500 MiB JVM/GC/misc headroom
Example: -Xmx2g, heavy Netty → use 3–4 GiB limit, not 2 GiB.
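The sizing formula is plain addition, and a small helper keeps the inputs explicit. The sketch below is illustrative only: the class and parameter names are hypothetical, and the sample inputs reuse figures from this page (2 GiB heap, 256 MiB metaspace cap, 512 MiB direct memory cap, 500 threads with 1 MiB stacks) plus a 400 MiB headroom picked from the 300–500 MiB range.

```java
final class ContainerLimitEstimate {
    /**
     * Sums the components of the sizing formula and returns a lower bound
     * for the container memory limit in MiB.
     */
    static long estimateLimitMiB(long xmxMiB, long metaspaceMiB, long directPeakMiB,
                                 int threadCount, long stackKiB, long headroomMiB) {
        // Thread stacks: thread_count x stack_size, rounded up to whole MiB.
        long stacksMiB = (threadCount * stackKiB + 1023) / 1024;
        return xmxMiB + metaspaceMiB + directPeakMiB + stacksMiB + headroomMiB;
    }

    public static void main(String[] args) {
        long limitMiB = estimateLimitMiB(2048, 256, 512, 500, 1024, 400);
        System.out.printf("memory limit should be at least %d MiB (%.2f GiB)%n",
                limitMiB, limitMiB / 1024.0);
    }
}
```

With these inputs the estimate lands between 3 and 4 GiB, consistent with the example above. Thread stacks are counted at their full reserved size here, which is a conservative assumption.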
Fix by category
| NMT / symptom | Fix |
|---|---|
| Direct memory | Release buffers; cap Netty pools; -XX:MaxDirectMemorySize=512m; fix leak |
| Thread | Reduce pools; fix thread leak; lower -Xss if stacks huge |
| Class / metaspace | Fix classloader leak; set MaxMetaspaceSize; restart on redeploy |
| Internal / Other | JDK upgrade; identify JNI library; async-profiler native alloc |
| Heap flat but RSS at limit | Raise limit or lower off-heap consumers—not raise -Xmx alone |
Flags for production visibility
    -XX:+UseContainerSupport
    -XX:MaxRAMPercentage=75.0          # optional: derive max heap from cgroup
    -XX:NativeMemoryTracking=summary   # staging/debug pods only; small overhead
    -XX:MaxDirectMemorySize=512m
    -XX:MaxMetaspaceSize=256m
Do not leave NativeMemoryTracking=detail on all prod pods—use on canary or during incidents.
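One way to catch a stray `detail` setting is to check the JVM's own startup arguments. The sketch below reads them through the standard `RuntimeMXBean`; the class name `NmtFlagCheck` and the warning text are hypothetical, and it assumes NMT is off when the flag is absent and that the last occurrence of a repeated flag is the one that applies.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class NmtFlagCheck {
    private static final String PREFIX = "-XX:NativeMemoryTracking=";

    /** Returns the configured NMT level ("off" when the flag is absent); the last occurrence is used. */
    static String nmtLevel(List<String> jvmArgs) {
        String level = "off";
        for (String arg : jvmArgs) {
            if (arg.startsWith(PREFIX)) {
                level = arg.substring(PREFIX.length());
            }
        }
        return level;
    }

    public static void main(String[] args) {
        String level = nmtLevel(ManagementFactory.getRuntimeMXBean().getInputArguments());
        System.out.println("NativeMemoryTracking level: " + level);
        if ("detail".equals(level)) {
            System.out.println("WARNING: NMT detail is enabled; keep it to canary or incident pods.");
        }
    }
}
```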
Monitoring
- Dashboard panel: `RSS - heap_used` (“off-heap gap”) trending up = alert.
- JMX scrape: direct buffer pool MemoryUsed, Metaspace Used, ThreadCount.
- Alert: working set > 90% of limit while heap < 70% of max → sizing/off-heap issue.
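The alert condition in the last bullet can be written as a small predicate. The sketch below uses the 90% and 70% thresholds from that bullet; the class name, method name, and sample values are hypothetical, and in a real setup the same rule would more likely live in the monitoring system than in application code.

```java
final class OffHeapAlert {
    /**
     * True when the container is close to its memory limit while the heap
     * still has room, which points at sizing or off-heap growth.
     */
    static boolean shouldAlert(long workingSetBytes, long limitBytes,
                               long heapUsedBytes, long heapMaxBytes) {
        if (limitBytes <= 0 || heapMaxBytes <= 0) {
            return false; // no limit or unknown heap max: nothing to compare against
        }
        boolean nearLimit = workingSetBytes > 0.90 * limitBytes;
        boolean heapHasRoom = heapUsedBytes < 0.70 * heapMaxBytes;
        return nearLimit && heapHasRoom;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        // Hypothetical sample: 3800 MiB working set in a 4 GiB container, 1228 MiB of a 2 GiB heap used.
        System.out.println(shouldAlert(3800 * mib, 4096 * mib, 1228 * mib, 2048 * mib));
    }
}
```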
When heap leak tools mislead
MAT heap dumps only show Java heap objects. If RSS grows and heap dumps look fine, switch to NMT diff and direct buffer metrics—not another heap dump.
Interview one-liner
“Heap used is only the object heap; RSS includes metaspace, thread stacks, direct buffers, and native code. I compare container working set to JMX heap, run jcmd VM.native_memory summary.diff, check direct BufferPool and thread count, then size the cgroup limit for total footprint—not just -Xmx.”
Related scenarios
- Crash without clear errors — OOMKilled when RSS hits limit.
- Memory leak — when heap objects are the retainers.
- Metaspace growth — classloader leaks.
- Thread pool exhausted (coming soon)