Frequent GC pauses hurting performance
Scenario
Latency spikes correlate with GC: p99 API time jumps every few seconds, dashboards show long stop-the-world pauses, or users complain the app “stutters.” CPU may be busy in GC threads. How do you diagnose and tune instead of just throwing more heap at the problem?
After reading, you should be able to:
- Read unified GC logs and separate young vs full collections.
- Link pauses to allocation rate, heap size, and live set instead of guessing at flags.
- Tune G1 (or choose ZGC/Shenandoah) with measurable before/after metrics.
- Know when pauses mean a memory leak or imminent OOM.
Why — what GC pauses are and what triggers them
Most collectors need stop-the-world (STW) phases: application threads pause while GC marks, relocates, or compacts memory. “Frequent GC pauses” usually means either collections happen too often (many short young GCs) or individual pauses are too long (old-gen / full collections, heap nearly full).
Collectors you see in production (Java 11+)
| Collector | Default? | Pause profile | Typical use |
|---|---|---|---|
| G1 | Yes (server) | Target max pause; mixed GC for old gen | General services, 4–32+ GB heap |
| ZGC | No | Sub-ms to low-ms STW, mostly concurrent | Large heap, strict latency SLO |
| Shenandoah | No | Low pause, concurrent compaction | Similar niche to ZGC |
| Parallel | No | Throughput-oriented; longer STW | Batch, not latency-sensitive APIs |
Root-cause categories
- High allocation rate — young gen fills fast → many minor GCs; CPU spent in GC; “GC overhead” rises.
- Heap too small for live set — old gen fills → frequent mixed/full GCs; each pause reclaims little.
- Memory leak or unbounded cache — live set grows → GC works harder; pauses lengthen until OOM. See memory leak.
- Humongous objects (G1) — arrays > 50% of G1 region size bypass young gen; extra mixed cycles.
- Aggressive pause goal — -XX:MaxGCPauseMillis=50 with huge heap → G1 collects more often, throughput drops.
- Full GC events — metaspace, System.gc(), heap exhaustion, or collector fallback → multi-second pauses.
- Oversized heap without need — very large G1 regions can make individual mixed GCs longer (tune region size).
Symptom vs cause: “200 ms pauses” is a symptom. GC logs tell you which phase (Young GC vs Mixed vs Full) and whether the heap is full of garbage or allocation is too high.
What users feel
- API p99/p999 spikes aligned with GC timestamps in logs.
- Thread pools queue up during STW—all workers frozen together.
- Kafka consumers miss max.poll.interval.ms if a pause exceeds the threshold.
- Health checks fail if pause > probe timeout.
What — diagnose with logs and metrics (in order)
- Confirm pauses in metrics. Micrometer/Prometheus: jvm.gc.pause sum/count and max pause; correlate with the HTTP latency histogram. JMX: GarbageCollectorMXBean CollectionTime.
- Enable unified GC logging (if missing). Java 11+: -Xlog:gc*,safepoint:file=/var/log/gc.log:time,uptime,level,tags:filecount=5,filesize=50M
- Classify each pause event. Look for Pause Young (Normal), Pause Young (Prepare Mixed), Pause Mixed (printed as Pause Young (Mixed) on JDK 11+), and Pause Full. Full or mixed pauses on a full old gen point to sizing or a leak.
- Read heap before/after each cycle. Example line pattern: Heap: 1200M->800M(1800M). If post-GC old gen never drops, the live set is huge or there is a leak.
- Calculate allocation rate. From logs or JFR: MB/sec allocated. High rate + small young gen = constant minor GC.
- Check GC time %. If > 10–15% of wall time is spent in GC at moderate load, tuning or a code change is needed (not always “add RAM”).
- Rule out explicit System.gc(). Search the codebase and dependencies; RMI DGC and some caches call it. Disable with -XX:+DisableExplicitGC only if you understand the direct-buffer cleanup implications.
- Compare deploy / flag changes. A new -Xmx, a collector switch, or a JDK upgrade can change the pause profile dramatically.
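The "confirm pauses in metrics" and "check GC time %" steps can also be approximated from inside the JVM. The following is a minimal sketch, assuming the standard java.lang.management API is available; the class name GcOverheadSampler and its method names are illustrative only. Note that getCollectionTime() reports approximate accumulated collection time, which for mostly concurrent collectors may include more than stop-the-world time, so treat the result as a rough overhead figure rather than a pause measurement.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Illustrative helper (hypothetical name): estimates the share of wall time spent in GC. */
final class GcOverheadSampler {

    /** Sums getCollectionTime() over all collectors; beans reporting -1 (undefined) are skipped. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = bean.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** GC time as a percentage of the wall-clock interval between two samples. */
    static double overheadPercent(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            throw new IllegalArgumentException("wall-clock interval must be positive");
        }
        return 100.0 * gcMillisDelta / wallMillisDelta;
    }

    public static void main(String[] args) throws InterruptedException {
        long gcStart = totalGcMillis();
        long wallStart = System.nanoTime();
        Thread.sleep(10_000); // sampling window; a real service would use a scheduled task
        long gcDelta = totalGcMillis() - gcStart;
        long wallDelta = (System.nanoTime() - wallStart) / 1_000_000;
        System.out.printf("GC overhead: %.2f%%%n", overheadPercent(gcDelta, wallDelta));
    }
}
```

A value persistently above the 10–15% range mentioned in the list would be the signal to look at allocation or sizing.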
Example GC log lines (G1)
[2026-05-27T10:15:01.234+0000] GC(412) Pause Young (Normal) (G1 Evacuation Pause)
[2026-05-27T10:15:01.234+0000] GC(412) Eden regions: 120->0(100)
[2026-05-27T10:15:01.245+0000] GC(412) Pause Young (Normal) 1800M->650M(2048M) 11.234ms
[2026-05-27T10:18:44.891+0000] GC(418) Pause Mixed
[2026-05-27T10:18:45.313+0000] GC(418) Pause Mixed 1950M->1200M(2048M) 421.556ms ← long pause, old gen pressure
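Classifying pauses by hand gets tedious on a large log, so a small parser can help. The sketch below assumes pause summary lines shaped like the sample above, with heap sizes printed in M; real logs may use other units or extra decorations, and the class name GcPauseLine and its fields are hypothetical. It accepts both the Pause Mixed and Pause Young (Mixed) spellings.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for G1 pause summary lines; assumes sizes are logged in M. */
final class GcPauseLine {
    enum Kind { YOUNG, MIXED, FULL }

    // Matches e.g. "GC(412) Pause Young (Normal) 1800M->650M(2048M) 11.234ms",
    // with an optional "(cause)" between the pause label and the heap sizes.
    private static final Pattern SUMMARY = Pattern.compile(
            "GC\\((\\d+)\\) Pause (Young \\([^)]*\\)|Mixed|Full)(?: \\([^)]*\\))? "
            + "(\\d+)M->(\\d+)M\\((\\d+)M\\) (\\d+(?:\\.\\d+)?)ms");

    final int gcId;
    final Kind kind;
    final long beforeMb;
    final long afterMb;
    final long capacityMb;
    final double pauseMs;

    private GcPauseLine(int gcId, Kind kind, long beforeMb, long afterMb,
                        long capacityMb, double pauseMs) {
        this.gcId = gcId;
        this.kind = kind;
        this.beforeMb = beforeMb;
        this.afterMb = afterMb;
        this.capacityMb = capacityMb;
        this.pauseMs = pauseMs;
    }

    /** Returns the parsed summary, or empty for lines that are not pause summaries. */
    static Optional<GcPauseLine> parse(String line) {
        Matcher m = SUMMARY.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new GcPauseLine(
                Integer.parseInt(m.group(1)), kindOf(m.group(2)),
                Long.parseLong(m.group(3)), Long.parseLong(m.group(4)),
                Long.parseLong(m.group(5)), Double.parseDouble(m.group(6))));
    }

    private static Kind kindOf(String label) {
        if (label.equals("Full")) {
            return Kind.FULL;
        }
        if (label.equals("Mixed") || label.equals("Young (Mixed)")) {
            return Kind.MIXED;
        }
        return Kind.YOUNG; // Normal, Prepare Mixed, Concurrent Start, ...
    }

    /** Fraction of the occupied heap that this collection freed. */
    double reclaimedFraction() {
        return beforeMb == 0 ? 0.0 : (double) (beforeMb - afterMb) / beforeMb;
    }
}
```

A long pause combined with a low reclaimedFraction() is the pattern the text describes as "each pause reclaims little".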
Tools
| Tool | Use for |
|---|---|
| GCViewer / GCEasy | Upload gc.log; pause charts, throughput |
| jstat | jstat -gcutil <pid> 1000: live S0/S1/E/O/M/CCS utilization plus YGC/FGC counts, sampled every second |
| JFR | JDK Mission Control — allocation and GC cause |
| async-profiler | Find hot allocation sites driving GC frequency |
Decision tree (quick)
Long Pause Full GC?
  yes → leak / heap too small / metaspace → heap dump, sizing
  no → Long Mixed GC?
    yes → reduce live set, tune IHOP / region size, or more heap
    no → Many short Young GC?
      yes → reduce allocation rate or increase young gen (G1 auto) / tune pause goal
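The last branch of the tree hinges on allocation rate, which can be estimated from two consecutive young pauses: the heap growth between the end of one pause and the start of the next, divided by the time between them. A minimal sketch of that arithmetic follows, assuming no other collection ran in between; the class and method names are made up for illustration.

```java
/** Illustrative estimator (hypothetical names): allocation rate between two young collections. */
final class AllocationRate {

    /**
     * @param heapAfterPrevGcMb  heap occupancy right after the earlier pause
     * @param heapBeforeNextGcMb heap occupancy right before the later pause
     * @param secondsBetween     time between the two pauses
     * @return approximate allocation rate in MB/s
     */
    static double mbPerSecond(long heapAfterPrevGcMb, long heapBeforeNextGcMb, double secondsBetween) {
        if (secondsBetween <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return (heapBeforeNextGcMb - heapAfterPrevGcMb) / secondsBetween;
    }

    /** Rough time for an Eden of the given size to fill, i.e. the expected gap between young GCs. */
    static double secondsBetweenYoungGcs(long edenMb, double allocationMbPerSecond) {
        if (allocationMbPerSecond <= 0) {
            throw new IllegalArgumentException("allocation rate must be positive");
        }
        return edenMb / allocationMbPerSecond;
    }
}
```

For instance, a heap that sits at 650 MB after one young GC and reaches 1800 MB two seconds later is allocating roughly 575 MB/s; halving that rate would roughly double the interval between young collections for the same Eden size.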
How — tune, fix code, choose collector
Step 1 — Reduce allocation (often best ROI)
- Reuse buffers; avoid allocating a large byte[] per request.
- Prefer streaming JSON/parsing over loading the full body into memory.
- Fix logging at INFO in hot paths (the message string is built even when the level is disabled).
- Profile with JFR or async-profiler in alloc mode; the top frames are your target.
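As an illustration of the buffer-reuse point, the sketch below copies a stream through one buffer per thread instead of allocating a new byte[] on every call. It is only one possible approach, and it assumes a bounded pool of platform threads; the class name ReusableCopy and the 64 KB buffer size are arbitrary choices. With very many threads the per-thread buffers themselves add to the live set, so a shared pool may fit better.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Illustrative helper (hypothetical name): stream copy that reuses a per-thread buffer. */
final class ReusableCopy {
    private static final int BUFFER_SIZE = 64 * 1024; // arbitrary size chosen for the example

    // One buffer per thread, allocated on first use and reused for every later copy.
    private static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[BUFFER_SIZE]);

    /** Copies all bytes from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = BUFFER.get();
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

The body is also streamed through in chunks rather than loaded whole, which is the second bullet's point in miniature.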
Step 2 — Size the heap correctly
Goal: after full GC, old gen has 30–50% free at steady state. Too small → constant mixed GC; too large → longer pauses and wasted RAM.
# Starting point for G1 service (adjust with logs)
-Xms4g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200
Container: keep -Xmx below cgroup limit with headroom (see OOM guide).
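The sizing goal can be turned into quick arithmetic: if the live set should leave a given percentage free, the heap needs to be the live set divided by the used share. The sketch below is a simplification that applies the free-space target to the whole heap rather than to old gen only, and the 25% container headroom in the usage note is an assumed example value, not a recommendation from this guide; all names are illustrative.

```java
/** Illustrative sizing arithmetic (hypothetical names); integer math avoids rounding surprises. */
final class HeapSizing {

    /** Smallest heap (MB) that leaves targetFreePercent free once only the live set remains. */
    static long heapForLiveSetMb(long liveSetMb, int targetFreePercent) {
        if (liveSetMb < 0 || targetFreePercent < 0 || targetFreePercent >= 100) {
            throw new IllegalArgumentException("invalid live set or free percentage");
        }
        long usedPercent = 100 - targetFreePercent;
        return (liveSetMb * 100 + usedPercent - 1) / usedPercent; // ceiling division
    }

    /** Largest -Xmx (MB) that keeps headroomPercent of the container limit for non-heap memory. */
    static long maxHeapForContainerMb(long containerLimitMb, int headroomPercent) {
        if (containerLimitMb < 0 || headroomPercent < 0 || headroomPercent >= 100) {
            throw new IllegalArgumentException("invalid limit or headroom percentage");
        }
        return containerLimitMb * (100 - headroomPercent) / 100;
    }
}
```

With a 1200 MB live set and a 40% free target, heapForLiveSetMb returns 2000; a 4096 MB container with 25% headroom allows at most 3072 MB of heap, so that sizing fits.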
Step 3 — G1 tuning knobs (when logs show mixed GC pain)
| Flag | Purpose | Caution |
|---|---|---|
| -XX:MaxGCPauseMillis | Soft target for pause length | Too low → excessive GC cycles |
| -XX:InitiatingHeapOccupancyPercent (IHOP) | Heap occupancy threshold that starts the concurrent marking cycle leading to mixed GCs (default 45%; G1 adapts it at runtime by default) | Lower → earlier marking and mixed GC, more concurrent work |
| -XX:G1HeapRegionSize | Region size (1–32 MB) | Affects humongous threshold |
| -XX:G1ReservePercent | Reserve against to-space exhaustion | Rarely changed |
Change one knob at a time; re-run load test; compare gc.log pause p99 and application p99.
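When changing one knob at a time, it helps to confirm what the running JVM actually uses. The following sketch reads the four flags from the table through HotSpotDiagnosticMXBean, which is a HotSpot-specific API; on other JVMs, or for flags a build does not define, it falls back to "unavailable". The class name G1FlagReport is illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper (hypothetical name): reports effective values of the G1 flags above. */
final class G1FlagReport {
    static final String[] FLAGS = {
        "MaxGCPauseMillis", "InitiatingHeapOccupancyPercent", "G1HeapRegionSize", "G1ReservePercent"
    };

    /** Maps each flag to "value (origin)", or "unavailable" when the JVM does not expose it. */
    static Map<String, String> effectiveFlags() {
        HotSpotDiagnosticMXBean bean = null;
        try {
            bean = ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        } catch (RuntimeException e) {
            // Assumption: a non-HotSpot JVM may not provide this bean; report flags as unavailable.
        }
        Map<String, String> report = new LinkedHashMap<>();
        for (String flag : FLAGS) {
            String value = "unavailable";
            if (bean != null) {
                try {
                    VMOption option = bean.getVMOption(flag);
                    value = option.getValue() + " (" + option.getOrigin() + ")";
                } catch (RuntimeException e) {
                    // The flag is not defined in this JVM build; keep "unavailable".
                }
            }
            report.put(flag, value);
        }
        return report;
    }

    public static void main(String[] args) {
        effectiveFlags().forEach((flag, value) -> System.out.println(flag + " = " + value));
    }
}
```

The origin part shows whether a value came from the command line, ergonomics, or the default, which makes before/after comparisons easier to trust.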
Step 4 — When to switch collector
- ZGC or Shenandoah — heap multi-GB, p99 latency SLO < 10 ms, team can run on JDK 17+ and validate.
- Example: -XX:+UseZGC -Xmx8g (flags vary by JDK version; check docs for your release).
- Stay on G1 if throughput and simplicity matter more than tail latency.
Step 5 — Fix leaks and humongous objects
- Rising old gen + long mixed GC → leak hunt before more tuning.
- Split large arrays or increase G1 region size so objects are not “humongous.”
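Whether an array counts as humongous follows from the rule stated earlier: it does when its size exceeds half of the G1 region size. The sketch below applies that rule; the 16-byte array header is an assumption that holds for typical 64-bit HotSpot configurations but is not guaranteed, and the class and method names are hypothetical.

```java
/** Illustrative check (hypothetical names) for G1 humongous allocations. */
final class HumongousCheck {
    static final long MB = 1024L * 1024L;
    static final long ASSUMED_ARRAY_HEADER_BYTES = 16; // assumption: typical 64-bit HotSpot layout

    /** Approximate heap size of a byte[] of the given length, rounded up to 8-byte alignment. */
    static long byteArrayBytes(long length) {
        return (ASSUMED_ARRAY_HEADER_BYTES + length + 7) & ~7L;
    }

    /** An object is humongous when it is larger than half a region. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    /** Smallest power-of-two region size in the 1-32 MB range that avoids humongous, or -1. */
    static long smallestRegionAvoidingHumongous(long objectBytes) {
        for (long region = MB; region <= 32 * MB; region *= 2) {
            if (!isHumongous(objectBytes, region)) {
                return region;
            }
        }
        return -1;
    }
}
```

One consequence worth noticing: a byte[] of exactly 1 MiB is humongous with 2 MB regions because the header pushes it just past the half-region mark, so either a slightly smaller chunk size or 4 MB regions would avoid it.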
Verify improvement
- Baseline: GC pause p99, GC % time, API p99 under fixed load test.
- Apply one change; soak 30+ minutes.
- Accept if GC time < 5% and app p99 improved without raising error rate.
- Document final flags in Helm chart / deployment manifest.
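The acceptance rule above can be written down so that every tuning round is judged the same way. This is a minimal sketch, assuming pause samples in milliseconds and a nearest-rank percentile; the names are illustrative, and the 5% threshold is the one from the list.

```java
import java.util.Arrays;

/** Illustrative before/after comparison (hypothetical names). */
final class PauseStats {

    /** Nearest-rank percentile, e.g. pct = 99 for p99. */
    static double percentile(double[] samples, double pct) {
        if (samples.length == 0 || pct <= 0 || pct > 100) {
            throw new IllegalArgumentException("need samples and 0 < pct <= 100");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Accept a change if GC time is under 5%, app p99 improved, and the error rate did not rise. */
    static boolean accept(double gcTimePercent, double baselineP99Ms, double candidateP99Ms,
                          double baselineErrorRate, double candidateErrorRate) {
        return gcTimePercent < 5.0
                && candidateP99Ms < baselineP99Ms
                && candidateErrorRate <= baselineErrorRate;
    }
}
```

Feeding both runs through the same percentile function avoids comparing a dashboard's interpolated p99 against a hand-computed one.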
Alerts and SLOs
- Alert: GC pause max > 1 s (warning), > 5 s (critical).
- Alert: old gen > 85% after full GC for 10 min.
- Dashboard: GC cycles/min, allocation rate, heap after GC vs max.
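In practice these thresholds would live in the alerting system's own rule format; the sketch below only spells out the logic in Java so the boundaries are unambiguous. It assumes pause durations in milliseconds and post-GC old-gen occupancy samples in percent covering the alert window; all names are hypothetical.

```java
/** Illustrative alert logic (hypothetical names) mirroring the thresholds above. */
final class PauseAlert {
    enum Severity { OK, WARNING, CRITICAL }

    /** Max pause above 5 s is critical, above 1 s is a warning. */
    static Severity forMaxPause(double maxPauseMs) {
        if (maxPauseMs > 5_000) {
            return Severity.CRITICAL;
        }
        if (maxPauseMs > 1_000) {
            return Severity.WARNING;
        }
        return Severity.OK;
    }

    /** True only if every post-GC sample in the window is above the threshold. */
    static boolean oldGenSustainedAbove(double[] postGcOccupancyPercent, double thresholdPercent) {
        if (postGcOccupancyPercent.length == 0) {
            return false; // no data: do not fire
        }
        for (double occupancy : postGcOccupancyPercent) {
            if (occupancy <= thresholdPercent) {
                return false;
            }
        }
        return true;
    }
}
```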
Interview one-liner
“I use GC logs to separate young vs mixed vs full pauses, check post-GC heap and allocation rate, reduce allocations in hot paths, then tune G1 pause goals and heap size—or move to ZGC if tail latency is the SLO—measuring app p99 before and after.”
Related scenarios
- Memory leak — rising old gen and escalating mixed GC.
- OutOfMemoryError — GC overhead limit exceeded.
- Slow after a few hours — triage uptime vs latency first.
- High CPU, low traffic — GC threads consuming CPU (coming soon).