Frequent GC pauses hurting performance

Scenario

Latency spikes correlate with GC: p99 API time jumps every few seconds, dashboards show long stop-the-world pauses, or users complain the app “stutters.” CPU may be busy in GC threads. How do you diagnose and tune without only throwing more heap at it?

After reading, you should be able to:

Why — what GC pauses are and what triggers them

Most collectors need stop-the-world (STW) phases: application threads pause while GC marks, relocates, or compacts memory. “Frequent GC pauses” usually means either collections happen too often (many short young GCs) or individual pauses are too long (old-gen / full collections, heap nearly full).

Collectors you see in production (Java 11+)

Collector | Default? | Pause profile | Typical use
G1 | Yes (server) | Target max pause; mixed GC for old gen | General services, 4–32+ GB heap
ZGC | No | Sub-ms to low-ms STW, mostly concurrent | Large heap, strict latency SLO
Shenandoah | No | Low pause, concurrent compaction | Similar niche to ZGC
Parallel | No | Throughput-oriented; longer STW | Batch, not latency-sensitive APIs
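
Before tuning, it helps to confirm which collector the running JVM actually uses. The sketch below maps GarbageCollectorMXBean names to a collector family. The class name CollectorProbe and the name prefixes it matches ("G1", "ZGC", "Shenandoah", "PS ") are assumptions based on names commonly reported by HotSpot builds; verify them against your own JDK.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

final class CollectorProbe {
    /** Maps GC MXBean names to a collector family. Prefixes are assumptions, not a documented contract. */
    static String family(List<String> beanNames) {
        for (String name : beanNames) {
            if (name.startsWith("G1")) return "G1";
            if (name.startsWith("ZGC")) return "ZGC";
            if (name.startsWith("Shenandoah")) return "Shenandoah";
            if (name.startsWith("PS ")) return "Parallel";
        }
        return "other/unknown";
    }

    public static void main(String[] args) {
        // Collect the names of all GC beans registered in this JVM.
        List<String> names = ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());
        System.out.println(names + " -> " + family(names));
    }
}
```

Running it with different -XX:+Use...GC flags is a quick way to check that a flag change in a deployment manifest really took effect.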

Root-cause categories

Symptom vs cause: “200 ms pauses” is a symptom. GC logs tell you which phase (Young GC vs Mixed vs Full) and whether the heap is full of garbage or allocation is too high.

What users feel

What — diagnose with logs and metrics (in order)

  1. Confirm pauses in metrics. Micrometer/Prometheus: jvm.gc.pause sum/count and max pause; correlate with the HTTP latency histogram. JMX: GarbageCollectorMXBean CollectionTime.
  2. Enable unified GC logging (if missing). Java 11+: -Xlog:gc*,safepoint:file=/var/log/gc.log:time,uptime,level,tags:filecount=5,filesize=50M
  3. Classify each pause event. Look for Pause Young (Normal), Pause Young (Prepare Mixed), Pause Young (Mixed) (older JDKs log this as Pause Mixed), and Pause Full. Long Full or Mixed pauses with a nearly full old gen point to sizing or a leak.
  4. Read heap before/after each cycle. Example line pattern: 1200M->800M(1800M), i.e. before->after(capacity). If post-GC old gen never drops, the live set is huge or there is a leak.
  5. Calculate allocation rate. From logs or JFR: MB/sec allocated. High rate + small young gen = constant minor GC.
  6. Check GC time %. If > 10–15% of wall time is spent in GC at moderate load → tuning or code change needed (not always “add RAM”).
  7. Rule out explicit System.gc(). Search the codebase and dependencies; RMI DGC and some caches call it. Disable with -XX:+DisableExplicitGC only if you understand the direct-buffer cleanup implications.
  8. Compare deploy / flag changes. A new -Xmx, a collector switch, or a JDK upgrade can change the pause profile dramatically.
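
Steps 1 and 6 can be approximated in-process with the standard GarbageCollectorMXBean API. This is a minimal sketch, not a replacement for GC logs: the class name GcOverhead and the 10-second sampling window are illustrative choices, and with concurrent collectors CollectionTime for some beans may include concurrent cycle time rather than pure STW pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcOverhead {
    /** Percentage of the interval spent in GC, from two cumulative CollectionTime samples (ms). */
    static double gcTimePercent(long gcMillisStart, long gcMillisEnd, long intervalMillis) {
        if (intervalMillis <= 0) throw new IllegalArgumentException("interval must be positive");
        return 100.0 * (gcMillisEnd - gcMillisStart) / intervalMillis;
    }

    /** Sums CollectionTime over all collectors; undefined values (-1) are skipped. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = bean.getCollectionTime();
            if (t > 0) total += t;
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcMillis();
        long startNanos = System.nanoTime();
        Thread.sleep(10_000); // sampling window; 10 s is an arbitrary choice
        long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
        System.out.printf("GC time: %.2f%% of wall time%n",
                gcTimePercent(before, totalGcMillis(), elapsedMillis));
    }
}
```

For example, 600 ms of accumulated GC time over a 10,000 ms window is 6%, below the 10–15% range that step 6 treats as a warning sign.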

Example GC log lines (G1)

[2026-05-27T10:15:01.234+0000] GC(412) Pause Young (Normal) (G1 Evacuation Pause)
[2026-05-27T10:15:01.234+0000] GC(412) Eden regions: 120->0(100)
[2026-05-27T10:15:01.245+0000] GC(412) Pause Young (Normal) 1800M->650M(2048M) 11.234ms

[2026-05-27T10:18:44.891+0000] GC(418) Pause Young (Mixed) (G1 Evacuation Pause)
[2026-05-27T10:18:45.313+0000] GC(418) Pause Young (Mixed) 1950M->1200M(2048M) 421.556ms   ← long pause, old gen pressure
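
Summary lines like the ones above can be classified mechanically. The following sketch extracts the GC id, pause kind, heap before/after/capacity, and duration with a regular expression. G1PauseParser and its Pause type are hypothetical names, and the pattern assumes sizes are logged in M, which holds for the sample lines but not for every log.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class G1PauseParser {
    /** One completed pause: GC id, pause kind, heap before/after/capacity (MB), duration (ms). */
    static final class Pause {
        final int id;
        final String kind;
        final long beforeMb, afterMb, capacityMb;
        final double millis;

        Pause(int id, String kind, long beforeMb, long afterMb, long capacityMb, double millis) {
            this.id = id; this.kind = kind; this.beforeMb = beforeMb;
            this.afterMb = afterMb; this.capacityMb = capacityMb; this.millis = millis;
        }

        /** FULL, MIXED, YOUNG, or OTHER; "Young (Prepare Mixed)" still counts as YOUNG. */
        String category() {
            if (kind.contains("Full")) return "FULL";
            if (kind.startsWith("Mixed") || kind.contains("(Mixed)")) return "MIXED";
            if (kind.startsWith("Young")) return "YOUNG";
            return "OTHER"; // e.g. Remark or Cleanup pauses
        }

        /** Fraction of heap capacity still occupied after the collection. */
        double occupancyAfter() { return (double) afterMb / capacityMb; }
    }

    // Matches only summary lines, e.g. "GC(412) Pause Young (Normal) 1800M->650M(2048M) 11.234ms".
    // Assumption: sizes carry an M suffix; K or G suffixes would need extra handling.
    private static final Pattern SUMMARY = Pattern.compile(
            "GC\\((\\d+)\\) Pause (.+?) (\\d+)M->(\\d+)M\\((\\d+)M\\) (\\d+(?:\\.\\d+)?)ms");

    static Optional<Pause> parse(String line) {
        Matcher m = SUMMARY.matcher(line);
        if (!m.find()) return Optional.empty(); // start lines and region details have no summary
        return Optional.of(new Pause(Integer.parseInt(m.group(1)), m.group(2),
                Long.parseLong(m.group(3)), Long.parseLong(m.group(4)),
                Long.parseLong(m.group(5)), Double.parseDouble(m.group(6))));
    }

    public static void main(String[] args) {
        String line = "GC(412) Pause Young (Normal) 1800M->650M(2048M) 11.234ms";
        parse(line).ifPresent(p -> System.out.println(
                p.category() + " " + p.millis + " ms, heap after GC: " + p.afterMb + "M"));
    }
}
```

Feeding every line of gc.log through parse and grouping by category() gives the young / mixed / full split that the rest of the diagnosis relies on.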

Tools

Tool | Use for
GCViewer / GCEasy | Upload gc.log; pause charts, throughput
jstat | jstat -gcutil <pid> 1000 prints live S0/S1/E/O/M/CCS utilization (%) plus YGC and FGC counts every second
JFR | JDK Mission Control: allocation and GC cause
async-profiler | Find hot allocation sites driving GC frequency

Decision tree (quick)

Long Pause Full GC?
  yes → leak / heap too small / metaspace → heap dump, sizing
  no → Long Mixed GC?
    yes → reduce live set, tune IHOP / region size, or more heap
    no → Many short Young GC?
      yes → reduce allocation rate or increase young gen (G1 auto) / tune pause goal
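
The same tree can be written as a small function, which is handy when triage is automated from parsed log statistics. This is only a sketch: GcTriage is a hypothetical name, the boolean inputs stand in for whatever thresholds you define for "long" and "many", and the final fallback message is an assumption because the tree above does not spell out that branch.

```java
final class GcTriage {
    /** Walks the quick decision tree: Full GC first, then Mixed, then frequent Young. */
    static String nextStep(boolean longFullGc, boolean longMixedGc, boolean manyShortYoungGc) {
        if (longFullGc) {
            return "leak / heap too small / metaspace: take a heap dump, review sizing";
        }
        if (longMixedGc) {
            return "reduce live set, tune IHOP / region size, or add heap";
        }
        if (manyShortYoungGc) {
            return "reduce allocation rate, increase young gen, or tune pause goal";
        }
        // Assumed fallback; not part of the original tree.
        return "no GC pattern matched: re-check metrics before tuning";
    }

    public static void main(String[] args) {
        System.out.println(nextStep(false, true, true));
    }
}
```

The order of the checks matters: a long Full GC takes priority even when young collections are also frequent.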

How — tune, fix code, choose collector

Step 1 — Reduce allocation (often best ROI)

Step 2 — Size the heap correctly

Goal: after full GC, old gen has 30–50% free at steady state. Too small → constant mixed GC; too large → longer pauses and wasted RAM.

# Starting point for G1 service (adjust with logs)
-Xms4g -Xmx4g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200

Container: keep -Xmx below cgroup limit with headroom (see OOM guide).
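
The 30–50% free goal translates into simple arithmetic once you know the live set (occupancy after a full GC). The sketch below applies the rule to the heap as a whole, which is a simplification of the old-gen goal stated above. HeapSizing is a hypothetical name and the 1200 MB live set is an illustrative figure, not a measurement.

```java
final class HeapSizing {
    /** Smallest heap (MB) that leaves freePercent of it empty for a given live set. */
    static long heapMbFor(long liveSetMb, int freePercent) {
        if (liveSetMb <= 0 || freePercent < 0 || freePercent >= 100) {
            throw new IllegalArgumentException("need liveSetMb > 0 and 0 <= freePercent < 100");
        }
        int usedPercent = 100 - freePercent;
        // Ceiling division in integer math: heap * usedPercent / 100 >= liveSet.
        return (liveSetMb * 100 + usedPercent - 1) / usedPercent;
    }

    public static void main(String[] args) {
        long liveSetMb = 1200; // hypothetical post-full-GC occupancy
        for (int free : new int[] {30, 40, 50}) {
            System.out.println(free + "% free -> -Xmx" + heapMbFor(liveSetMb, free) + "m");
        }
    }
}
```

For a 1200 MB live set this yields 1715 MB, 2000 MB, and 2400 MB for 30%, 40%, and 50% free. Treat the result as a starting point and confirm it against GC logs under load.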

Step 3 — G1 tuning knobs (when logs show mixed GC pain)

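Flag | Purpose | Caution
-XX:MaxGCPauseMillis | Soft target for pause length | Too low → excessive GC cycles
-XX:InitiatingHeapOccupancyPercent (IHOP) | Heap occupancy at which concurrent marking starts, after which mixed GCs follow (default 45%; adjusted adaptively unless -XX:-G1UseAdaptiveIHOP is set) | Lower → earlier marking and mixed GC, more concurrent work
-XX:G1HeapRegionSize | Region size (1–32 MB, power of two) | Affects humongous threshold (objects of half a region or more)
-XX:G1ReservePercent | Reserve against to-space exhaustion (default 10%) | Rarely changed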

Change one knob at a time; re-run load test; compare gc.log pause p99 and application p99.
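
Comparing pause p99 before and after a change needs a consistent percentile definition. The sketch below uses the nearest-rank method on pause durations in milliseconds; PausePercentile is a hypothetical name, and other tools may interpolate and report slightly different values.

```java
import java.util.Arrays;

final class PausePercentile {
    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static double percentile(double[] samples, int p) {
        if (samples.length == 0 || p < 1 || p > 100) {
            throw new IllegalArgumentException("need at least one sample and 1 <= p <= 100");
        }
        double[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (p * sorted.length + 99) / 100; // ceil(p * n / 100) in integer math
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        double[] pausesMs = {11.2, 9.8, 421.6, 12.0, 10.5}; // illustrative values
        System.out.println("p50=" + percentile(pausesMs, 50) + " p99=" + percentile(pausesMs, 99));
    }
}
```

With only a handful of samples the p99 is simply the worst pause, so collect enough pauses per run (a soak test rather than a few minutes) before comparing two configurations.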

Step 4 — When to switch collector

Step 5 — Fix leaks and humongous objects

Verify improvement

  1. Baseline: GC pause p99, GC % time, API p99 under fixed load test.
  2. Apply one change; soak 30+ minutes.
  3. Accept if GC time < 5% and app p99 improved without raising error rate.
  4. Document final flags in Helm chart / deployment manifest.
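
The acceptance rule in step 3 can be encoded so that a load-test pipeline applies it the same way every time. This is a minimal sketch; TuningVerdict and its parameter names are hypothetical, and the inputs are assumed to come from the baseline and candidate runs of the same fixed load test.

```java
final class TuningVerdict {
    /** Accept when GC time is under 5%, app p99 improved, and the error rate did not rise. */
    static boolean accept(double gcTimePercent,
                          double baselineP99Ms, double candidateP99Ms,
                          double baselineErrorRate, double candidateErrorRate) {
        return gcTimePercent < 5.0
                && candidateP99Ms < baselineP99Ms
                && candidateErrorRate <= baselineErrorRate;
    }

    public static void main(String[] args) {
        // Illustrative numbers only.
        System.out.println(accept(3.2, 480.0, 210.0, 0.001, 0.001));
    }
}
```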

Alerts and SLOs

Interview one-liner

“I use GC logs to separate young vs mixed vs full pauses, check post-GC heap and allocation rate, reduce allocations in hot paths, then tune G1 pause goals and heap size—or move to ZGC if tail latency is the SLO—measuring app p99 before and after.”

Related scenarios