Frequent GC pauses hurting performance

Scenario

Latency spikes correlate with GC: p99 API time jumps every few seconds, dashboards show long stop-the-world pauses, or users complain the app “stutters.” CPU may be busy in GC threads. How do you diagnose and tune without only throwing more heap at it?

After reading, you should be able to:

Read unified GC logs and separate young vs full collections.
Link pauses to allocation rate, heap size, and live set—not guess flags.
Tune G1 (or choose ZGC/Shenandoah) with measurable before/after metrics.
Know when pauses mean a memory leak or imminent OOM.

Why — what GC pauses are and what triggers them

Most collectors need stop-the-world (STW) phases: application threads pause while GC marks, relocates, or compacts memory. “Frequent GC pauses” usually means either collections happen too often (many short young GCs) or individual pauses are too long (old-gen / full collections, heap nearly full).

Collectors you see in production (Java 11+)

Collector	Default?	Pause profile	Typical use
G1	Yes (server)	Target max pause; mixed GC for old gen	General services, 4–32+ GB heap
ZGC	No	Sub-ms to low-ms STW, mostly concurrent	Large heap, strict latency SLO
Shenandoah	No	Low pause, concurrent compaction	Similar niche to ZGC
Parallel	No	Throughput-oriented; longer STW	Batch, not latency-sensitive APIs

Root-cause categories

High allocation rate — young gen fills fast → many minor GCs; CPU spent in GC; “GC overhead” rises.
Heap too small for live set — old gen fills → frequent mixed/full GCs; each pause reclaims little.
Memory leak or unbounded cache — live set grows → GC works harder; pauses lengthen until OOM. See memory leak.
Humongous objects (G1) — arrays > 50% of G1 region size bypass young gen; extra mixed cycles.
Aggressive pause goal — -XX:MaxGCPauseMillis=50 with huge heap → G1 collects more often, throughput drops.
Full GC events — metaspace, System.gc(), heap exhaustion, or collector fallback → multi-second pauses.
Oversized heap without need — very large G1 regions can make individual mixed GCs longer (tune region size).

Symptom vs cause: “200 ms pauses” is a symptom. GC logs tell you which phase (Young GC vs Mixed vs Full) and whether the heap is full of garbage or allocation is too high.
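The two questions above (is the heap mostly garbage, or is allocation too high?) reduce to simple arithmetic on the heap figures of consecutive pause lines. The sketch below is illustrative only: the class and method names are made up for this example, and the figures in main are hypothetical.

```java
/** Sketch: two numbers that separate "too much allocation" from "too much live data". */
final class PauseArithmetic {

    /** MB allocated per second between the end of one GC and the start of the next. */
    static double allocationRateMbPerSec(long heapAfterPrevGcMb, long heapBeforeNextGcMb,
                                         double secondsBetween) {
        if (secondsBetween <= 0) {
            throw new IllegalArgumentException("secondsBetween must be positive");
        }
        return (heapBeforeNextGcMb - heapAfterPrevGcMb) / secondsBetween;
    }

    /** Fraction of the occupied heap that one collection reclaimed (0.0 to 1.0). */
    static double reclaimedFraction(long heapBeforeMb, long heapAfterMb) {
        if (heapBeforeMb <= 0) {
            return 0.0;
        }
        return (double) (heapBeforeMb - heapAfterMb) / heapBeforeMb;
    }

    public static void main(String[] args) {
        // Hypothetical: 650 MB after one young GC, 1800 MB before the next, 2.5 s apart.
        System.out.printf("allocation rate: %.0f MB/s%n", allocationRateMbPerSec(650, 1800, 2.5));
        System.out.printf("reclaimed: %.0f%%%n", 100 * reclaimedFraction(1800, 650));
    }
}
```

A high allocation rate with a high reclaimed fraction points at allocation pressure; a low reclaimed fraction points at a large or growing live set.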

What users feel

API p99/p999 spikes aligned with GC timestamps in logs.
Thread pools queue up during STW—all workers frozen together.
Kafka consumers are evicted from the group if a pause exceeds session.timeout.ms or pushes processing past max.poll.interval.ms.
Health checks fail if pause > probe timeout.

What — diagnose with logs and metrics (in order)

Confirm pauses in metrics. Micrometer/Prometheus: jvm.gc.pause sum/count and max pause; correlate with the HTTP latency histogram. JMX: GarbageCollectorMXBean CollectionTime.
Enable unified GC logging (if missing). Java 11+: -Xlog:gc*,safepoint:file=/var/log/gc.log:time,uptime,level,tags:filecount=5,filesize=50M
Classify each pause event. Look for Pause Young (Normal), Pause Young (Prepare Mixed), Pause Young (Mixed) (printed as Pause Mixed on JDK 9/10), and Pause Full. Long mixed or full pauses with a nearly full old gen point to sizing or a leak.
Read heap before/after each cycle. Example line pattern: 1200M->800M(1800M), meaning occupancy before -> after (capacity). If the post-GC figure never drops, the live set is huge or there is a leak.
Calculate allocation rate. From logs or JFR: MB/sec allocated. High rate + small young gen = constant minor GC.
Check GC time %. If more than 10–15% of wall time goes to GC at moderate load, tuning or a code change is needed (not always “add RAM”).
Rule out explicit System.gc(). Search codebase and dependencies; RMI DGC and some caches call it. Disable with -XX:+DisableExplicitGC only if you understand direct-buffer cleanup implications.
Compare deploy / flag changes. A new -Xmx, a collector switch, or a JDK upgrade can change the pause profile dramatically.
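For the JMX route and the GC time % check, the standard java.lang.management beans are enough for a first look. This is a minimal sketch, not a monitoring solution: the class name is made up, it measures since JVM start rather than over a sliding window, and on mostly concurrent collectors some beans may report cycle time rather than pause time, so treat the result as an upper bound.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Sketch: share of JVM uptime that the collector MXBeans report as GC time. */
final class GcOverhead {

    /** Percentage of the elapsed window spent in GC; 0 when the window is empty. */
    static double gcTimePercent(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / elapsedMillis;
    }

    /** Sums getCollectionTime() over all collector beans, skipping undefined (-1) values. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = bean.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%d ms%n",
                    bean.getName(), bean.getCollectionCount(), bean.getCollectionTime());
        }
        System.out.printf("GC time since start: %.2f%%%n",
                gcTimePercent(totalGcMillis(), uptimeMillis));
    }
}
```

Sampling totalGcMillis() at two points in time and dividing the difference by the interval gives the same percentage for a specific load window.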

Example GC log lines (G1)

[2026-05-27T10:15:01.234+0000] GC(412) Pause Young (Normal) (G1 Evacuation Pause)
[2026-05-27T10:15:01.234+0000] GC(412) Eden regions: 120->0(100)
[2026-05-27T10:15:01.245+0000] GC(412) Pause Young (Normal) 1800M->650M(2048M) 11.234ms

[2026-05-27T10:18:44.891+0000] GC(418) Pause Young (Mixed)
[2026-05-27T10:18:45.313+0000] GC(418) Pause Young (Mixed) 1950M->1200M(2048M) 421.556ms   ← long pause, old gen pressure
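Classifying pauses by hand gets tedious on a large log. A rough sketch of a parser for the pause summary lines shown above follows. Assumptions: heap figures are printed in M, the class name and Kind enum are invented for this example, and both the JDK 11+ label (Pause Young (Mixed)) and the older Pause Mixed label are treated as mixed collections. For real analysis, GCViewer or GCEasy handle far more line variants.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: parse the summary line of a G1 pause from a unified GC log. */
final class GcPauseLine {
    enum Kind { YOUNG, MIXED, FULL, OTHER }

    // Matches e.g. "GC(412) Pause Young (Normal) 1800M->650M(2048M) 11.234ms".
    // Start-of-pause lines carry no heap figures and therefore do not match.
    private static final Pattern SUMMARY = Pattern.compile(
            "GC\\((\\d+)\\) Pause (.+?) (\\d+)M->(\\d+)M\\((\\d+)M\\) (\\d+(?:\\.\\d+)?)ms");

    final long gcId;
    final Kind kind;
    final long beforeMb;
    final long afterMb;
    final long capacityMb;
    final double pauseMs;

    private GcPauseLine(long gcId, Kind kind, long beforeMb, long afterMb,
                        long capacityMb, double pauseMs) {
        this.gcId = gcId;
        this.kind = kind;
        this.beforeMb = beforeMb;
        this.afterMb = afterMb;
        this.capacityMb = capacityMb;
        this.pauseMs = pauseMs;
    }

    /** Returns the parsed pause, or empty if the line is not a pause summary. */
    static Optional<GcPauseLine> parse(String line) {
        Matcher m = SUMMARY.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new GcPauseLine(
                Long.parseLong(m.group(1)), classify(m.group(2)),
                Long.parseLong(m.group(3)), Long.parseLong(m.group(4)),
                Long.parseLong(m.group(5)), Double.parseDouble(m.group(6))));
    }

    /** Maps the text after "Pause " to a coarse pause kind. */
    static Kind classify(String label) {
        if (label.startsWith("Full")) {
            return Kind.FULL;
        }
        if (label.contains("Mixed") && !label.contains("Prepare Mixed")) {
            return Kind.MIXED;
        }
        if (label.startsWith("Young")) {
            return Kind.YOUNG;
        }
        return Kind.OTHER; // e.g. Remark, Cleanup
    }

    public static void main(String[] args) {
        String line = "[2026-05-27T10:15:01.245+0000] GC(412) Pause Young (Normal) "
                + "1800M->650M(2048M) 11.234ms";
        parse(line).ifPresent(p -> System.out.printf("GC(%d) %s %dM->%dM of %dM in %.3f ms%n",
                p.gcId, p.kind, p.beforeMb, p.afterMb, p.capacityMb, p.pauseMs));
    }
}
```

Feeding every log line through parse and grouping by kind gives pause counts and maxima per category, which is the input the rest of this guide works from.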

Tools

Tool	Use for
GCViewer / GCEasy	Upload gc.log; pause charts, throughput
jstat	`jstat -gcutil <pid> 1000`: live S0/S1/E/O/M/CCS utilization plus YGC/FGC counts and times
JFR	JDK Mission Control — allocation and GC cause
async-profiler	Find hot allocation sites driving GC frequency

Decision tree (quick)

Long Pause Full GC?
  yes → leak / heap too small / metaspace → heap dump, sizing
  no → Long Mixed GC?
    yes → reduce live set, tune IHOP / region size, or more heap
    no → Many short Young GC?
      yes → reduce allocation rate or increase young gen (G1 auto) / tune pause goal

How — tune, fix code, choose collector

Step 1 — Reduce allocation (often best ROI)

Reuse buffers; avoid allocating large byte[] per request.
Prefer streaming JSON/parsing over loading full body into memory.
Fix hot-path logging that builds message strings even when the level is disabled; use parameterized messages, suppliers, or level guards.
Profile with JFR/async-profiler alloc — top frames are your target.
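Two of the habits above, buffer reuse and lazy log messages, can be sketched with JDK classes only. The class name, buffer size, and use of java.util.logging are assumptions for illustration; the same idea applies to SLF4J-style parameterized logging. A per-thread buffer trades allocation for retained memory, so it suits bounded thread pools better than very large thread counts.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.logging.Logger;

/** Sketch: two allocation-saving habits for hot paths (names are illustrative). */
final class HotPath {
    private static final Logger LOG = Logger.getLogger(HotPath.class.getName());

    // One 8 KiB scratch buffer per thread instead of a fresh byte[] per request.
    private static final ThreadLocal<byte[]> SCRATCH =
            ThreadLocal.withInitial(() -> new byte[8192]);

    /** Returns the calling thread's reusable buffer. */
    static byte[] scratch() {
        return SCRATCH.get();
    }

    /** Streams the input through the reused buffer; returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = scratch();
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        final long copied = total;
        // The supplier runs only if FINE is enabled, so no string is built otherwise.
        LOG.fine(() -> "copied " + copied + " bytes");
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] body = new byte[20_000];
        long n = copy(new ByteArrayInputStream(body), new ByteArrayOutputStream());
        System.out.println(n + " bytes copied");
    }
}
```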

Step 2 — Size the heap correctly

Goal: after full GC, old gen has 30–50% free at steady state. Too small → constant mixed GC; too large → longer pauses and wasted RAM.

# Starting point for G1 service (adjust with logs)
-Xms4g -Xmx4g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200

Container: keep -Xmx below cgroup limit with headroom (see OOM guide).
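The 30–50% free goal translates into a quick sizing calculation once the live set is known from post-GC heap figures. A minimal sketch under stated assumptions: the class name is invented, the 2400 MB live set is hypothetical, and the rule is applied to the whole heap as a simplification of the old-gen goal.

```java
/** Sketch: heap size that leaves a target fraction free above the live set. */
final class HeapSizing {

    /** Heap (MB) such that targetFreeFraction of it stays free once liveSetMb survives GC. */
    static long heapForLiveSetMb(long liveSetMb, double targetFreeFraction) {
        if (targetFreeFraction <= 0.0 || targetFreeFraction >= 1.0) {
            throw new IllegalArgumentException("targetFreeFraction must be between 0 and 1");
        }
        return (long) Math.ceil(liveSetMb / (1.0 - targetFreeFraction));
    }

    public static void main(String[] args) {
        long liveSetMb = 2400; // hypothetical post-GC occupancy at steady state
        System.out.println("30% free: " + heapForLiveSetMb(liveSetMb, 0.30) + " MB");
        System.out.println("50% free: " + heapForLiveSetMb(liveSetMb, 0.50) + " MB");
    }
}
```

With these hypothetical numbers the range is roughly 3.4 to 4.7 GB, which is how a starting point like -Xmx4g would be justified from logs rather than guessed.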

Step 3 — G1 tuning knobs (when logs show mixed GC pain)

Flag	Purpose	Caution
`-XX:MaxGCPauseMillis`	Soft target for pause length	Too low → excessive GC cycles
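`-XX:InitiatingHeapOccupancyPercent` (IHOP)	Heap occupancy at which G1 starts the concurrent marking cycle that precedes mixed GCs (default 45; adjusted adaptively unless `-XX:-G1UseAdaptiveIHOP` is set)	Lower → earlier marking, more concurrent work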
`-XX:G1HeapRegionSize`	Region size (power of two; 1–32 MB through JDK 17, larger values allowed on newer releases)	Affects humongous threshold
`-XX:G1ReservePercent`	Reserve against to-space exhaustion	Rarely changed

Change one knob at a time; re-run load test; compare gc.log pause p99 and application p99.

Step 4 — When to switch collector

ZGC or Shenandoah — heap multi-GB, p99 latency SLO < 10 ms, team can run on JDK 17+ and validate.
Example: -XX:+UseZGC -Xmx8g (flags vary by JDK version; check docs for your release).
Stay on G1 if throughput and simplicity matter more than tail latency.

Step 5 — Fix leaks and humongous objects

Rising old gen + long mixed GC → leak hunt before more tuning.
Split large arrays or increase G1 region size so objects are not “humongous.”
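To judge whether an array will be humongous, it helps to estimate the region size first. The sketch below approximates G1's default ergonomics (heap divided by about 2048, rounded down to a power of two, clamped to 1–32 MB) and assumes -Xms equals -Xmx. Treat it as an estimate and confirm against the region size the JVM prints in the GC log at startup; the class name is made up for this example.

```java
/** Sketch: estimate G1 region size and the humongous threshold derived from it. */
final class G1Regions {
    static final long MB = 1024L * 1024L;

    /** Approximate default region size: heap / 2048, power of two, clamped to 1-32 MB. */
    static long defaultRegionSizeBytes(long heapBytes) {
        long target = Math.max(heapBytes / 2048, MB);
        long powerOfTwo = Long.highestOneBit(target); // round down to a power of two
        return Math.min(powerOfTwo, 32 * MB);
    }

    /** An object larger than half a region is humongous (size includes the object header). */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    public static void main(String[] args) {
        long region = defaultRegionSizeBytes(4096 * MB); // -Xms4g -Xmx4g
        System.out.println("region size: " + region / MB + " MB");
        System.out.println("1.5 MB array humongous? " + isHumongous(3 * MB / 2, region));
        System.out.println("same array with 4 MB regions? " + isHumongous(3 * MB / 2, 4 * MB));
    }
}
```

On a 4 GB heap the estimate is 2 MB regions, so a 1.5 MB array is humongous; raising the region size to 4 MB, or chunking the array, avoids that.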

Verify improvement

Baseline: GC pause p99, GC % time, API p99 under fixed load test.
Apply one change; soak 30+ minutes.
Accept if GC time < 5% and app p99 improved without raising error rate.
Document final flags in Helm chart / deployment manifest.
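The before/after comparison is easy to script once pause durations have been extracted from gc.log. A small sketch, assuming a nearest-rank percentile and the acceptance rule stated above; the class name, method names, and sample values are illustrative only.

```java
import java.util.Arrays;

/** Sketch: pause percentile and the acceptance rule for a tuning change. */
final class PauseStats {

    /** Nearest-rank percentile (0 < p <= 100) of the given samples. */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** GC time below 5%, application p99 improved, error rate not higher. */
    static boolean accept(double gcTimePercent, double appP99BeforeMs, double appP99AfterMs,
                          double errorRateBefore, double errorRateAfter) {
        return gcTimePercent < 5.0
                && appP99AfterMs < appP99BeforeMs
                && errorRateAfter <= errorRateBefore;
    }

    public static void main(String[] args) {
        double[] pausesMs = {11.2, 9.8, 421.6, 12.0, 10.5}; // hypothetical sample
        System.out.println("pause p99: " + percentile(pausesMs, 99) + " ms");
        System.out.println("accept change: " + accept(3.2, 180.0, 120.0, 0.001, 0.001));
    }
}
```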

Alerts and SLOs

Alert: GC pause max > 1 s (warning), > 5 s (critical).
Alert: old gen > 85% after full GC for 10 min.
Dashboard: GC cycles/min, allocation rate, heap after GC vs max.
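Where no metrics pipeline exists yet, the pause thresholds above can be checked in-process through JMX GC notifications. This is a sketch, not a replacement for Prometheus alerting: it relies on the HotSpot-specific com.sun.management API, the class name and stderr output are placeholders, and for concurrent collectors the reported duration of some beans covers a whole cycle rather than an STW pause.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;

/** Sketch: flag long collections using JMX GC notifications. */
final class GcPauseAlerts {
    enum Severity { OK, WARNING, CRITICAL }

    /** Thresholds from the alert rules: above 1 s is a warning, above 5 s is critical. */
    static Severity severity(long durationMs) {
        if (durationMs > 5_000) {
            return Severity.CRITICAL;
        }
        if (durationMs > 1_000) {
            return Severity.WARNING;
        }
        return Severity.OK;
    }

    /** Registers a listener on every collector bean that emits notifications. */
    static void install() {
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (!(bean instanceof NotificationEmitter)) {
                continue;
            }
            ((NotificationEmitter) bean).addNotificationListener((notification, handback) -> {
                if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(notification.getType())) {
                    return;
                }
                GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                        .from((CompositeData) notification.getUserData());
                long ms = info.getGcInfo().getDuration();
                Severity level = severity(ms);
                if (level != Severity.OK) {
                    // Placeholder sink: replace with your logger or metrics client.
                    System.err.printf("%s: %s (%s, cause %s) took %d ms%n",
                            level, info.getGcName(), info.getGcAction(), info.getGcCause(), ms);
                }
            }, null, null);
        }
    }

    public static void main(String[] args) {
        install();
        System.out.println("1500 ms -> " + severity(1500));
    }
}
```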

Interview one-liner

“I use GC logs to separate young vs mixed vs full pauses, check post-GC heap and allocation rate, reduce allocations in hot paths, then tune G1 pause goals and heap size—or move to ZGC if tail latency is the SLO—measuring app p99 before and after.”

Related scenarios

Memory leak — rising old gen and escalating mixed GC.
OutOfMemoryError — GC overhead limit exceeded.
Slow after a few hours — triage uptime vs latency first.
High CPU, low traffic — GC threads consuming CPU (coming soon).