Service randomly unresponsive for short windows

Scenario

Monitoring shows healthy averages, but users report brief freezes: timeouts for 5–30 seconds, then recovery. p99 latency spikes while p50 looks fine. The issue does not line up with a deploy or with sustained overload, so you need to look for causes that produce intermittent stalls inside the JVM or on the host.

After reading, you should be able to:

List common “micro-outage” causes: GC STW, pool saturation, retries, cgroup throttle, cron spikes.
Correlate latency spikes with JVM, OS, and Kubernetes signals on a timeline.
Capture evidence during spikes (JFR, continuous profiling) when manual dumps miss the window.
Mitigate with tuning, backpressure, and isolation—not only “add replicas.”

Why — short stalls hide in averages

A random unresponsive window means the process (or pod) temporarily cannot accept or complete work fast enough. Clients see timeouts; load balancers may mark instances unhealthy. Because the window is short, mean CPU and mean latency look normal—you must look at max, p99.9, and event-aligned metrics.
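To make the point about averages concrete, here is a minimal sketch that compares the mean with tail percentiles for one window of request latencies. The sample values are invented for illustration (985 requests at 20 ms and 15 requests caught in a 5 s stall), and `LatencySummary` is a hypothetical helper, not part of any library.

```java
import java.util.Arrays;

// Illustrative only: the latency values below are made up to show the effect.
final class LatencySummary {

    /** Arithmetic mean of the samples, in milliseconds. */
    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0.0);
    }

    /** Nearest-rank percentile: the smallest sample with at least q% of samples at or below it. */
    static long percentile(long[] samplesMs, double q) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // 985 fast requests at 20 ms plus 15 requests that hit a 5 s stall.
        long[] samples = new long[1000];
        Arrays.fill(samples, 20L);
        for (int i = 0; i < 15; i++) {
            samples[i * 60] = 5_000L;
        }
        System.out.printf("mean=%.1f ms  p50=%d ms  p99=%d ms  max=%d ms%n",
                mean(samples),
                percentile(samples, 50),
                percentile(samples, 99),
                Arrays.stream(samples).max().getAsLong());
    }
}
```

With this data the median stays at the fast-path value while p99 lands on the stalled requests. The mean moves only modestly, and it moves even less when the same 15 slow requests are averaged over a longer window.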

Internal causes (inside your service)

Cause	Typical duration	Signal
Stop-the-world GC	100ms – several s	`jvm.gc.pause` spike — see GC pauses
Full GC / heap pressure	Seconds	Heap occupancy high before the spike; used heap drops after the collection
Thread pool full	Until threads free	All workers busy — pool exhausted
Lock convoy	Bursts	Many BLOCKED threads in dump
Scheduled job on request threads	Aligned to cron	Spike at :00, :15; CPU bump
Class loading / JIT deopt	Sub-second to seconds	First traffic after deploy
Safepoint sync	Variable	Long time-to-safepoint in JFR

Infrastructure & dependency causes

CPU throttling (Kubernetes CPU limit enforced through the CFS quota; see nr_throttled in the cgroup's cpu.stat): JVM threads are runnable but starved.
Noisy neighbor on the node — steal time, disk latency.
Retry storm — clients retry together; brief overload amplifies.
DB or cache latency spike — all request threads wait on I/O.
DNS / connection setup — periodic resolver slowness or TLS handshake burst.
Pod restart / readiness flap — liveness too aggressive during GC.
Garbage collection on dependent service — visible as downstream timeout in traces.
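CPU throttling can also be checked from inside the container by reading the cgroup's `cpu.stat` file. The sketch below parses its `key value` lines and computes the share of CFS periods that were throttled. The default path `/sys/fs/cgroup/cpu.stat` is an assumption that fits a typical cgroup v2 container; on cgroup v1 the file lives under the `cpu` controller directory and reports `throttled_time` instead of `throttled_usec`. `CpuThrottleStat` is a hypothetical helper name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CpuThrottleStat {

    /** Parses the "key value" lines of a cgroup cpu.stat file; malformed lines are skipped. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> stats = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue;
            }
            try {
                stats.put(parts[0], Long.parseLong(parts[1]));
            } catch (NumberFormatException ignored) {
                // Not a numeric counter; ignore it.
            }
        }
        return stats;
    }

    /** Fraction of CFS periods in which the cgroup was throttled (0.0 if no periods recorded). */
    static double throttledRatio(Map<String, Long> stats) {
        long periods = stats.getOrDefault("nr_periods", 0L);
        long throttled = stats.getOrDefault("nr_throttled", 0L);
        return periods == 0 ? 0.0 : (double) throttled / periods;
    }

    public static void main(String[] args) throws IOException {
        // Assumed location for cgroup v2 inside a container; pass another path as the first argument.
        Path statFile = Path.of(args.length > 0 ? args[0] : "/sys/fs/cgroup/cpu.stat");
        if (!Files.isReadable(statFile)) {
            System.out.println("cpu.stat not readable at " + statFile);
            return;
        }
        Map<String, Long> stats = parse(Files.readAllLines(statFile));
        System.out.printf("nr_periods=%d nr_throttled=%d ratio=%.3f%n",
                stats.getOrDefault("nr_periods", 0L),
                stats.getOrDefault("nr_throttled", 0L),
                throttledRatio(stats));
    }
}
```

The counters are cumulative since the cgroup was created, so a spike shows up as a jump between two samples rather than in the absolute ratio. Sampling every few seconds and diffing is one way to line the result up with latency.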

Not the same as gradual slowdown. Slow after hours is often leak or queue growth. Random windows point to episodic events: GC, cron, throttle, synchronized bursts.

What — correlate spikes on a timeline

Confirm the symptom shape — latency heatmap or p99.9; error rate blips; timeout logs clustered in 10–60s bands.
Align timestamps (UTC) — app metrics, JVM, node, kube events, deploys, cron schedules, traffic shifts.
Check GC first (fast win)
```
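# Prometheus examples (Micrometer metric names)
rate(jvm_gc_pause_seconds_sum[1m])              # seconds spent in GC pauses per second of wall time
jvm_gc_pause_seconds_max                        # longest recent pause, per action/cause label
rate(jvm_gc_memory_promoted_bytes_total[1m])    # promotion rate into the old generation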
```
Pause > 200ms aligned with user reports → tune the collector or resize the heap (see the GC guide).
CPU throttling vs busy — high container_cpu_cfs_throttled_seconds_total with moderate cpu_usage → limit too low or burst workload.
Thread pool / connections — Tomcat busy = max; Hikari pending > 0 during spike — pool guide.
Distributed trace sample — slow requests during window: one long span (DB? external?) vs gap with no child spans (STW or thread starvation).
Capture during spike (hard but decisive)
- JFR: keep a continuous, low-overhead recording running; after the incident, inspect the GC pause, monitor-blocked, and thread-park events around the spike.
- Thread dump: if spikes last longer than 10s, trigger an automated dump from the p99 alert.
- async-profiler: capture a 60s wall-clock profile when the alert fires.
Rule out external “random” — single AZ, single pod, single node? If yes → node or pod issue; if all pods → dependency or deploy.
Health checks — liveness killing pods during long GC? Readiness removing from LB while still starting?
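For the automated thread dump mentioned in the capture step, one option is to take the dump in-process through the standard `ThreadMXBean` and reduce it to a count per thread state, which is cheap to log on every alert. This is a minimal sketch; `ThreadStateSnapshot` is a hypothetical name, and how the alert triggers the call is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateSnapshot {

    /** Counts live threads by state (RUNNABLE, BLOCKED, WAITING, ...). */
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip locked-monitor and synchronizer details to keep the dump light.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Log this alongside the alert timestamp; a jump in BLOCKED or WAITING is the signal.
        System.out.println(countByState());
    }
}
```

A sudden rise in BLOCKED threads points toward lock contention, while many threads in WAITING or TIMED_WAITING on pool or socket frames points toward pool exhaustion or a slow dependency. If the snapshot itself is delayed by seconds, that delay is worth recording too, since it suggests the whole JVM was stalled.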

Decision tree (quick)

Spike aligns with jvm_gc_pause?        → GC tuning / heap / collector
Spike aligns with cpu throttled?       → Raise CPU limit or reduce burst work
Spike aligns with cron (:00)?          → Move job off request pool / separate deployment
All threads in JDBC/socket read?       → Dependency slow or pool too small
Gap in trace, no spans, JVM idle CPU?  → STW GC or safepoint / OS freeze
Only one pod / node?                   → Noisy neighbor, disk, hardware

What to log when it happens again

Request id, pod name, node, GC pause at start/end of request (Micrometer timer on GC events).
Active threads, queue depth, Hikari pending.
Downstream dependency latency per call.
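One way to attach GC activity to a request log line is to snapshot the cumulative collector counters at the start and end of the request and log the difference. The sketch below uses only the standard `GarbageCollectorMXBean` API; `GcDelta` and `Totals` are hypothetical names. Two caveats, both assumptions to verify for your collector: the counters are process-wide, so the delta includes collections triggered by other requests, and for concurrent collectors the reported time is not purely stop-the-world pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcDelta {

    /** Cumulative collection count and elapsed time, summed over all collectors. */
    static final class Totals {
        final long count;
        final long timeMs;

        Totals(long count, long timeMs) {
            this.count = count;
            this.timeMs = timeMs;
        }

        Totals minus(Totals earlier) {
            return new Totals(count - earlier.count, timeMs - earlier.timeMs);
        }
    }

    static Totals snapshot() {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(gc.getCollectionCount(), 0);
            timeMs += Math.max(gc.getCollectionTime(), 0);
        }
        return new Totals(count, timeMs);
    }

    public static void main(String[] args) {
        Totals before = snapshot();

        // Stand-in for request handling: churn through some short-lived allocations.
        byte[][] junk = new byte[64][];
        for (int i = 0; i < 10_000; i++) {
            junk[i % junk.length] = new byte[64 * 1024];
        }

        Totals during = snapshot().minus(before);
        System.out.printf("gcCount=%d gcTimeMs=%d%n", during.count, during.timeMs);
    }
}
```

In a real service the two `snapshot()` calls would sit in a request filter or interceptor, and the delta would be added to the structured log entry next to the request id and pod name.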

How — reduce stalls and prove improvement

Mitigations by cause

Cause	Fix
Long GC pauses	Right-size heap; G1/ZGC tuning; reduce allocation; fix leak
CPU throttle	Increase requests/limits; reduce per-request CPU; spread cron
Pool exhaustion	Timeouts, faster dependencies, bulkheads — pool guide
Retry storm	Exponential backoff + jitter; max retries; 429/503 rate limit
Cron on API threads	`@Scheduled` on dedicated executor; separate worker deployment
Lock bursts	Shrink critical sections — BLOCKED guide
Aggressive liveness	Liveness only on deadlock/hang; readiness on warm + DB
Downstream blips	Circuit breaker, cache, timeout — concurrency design
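The retry-storm row is the one that is easiest to get subtly wrong, so here is a small sketch of capped exponential backoff with "full jitter": each retry waits a uniformly random time between zero and an exponentially growing ceiling. The class name `Backoff` and the base and cap values in `main` are illustrative assumptions, not recommended settings.

```java
import java.util.concurrent.ThreadLocalRandom;

final class Backoff {

    /** Ceiling for retry number `attempt` (0-based): min(capMs, baseMs * 2^attempt), overflow-safe. */
    static long ceilingMs(int attempt, long baseMs, long capMs) {
        int shift = Math.min(Math.max(attempt, 0), 30);
        // If baseMs exceeds capMs / 2^shift, the doubled value would exceed the cap anyway.
        return baseMs > (capMs >> shift) ? capMs : baseMs << shift;
    }

    /** Full jitter: a uniformly random delay in [0, ceiling], so clients do not retry in lockstep. */
    static long delayMs(int attempt, long baseMs, long capMs) {
        return ThreadLocalRandom.current().nextLong(ceilingMs(attempt, baseMs, capMs) + 1);
    }

    public static void main(String[] args) {
        int maxRetries = 5; // bounded retries; give up and surface the error afterwards
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            System.out.printf("attempt %d: ceiling=%d ms, chosen delay=%d ms%n",
                    attempt, ceilingMs(attempt, 100, 10_000), delayMs(attempt, 100, 10_000));
        }
    }
}
```

The randomization is what breaks up synchronized retries. Exponential growth alone still lets every client that failed at the same moment come back at the same moment.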

Scheduled work isolation (example)

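```
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.TaskScheduler;
import org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler;

@Configuration
class SchedulerConfig {
  // Dedicated pool for scheduled work, so batch jobs never borrow request threads.
  @Bean
  TaskScheduler batchScheduler() {
    ThreadPoolTaskScheduler s = new ThreadPoolTaskScheduler();
    s.setPoolSize(2);                  // small and bounded
    s.setThreadNamePrefix("batch-");   // easy to spot in thread dumps
    return s;
  }
}
// Do NOT run heavy reports on http-nio-* threads
```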

Observability upgrades

Alert on p99.9 and GC pause max, not only averages.
SLO burn rate alerts for 5-minute windows.
JFR enabled in staging always; sample 1% prod if policy allows.
Runbook: thread dump + JFR export when “latency spike” alert fires.

Verify

After fix, same traffic pattern: no p99 spikes above SLO for 7 days.
GC max pause < target (e.g. 50ms for interactive APIs).
Chaos: induce dependency slow path; service sheds load, does not freeze all threads.
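The chaos check above expects the service to shed load instead of letting every thread block on a slow dependency. A minimal sketch of that behavior is a semaphore-based bulkhead that rejects immediately when all permits are taken. `Bulkhead` here is a hypothetical hand-rolled class for illustration, not the API of any resilience library, and it assumes the wrapped call returns a non-null value.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /**
     * Runs the call if a permit is free; otherwise returns empty at once instead of queueing.
     * The supplier must return a non-null value (Optional.of rejects null).
     */
    <T> Optional<T> tryCall(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            return Optional.empty(); // shed load: the caller maps this to a 503 or a fallback
        }
        try {
            return Optional.of(call.get());
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        Bulkhead dbCalls = new Bulkhead(1);
        // While the outer call holds the only permit, the nested call is rejected.
        Optional<Boolean> innerAccepted =
                dbCalls.tryCall(() -> dbCalls.tryCall(() -> "inner").isPresent());
        System.out.println("inner accepted while outer running: " + innerAccepted.orElseThrow());
        System.out.println("accepted after release: " + dbCalls.tryCall(() -> "ok").isPresent());
    }
}
```

Sizing the permit count below the request thread pool size is the design choice that matters: it guarantees some threads remain free to answer health checks and serve requests that do not touch the slow dependency.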

Interview one-liner

“Short unresponsive windows usually show up in p99 and GC or throttle metrics, not the mean. I line up spikes with GC pauses, pool saturation, cron, and retries, use JFR or dumps during the spike, then fix the episodic cause—tuning GC, isolating batch work, or adding backpressure.”
