Service randomly unresponsive for short windows
Scenario
Monitoring shows healthy averages, but users report brief freezes—timeouts for 5–30 seconds, then recovery. p99 latency spikes while p50 looks fine. The issue does not map to a steady deploy or constant overload. You need causes that produce intermittent, internal stalls inside the JVM or host.
After reading, you should be able to:
- List common “micro-outage” causes: GC STW, pool saturation, retries, cgroup throttle, cron spikes.
- Correlate latency spikes with JVM, OS, and Kubernetes signals on a timeline.
- Capture evidence during spikes (JFR, continuous profiling) when manual dumps miss the window.
- Mitigate with tuning, backpressure, and isolation—not only “add replicas.”
Why — short stalls hide in averages
A random unresponsive window means the process (or pod) temporarily cannot accept or complete work fast enough. Clients see timeouts; load balancers may mark instances unhealthy. Because the window is short, mean CPU and mean latency look normal—you must look at max, p99.9, and event-aligned metrics.
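To make the point about averages concrete, here is a minimal sketch with made-up numbers: 10,000 requests at 20 ms, of which 20 hit a 10 s timeout. The class name `TailLatency` and the sample data are assumptions for illustration only. With these numbers the mean is about 40 ms and p99 is still 20 ms; only p99.9 and max show the 10 s stall.

```java
import java.util.Arrays;

/** Hypothetical helper: compares the mean with tail percentiles of a latency sample. */
final class TailLatency {

    /** Nearest-rank percentile over an ascending-sorted array. */
    static double percentile(double[] sortedMillis, double pct) {
        int rank = (int) Math.ceil(pct / 100.0 * sortedMillis.length);
        return sortedMillis[Math.max(rank, 1) - 1];
    }

    static double mean(double[] millis) {
        return Arrays.stream(millis).average().orElse(0.0);
    }

    /** 10,000 requests at 20 ms, except `stalled` requests that hit a 10 s timeout. */
    static double[] sample(int stalled) {
        double[] millis = new double[10_000];
        Arrays.fill(millis, 20.0);
        Arrays.fill(millis, 0, stalled, 10_000.0);
        Arrays.sort(millis);
        return millis;
    }

    public static void main(String[] args) {
        double[] s = sample(20);
        System.out.printf("mean=%.2f p50=%.0f p99=%.0f p99.9=%.0f max=%.0f%n",
                mean(s), percentile(s, 50), percentile(s, 99),
                percentile(s, 99.9), s[s.length - 1]);
    }
}
```

A stall that affects 0.2% of requests is invisible at p99, which is why the alerting threshold has to sit further out in the tail.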
Internal causes (inside your service)
| Cause | Typical duration | Signal |
|---|---|---|
| Stop-the-world GC | 100ms – several s | jvm.gc.pause spike — see GC pauses |
| Full GC / heap pressure | Seconds | Heap near its limit before the spike; occupancy drops sharply after |
| Thread pool full | Until threads free | All workers busy — pool exhausted |
| Lock convoy | Bursts | Many BLOCKED threads in dump |
| Scheduled job on request threads | Aligned to cron | Spike at :00, :15; CPU bump |
| Class loading / JIT deopt | Sub-second to seconds | First traffic after deploy |
| Safepoint sync | Variable | Long time-to-safepoint in JFR |
Infrastructure & dependency causes
- CPU throttling (K8s cgroup cpu.cfs_throttled): JVM threads runnable but starved.
- Noisy neighbor on the node: steal time, disk latency.
- Retry storm — clients retry together; brief overload amplifies.
- DB or cache latency spike — all request threads wait on I/O.
- DNS / connection setup — periodic resolver slowness or TLS handshake burst.
- Pod restart / readiness flap — liveness too aggressive during GC.
- Garbage collection on dependent service — visible as downstream timeout in traces.
Not the same as gradual slowdown. Slow after hours is often leak or queue growth. Random windows point to episodic events: GC, cron, throttle, synchronized bursts.
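For the CPU throttling cause, the kernel's own counters can be read from inside the container. The sketch below assumes cgroup v2 with `cpu.stat` mounted at `/sys/fs/cgroup/cpu.stat` (on cgroup v1 the path differs); the class name `CpuThrottle` is hypothetical. It compares two snapshots and reports the fraction of CFS periods in which the container was throttled.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: derives a throttling ratio from two cpu.stat snapshots. */
final class CpuThrottle {

    /** Parses "key value" lines as found in a cgroup cpu.stat file. */
    static Map<String, Long> parse(String cpuStat) {
        Map<String, Long> out = new HashMap<>();
        for (String line : cpuStat.split("\n")) {
            String[] kv = line.trim().split("\\s+");
            if (kv.length == 2) {
                out.put(kv[0], Long.parseLong(kv[1]));
            }
        }
        return out;
    }

    /** Fraction of CFS periods that were throttled between the two snapshots. */
    static double throttledRatio(Map<String, Long> before, Map<String, Long> after) {
        long periods = after.getOrDefault("nr_periods", 0L) - before.getOrDefault("nr_periods", 0L);
        long throttled = after.getOrDefault("nr_throttled", 0L) - before.getOrDefault("nr_throttled", 0L);
        return periods <= 0 ? 0.0 : (double) throttled / periods;
    }

    public static void main(String[] args) throws Exception {
        Path stat = Path.of("/sys/fs/cgroup/cpu.stat"); // assumed cgroup v2 location
        Map<String, Long> before = parse(Files.readString(stat));
        Thread.sleep(10_000);
        Map<String, Long> after = parse(Files.readString(stat));
        System.out.printf("throttled in %.1f%% of periods%n", 100 * throttledRatio(before, after));
    }
}
```

A high ratio during a latency spike while average CPU usage looks moderate points at the limit rather than at the code.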
What — correlate spikes on a timeline
- Confirm the symptom shape — latency heatmap or p99.9; error rate blips; timeout logs clustered in 10–60s bands.
- Align timestamps (UTC) — app metrics, JVM, node, kube events, deploys, cron schedules, traffic shifts.
- Check GC first (fast win)
  # Prometheus examples
  rate(jvm_gc_pause_seconds_sum[1m])
  jvm_gc_pause_seconds_max
  jvm_gc_memory_promoted_bytes_total
  Pause > 200ms aligned with user reports → tune GC or resize the heap (GC guide).
- CPU throttling vs busy: high container_cpu_cfs_throttled_seconds_total with moderate cpu_usage → limit too low or burst workload.
- Thread pool / connections: Tomcat busy = max; Hikari pending > 0 during spike (pool guide).
- Distributed trace sample — slow requests during window: one long span (DB? external?) vs gap with no child spans (STW or thread starvation).
- Capture during spike (hard but decisive)
  - JFR: continuous recording with low overhead; after the incident, inspect GC, lock, and Thread Park events around the spike.
  - Thread dump: if the spike lasts > 10s, trigger an automated dump on the p99 alert.
  - async-profiler: wall-clock profile for 60s when the alert fires.
- Rule out external “random” — single AZ, single pod, single node? If yes → node or pod issue; if all pods → dependency or deploy.
- Health checks — liveness killing pods during long GC? Readiness removing from LB while still starting?
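The automated thread dump from the capture step can be approximated in-process with the standard `ThreadMXBean`. This is a minimal sketch; the class name `ThreadStateSnapshot` is hypothetical, and how the p99 alert reaches the process is left open as an assumption. It counts threads per state so a burst of BLOCKED or WAITING threads is visible in one log line.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper: summarizes live threads by state when an alert fires. */
final class ThreadStateSnapshot {

    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip locked-monitor and synchronizer details to keep the dump cheap
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByState());
    }
}
```

Passing `true, true` to `dumpAllThreads` adds lock ownership, which helps with lock convoys but costs more while the process is already struggling.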
Decision tree (quick)
Spike aligns with jvm_gc_pause?         → GC tuning / heap / collector
Spike aligns with cpu throttled?        → Raise CPU limit or reduce burst work
Spike aligns with cron (:00)?           → Move job off request pool / separate deployment
All threads in JDBC/socket read?        → Dependency slow or pool too small
Gap in trace, no spans, JVM idle CPU?   → STW GC or safepoint / OS freeze
Only one pod / node?                    → Noisy neighbor, disk, hardware
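The "gap in trace" branch is hard to confirm from request metrics alone. One possible approach, sketched here under the assumption that a spare thread is acceptable, is a probe that sleeps for a short interval and records how late it wakes up. The class name `StallProbe` is hypothetical. A large overshoot means the thread could not run at all, which is consistent with an STW pause, throttling, or a host freeze, but does not tell them apart.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical in-process probe: measures how late a sleeping thread wakes up. */
final class StallProbe {

    /** Sleeps `intervalMillis` repeatedly and returns the worst overshoot in milliseconds. */
    static long worstOvershootMillis(int iterations, long intervalMillis) {
        long worst = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop probing
                break;
            }
            long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            worst = Math.max(worst, elapsedMillis - intervalMillis);
        }
        return worst;
    }

    public static void main(String[] args) {
        // Roughly 10 s of probing at 10 ms resolution
        System.out.println("worst overshoot ms: " + worstOvershootMillis(1_000, 10));
    }
}
```

Exporting the overshoot as a gauge puts it on the same timeline as GC pause and throttling metrics.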
What to log when it happens again
- Request id, pod name, node, GC pause at start/end of request (Micrometer timer on GC events).
- Active threads, queue depth, Hikari pending.
- Downstream dependency latency per call.
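If Micrometer GC timers are not wired in yet, a rough version of "GC at start/end of request" can be taken from the standard `GarbageCollectorMXBean`. The sketch below is an assumption-laden approximation: the class name `GcOverlap` is hypothetical, the counter is JVM-wide rather than per request, and for concurrent collectors it may include time that is not a pause.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: GC time that accumulated while a piece of work ran. */
final class GcOverlap {

    /** Cumulative collection time across all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Runs `work` and returns the GC milliseconds recorded JVM-wide during it. */
    static long gcMillisDuring(Runnable work) {
        long before = totalGcMillis();
        work.run();
        return totalGcMillis() - before;
    }

    public static void main(String[] args) {
        long gcMs = gcMillisDuring(() -> {
            byte[][] junk = new byte[1_000][];
            for (int i = 0; i < junk.length; i++) {
                junk[i] = new byte[1 << 20]; // allocate to provoke some collections
            }
        });
        System.out.println("gc ms during work: " + gcMs);
    }
}
```

Logging this value next to request id, pod name, and node makes it easy to tell a slow request that overlapped a collection from one that did not.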
How — reduce stalls and prove improvement
Mitigations by cause
| Cause | Fix |
|---|---|
| Long GC pauses | Right-size heap; G1/ZGC tuning; reduce allocation; fix leak |
| CPU throttle | Increase requests/limits; reduce per-request CPU; spread cron |
| Pool exhaustion | Timeouts, faster dependencies, bulkheads — pool guide |
| Retry storm | Exponential backoff + jitter; max retries; 429/503 rate limit |
| Cron on API threads | @Scheduled on dedicated executor; separate worker deployment |
| Lock bursts | Shrink critical sections — BLOCKED guide |
| Aggressive liveness | Liveness only on deadlock/hang; readiness on warm + DB |
| Downstream blips | Circuit breaker, cache, timeout — concurrency design |
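The retry storm row is the easiest to get subtly wrong, so here is a minimal sketch of exponential backoff with full jitter. The class name `Backoff` and the parameter values are assumptions; a real client would also cap the number of attempts and honor 429/503 responses.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: exponential backoff with full jitter. */
final class Backoff {

    /** Upper bound for the delay before retry number `attempt` (0-based), capped at maxMillis. */
    static long ceilingMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.min(Math.max(attempt, 0), 30); // bound the shift; assumes a small baseMillis
        return Math.min(maxMillis, baseMillis << shift);
    }

    /** Full jitter: a uniform random delay in [0, ceiling], so clients do not retry in lockstep. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long ceiling = ceilingMillis(attempt, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println("attempt " + attempt + " sleep ms: " + delayMillis(attempt, 100, 5_000));
        }
    }
}
```

Without the random component, clients that failed together retry together, which recreates the brief overload on every attempt.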
Scheduled work isolation (example)
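import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.TaskScheduler;
import org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler;

@Configuration
class SchedulerConfig {
    // Dedicated pool for scheduled jobs (assumes @EnableScheduling is declared elsewhere)
    @Bean
    TaskScheduler batchScheduler() {
        ThreadPoolTaskScheduler s = new ThreadPoolTaskScheduler();
        s.setPoolSize(2);                 // small, fixed pool for batch work
        s.setThreadNamePrefix("batch-");  // makes these threads easy to spot in dumps
        return s;
    }
}
// Do NOT run heavy reports on http-nio-* threads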
Observability upgrades
- Alert on p99.9 and GC pause max, not only averages.
- SLO burn rate alerts for 5-minute windows.
- JFR enabled in staging always; sample 1% prod if policy allows.
- Runbook: thread dump + JFR export when “latency spike” alert fires.
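The JFR part of the runbook can be sketched with the `jdk.jfr` API: keep a rolling recording in memory and on disk, and dump it when the alert fires. The class name `SpikeRecorder`, the recording name, and the 10-minute window in the usage are assumptions; the alert hook itself is not shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.text.ParseException;
import java.time.Duration;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

/** Hypothetical helper: rolling JFR recording that can be dumped on demand. */
final class SpikeRecorder {

    /** Starts a recording that keeps roughly the last `window` of events. */
    static Recording startRolling(Duration window) {
        try {
            // "default" is the low-overhead settings profile shipped with the JDK
            Recording r = new Recording(Configuration.getConfiguration("default"));
            r.setName("latency-spikes"); // hypothetical name
            r.setMaxAge(window);
            r.setToDisk(true);
            r.start();
            return r;
        } catch (IOException | ParseException e) {
            throw new IllegalStateException("JFR default configuration not available", e);
        }
    }

    /** Call from the alert hook: writes the current window to `target` for offline analysis. */
    static void dump(Recording recording, Path target) {
        try {
            recording.dump(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A typical use would be `Recording r = SpikeRecorder.startRolling(Duration.ofMinutes(10));` at startup and `SpikeRecorder.dump(r, Path.of("spike.jfr"))` from the alert handler, so the file covers the minutes before the spike rather than only what follows.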
Verify
- After fix, same traffic pattern: no p99 spikes above SLO for 7 days.
- GC max pause < target (e.g. 50ms for interactive APIs).
- Chaos: induce dependency slow path; service sheds load, does not freeze all threads.
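For the chaos check, "sheds load" needs a mechanism that rejects quickly instead of queueing. One minimal sketch is a semaphore bulkhead around the slow dependency; the class name `Bulkhead` is hypothetical and the mapping of a rejected call to a 503 is an assumption left to the caller.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical bulkhead: limits concurrent calls to one dependency and sheds the rest. */
final class Bulkhead {

    private final Semaphore permits;

    Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /**
     * Runs `call` if a permit is free; otherwise returns empty immediately.
     * `call` must return a non-null value so that empty always means "shed".
     */
    <T> Optional<T> tryCall(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            return Optional.empty(); // shed: do not block a request thread
        }
        try {
            return Optional.of(call.get());
        } finally {
            permits.release();
        }
    }
}
```

With a limit below the size of the request pool, a slow dependency can occupy at most that many threads, and the remaining threads keep serving requests that do not need it.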
Interview one-liner
“Short unresponsive windows usually show up in p99 and GC or throttle metrics, not the mean. I line up spikes with GC pauses, pool saturation, cron, and retries, use JFR or dumps during the spike, then fix the episodic cause—tuning GC, isolating batch work, or adding backpressure.”