Service randomly unresponsive for short windows

Scenario

Monitoring shows healthy averages, but users report brief freezes: requests time out for 5–30 seconds, then the service recovers. p99 latency spikes while p50 looks fine. The issue does not line up with a particular deploy or with sustained overload. You need causes that produce intermittent stalls inside the JVM or on the host.

After reading, you should be able to:

Why — short stalls hide in averages

A random unresponsive window means the process (or pod) temporarily cannot accept or complete work fast enough. Clients see timeouts; load balancers may mark instances unhealthy. Because the window is short, mean CPU and mean latency look normal—you must look at max, p99.9, and event-aligned metrics.
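
To make the point about averages concrete, here is a minimal sketch with synthetic numbers (assumed for illustration, not taken from a real service): 10,000 requests at 20 ms, of which 30 land inside a 10 s stall. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration: a short stall barely moves the mean or p50,
// but dominates the high percentiles and the max.
class LatencySummary {

    /** Nearest-rank percentile over a sorted array; p is in (0, 100]. */
    static long percentile(long[] sortedMillis, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedMillis.length);
        return sortedMillis[Math.max(0, rank - 1)];
    }

    static double mean(long[] millis) {
        return Arrays.stream(millis).average().orElse(0.0);
    }

    /** Builds a sorted synthetic sample: `total` requests, the last `stalled` of them slow. */
    static long[] sample(int total, int stalled, long fastMs, long slowMs) {
        long[] out = new long[total];
        Arrays.fill(out, fastMs);
        Arrays.fill(out, total - stalled, total, slowMs);
        return out; // sorted as long as slowMs >= fastMs
    }

    public static void main(String[] args) {
        long[] latencies = sample(10_000, 30, 20, 10_000);
        System.out.printf("mean=%.1f ms p50=%d p99=%d p99.9=%d max=%d%n",
                mean(latencies),
                percentile(latencies, 50),
                percentile(latencies, 99),
                percentile(latencies, 99.9),
                latencies[latencies.length - 1]);
    }
}
```

With this sample the mean stays under 50 ms and p50 stays at 20 ms, while p99.9 and max jump to 10 s. A stall touching only 0.3% of requests can even stay below p99, which is why the higher percentiles and max matter.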

Internal causes (inside your service)

Cause                             Typical duration       Signal
Stop-the-world GC                 100 ms to several s    jvm.gc.pause spike (see GC pauses)
Full GC / heap pressure           Seconds                Heap high before spike; allocation rate drop after
Thread pool full                  Until threads free     All workers busy; pool exhausted
Lock convoy                       Bursts                 Many BLOCKED threads in dump
Scheduled job on request threads  Aligned to cron        Spike at :00, :15; CPU bump
Class loading / JIT deopt         Sub-second to seconds  First traffic after deploy
Safepoint sync                    Variable               Long time-to-safepoint in JFR
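
If no metrics pipeline is in place yet, the GC signal can be approximated from inside the process. The sketch below uses the standard `GarbageCollectorMXBean` API and reports how much collection time accumulated between two samples. The class name `GcTimeSampler` is hypothetical. Treat the number as a rough indicator: `getCollectionTime()` is an approximate cumulative elapsed time, and for concurrent collectors it is not the same thing as stop-the-world pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: samples cumulative GC time and reports the delta per interval.
class GcTimeSampler {
    private long lastTotalMillis = totalGcMillis();

    /** Sum of approximate cumulative collection time across all collectors, in ms. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Collection time accumulated since the previous call (or since construction). */
    long deltaMillis() {
        long now = totalGcMillis();
        long delta = now - lastTotalMillis;
        lastTotalMillis = now;
        return delta;
    }

    public static void main(String[] args) throws InterruptedException {
        GcTimeSampler sampler = new GcTimeSampler();
        for (int i = 0; i < 5; i++) {
            Thread.sleep(1_000);
            System.out.println("GC time in last interval: " + sampler.deltaMillis() + " ms");
        }
    }
}
```

A large delta in the same second as a user-reported freeze points toward the first two rows of the table; a delta near zero suggests looking at pools, locks, or the host instead.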

Infrastructure & dependency causes

This is not the same as gradual slowdown. A service that gets slower over hours usually has a leak or a growing queue. Random windows point to episodic events: GC, cron, throttling, synchronized bursts.

What — correlate spikes on a timeline

  1. Confirm the symptom shape — latency heatmap or p99.9; error rate blips; timeout logs clustered in 10–60s bands.
  2. Align timestamps (UTC) — app metrics, JVM, node, kube events, deploys, cron schedules, traffic shifts.
  3. Check GC first (fast win)
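    # Prometheus examples
    # Seconds spent in GC pauses per second of wall time, averaged over 1m
    rate(jvm_gc_pause_seconds_sum[1m])
    # Counter of bytes promoted to the old generation; wrap in rate() to see promotion pressure
    jvm_gc_memory_promoted_bytes_total
    A pause longer than 200 ms that lines up with user reports → tune the collector or resize the heap (see the GC guide).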
  4. CPU throttling vs busy — high container_cpu_cfs_throttled_seconds_total with moderate cpu_usage → limit too low or burst workload.
  5. Thread pool / connections — Tomcat busy = max; Hikari pending > 0 during spike — pool guide.
  6. Distributed trace sample — slow requests during window: one long span (DB? external?) vs gap with no child spans (STW or thread starvation).
  7. Capture during spike (hard but decisive)
    • JFR: keep a continuous recording running with low overhead; after the incident, inspect the GC, lock, and Thread Park events around the spike.
    • Thread dump: if a spike lasts longer than 10 s, trigger an automated dump from the p99 alert.
    • async-profiler: capture a 60 s wall-clock profile when the alert fires.
  8. Rule out external “random” — single AZ, single pod, single node? If yes → node or pod issue; if all pods → dependency or deploy.
  9. Health checks — liveness killing pods during long GC? Readiness removing from LB while still starting?
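
For the thread-dump step, a full dump is most useful, but even a per-state count taken at alert time helps separate lock convoys (many BLOCKED) from dependency waits (many WAITING or TIMED_WAITING). The following is a minimal sketch using the standard `ThreadMXBean` API; the class name `ThreadStateSnapshot` and the idea of calling it from an alert hook are assumptions for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: counts live threads by state, e.g. when a p99 alert fires.
class ThreadStateSnapshot {

    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // false, false: skip locked-monitor and locked-synchronizer details to keep the call cheap
        ThreadInfo[] infos = mx.dumpAllThreads(false, false);
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : infos) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints a map of thread state to count; the exact numbers depend on the JVM.
        System.out.println(countByState());
    }
}
```

Logging this map with a UTC timestamp makes it easy to place on the same timeline as the GC and throttling metrics from the earlier steps.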

Decision tree (quick)

Spike aligns with jvm_gc_pause?        → GC tuning / heap / collector
Spike aligns with cpu throttled?       → Raise CPU limit or reduce burst work
Spike aligns with cron (:00)?          → Move job off request pool / separate deployment
All threads in JDBC/socket read?       → Dependency slow or pool too small
Gap in trace, no spans, JVM idle CPU?  → STW GC or safepoint / OS freeze
Only one pod / node?                   → Noisy neighbor, disk, hardware
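
The "gap in trace, no spans" branch is hard to confirm from request data alone, because nothing is recorded while the whole process is frozen. One common trick, sketched here under assumptions, is a background thread that sleeps for a fixed interval and measures how late it wakes up. A large overshoot means the whole JVM (or the container) was not running, whatever the cause. The class name `PauseDetector` and the 10 ms / 100 ms values are illustrative choices, not recommendations from the original text.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stall detector: a late wake-up implies the process itself was paused.
class PauseDetector implements Runnable {
    private final long intervalNanos;
    private final long thresholdNanos;

    PauseDetector(long intervalMillis, long thresholdMillis) {
        this.intervalNanos = TimeUnit.MILLISECONDS.toNanos(intervalMillis);
        this.thresholdNanos = TimeUnit.MILLISECONDS.toNanos(thresholdMillis);
    }

    /** How far past the requested sleep the thread actually resumed (never negative). */
    static long overshootNanos(long intervalNanos, long elapsedNanos) {
        return Math.max(0L, elapsedNanos - intervalNanos);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                TimeUnit.NANOSECONDS.sleep(intervalNanos);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long overshoot = overshootNanos(intervalNanos, System.nanoTime() - start);
            if (overshoot >= thresholdNanos) {
                System.err.printf("process stalled for about %d ms%n",
                        TimeUnit.NANOSECONDS.toMillis(overshoot));
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(new PauseDetector(10, 100), "pause-detector");
        t.setDaemon(true);
        t.start();
        Thread.sleep(1_000); // in a real service the detector would run for the JVM's lifetime
    }
}
```

The detector cannot tell GC from safepoint sync, CPU throttling, or an OS freeze on its own; it only confirms that a whole-process stall happened and when, which is the timestamp to correlate against the other signals.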

What to log when it happens again

How — reduce stalls and prove improvement

Mitigations by cause

Cause                Fix
Long GC pauses       Right-size heap; G1/ZGC tuning; reduce allocation; fix leak
CPU throttle         Increase requests/limits; reduce per-request CPU; spread cron
Pool exhaustion      Timeouts, faster dependencies, bulkheads (see pool guide)
Retry storm          Exponential backoff + jitter; max retries; 429/503 rate limit
Cron on API threads  @Scheduled on dedicated executor; separate worker deployment
Lock bursts          Shrink critical sections (see BLOCKED guide)
Aggressive liveness  Liveness only on deadlock/hang; readiness on warm + DB
Downstream blips     Circuit breaker, cache, timeout (see concurrency design)
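
The retry-storm row can be sketched as follows. This is one common variant ("full jitter": a uniformly random delay between zero and an exponentially growing ceiling), shown as an illustration and not as the only valid scheme. The class name `Backoff` and the base and cap values are assumptions.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: exponential backoff with full jitter and a hard cap.
class Backoff {

    /** Upper bound of the delay for a 0-based attempt: min(cap, base * 2^attempt). */
    static long ceilingMillis(int attempt, long baseMillis, long capMillis) {
        long delay = Math.min(baseMillis, capMillis);
        for (int i = 0; i < attempt && delay < capMillis; i++) {
            // Double without overflowing past the cap.
            delay = (delay > capMillis / 2) ? capMillis : delay * 2;
        }
        return delay;
    }

    /** Uniformly random delay in [0, ceiling], so retries from many clients spread out. */
    static long jitteredMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = ceilingMillis(attempt, baseMillis, capMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        int maxRetries = 5; // always bound the number of attempts
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            System.out.printf("attempt %d: ceiling=%d ms, sleep=%d ms%n",
                    attempt,
                    ceilingMillis(attempt, 100, 5_000),
                    jitteredMillis(attempt, 100, 5_000));
        }
    }
}
```

Without jitter, clients that failed together retry together, which recreates the synchronized burst that caused the stall in the first place.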

Scheduled work isolation (example)

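import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.TaskScheduler;
import org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler;

// Dedicated scheduler so batch jobs do not compete with request-handling threads.
@Configuration
class SchedulerConfig {
  @Bean
  TaskScheduler batchScheduler() {
    ThreadPoolTaskScheduler s = new ThreadPoolTaskScheduler();
    s.setPoolSize(2);                 // small fixed pool reserved for batch work
    s.setThreadNamePrefix("batch-");  // makes batch threads easy to spot in thread dumps
    return s;                         // Spring initializes the scheduler as part of bean creation
  }
}
// Do NOT run heavy reports on http-nio-* threads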

Observability upgrades

Verify

  1. After fix, same traffic pattern: no p99 spikes above SLO for 7 days.
  2. GC max pause < target (e.g. 50ms for interactive APIs).
  3. Chaos: induce dependency slow path; service sheds load, does not freeze all threads.
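
The chaos check in step 3 only passes if the service has some way to shed load. One minimal form, sketched here as an assumption about how it could be done, is a semaphore bulkhead around each dependency: when all permits are taken, new calls fail fast with a fallback instead of tying up another request thread. The class name `Bulkhead` is hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical bulkhead: caps concurrent calls to one dependency and sheds the rest.
class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the task if a permit is free; otherwise returns the fallback immediately. */
    <T> T call(Supplier<T> task, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // shed load: do not block the caller
        }
        try {
            return task.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        Bulkhead db = new Bulkhead(1);
        // The nested call arrives while the only permit is held, so it is shed.
        String result = db.call(() -> db.call(() -> "inner ran", "inner shed"), "outer shed");
        System.out.println(result); // prints "inner shed"
    }
}
```

With the dependency slowed down artificially, shed calls should show up as fast fallbacks or 503 responses while unrelated endpoints keep answering, which is the behavior the check is looking for.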

Interview one-liner

“Short unresponsive windows usually show up in p99 and GC or throttle metrics, not the mean. I line up spikes with GC pauses, pool saturation, cron, and retries, use JFR or dumps during the spike, then fix the episodic cause—tuning GC, isolating batch work, or adding backpressure.”
