High CPU with low traffic
Scenario
CPU on the service is ~90% but HTTP RPS is near idle. Autoscaling wants to add pods—or you are burning cost on hot containers—with no obvious load. What is consuming CPU if not user requests?
After reading, you should be able to:
- Separate user-driven CPU from background work (GC, schedulers, agents).
- Find hot threads and methods with profiler and thread dumps.
- Recognize common patterns: GC thrash, polling loops, debug logging, spin locks.
- Fix and verify with CPU metrics at flat RPS.
Why — CPU without requests still has work
CPU time is spent executing instructions on cores. User HTTP traffic is only one source. A JVM process can burn CPU on garbage collection, JIT compilation, scheduled jobs, tight polling loops, health checks, metrics export, or a buggy busy-wait—all while RPS looks low.
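As a rough illustration of CPU work that never shows up as requests, the sketch below reads accumulated GC time from the JVM's own management beans and expresses it as a fraction of a wall-clock window. The class and method names (`GcTimeFraction`, `totalGcMillis`, `fraction`) are hypothetical. The value is elapsed collection time as reported by each collector, so treat it as an approximate signal rather than exact GC CPU, particularly with parallel or concurrent collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcTimeFraction {

    /** Sum of accumulated collection time (ms) over all collectors of this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Share of a wall-clock window spent collecting, clamped to [0, 1]. */
    static double fraction(long gcMillisDelta, long windowMillis) {
        if (windowMillis <= 0) {
            return 0.0;
        }
        double f = (double) gcMillisDelta / windowMillis;
        return Math.max(0.0, Math.min(1.0, f));
    }

    public static void main(String[] args) throws InterruptedException {
        long windowMillis = 5_000; // assumed sampling window
        long before = totalGcMillis();
        Thread.sleep(windowMillis);
        long after = totalGcMillis();
        System.out.printf("GC time fraction over %d ms: %.3f%n",
                windowMillis, fraction(after - before, windowMillis));
    }
}
```

A fraction that stays high while RPS is flat points toward the GC rows in the table of common causes.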
Common causes (check these first)
| Cause | Why RPS looks low | Clue |
|---|---|---|
| GC overhead | GC is not “requests” | High process_cpu + GC threads hot; see GC pauses |
| Memory pressure / leak | Constant collection | Heap or metaspace climbing steadily; service slows down after hours of uptime |
| Tight polling loop | Background thread | One thread at 100% in profiler; while(true) with short sleep or no sleep |
| Excessive logging | Logs per tick, not per request | DEBUG enabled in prod; sync appenders; huge stack traces |
| Scheduler storm | @Scheduled / Quartz / cron | CPU spike every minute; overlapping runs |
| Health check / mesh probe load | Probes are internal traffic, usually not counted in user RPS | Heavy /health doing DB checks every 1s × many pods |
| Metrics or APM agent | Instrumentation overhead | CPU drop when agent disabled (canary test) |
| Spin / lock contention | Threads burn CPU waiting | Many threads in RUNNABLE on same stack frame |
| Regex / crypto hot path | Triggered by one bad message or job | Profiler shows Pattern, SHA, BCrypt |
| JIT compilation (short) | After deploy only | CPU high first 2–5 min then normal—may be OK |
Verify “low traffic” correctly. Include internal callers, Kafka consumers, gRPC from other services, and health probes. Low external RPS does not mean low work.
Misleading metrics
- CPU averaged across cores: one core at 100% shows as 25% on a 4-core limit.
- Looking only at ingress RPS: background consumers still run.
- Comparing against a peak in the wrong time zone: a cron job that fires at :00 UTC may not line up with a dashboard shown in local time.
What — find who burns CPU (in order)
1. Confirm CPU vs traffic on the same chart
   rate(process_cpu_seconds_total[5m]) vs rate(http_server_requests_seconds_count[5m]). Note deploy time, cron alignment, consumer lag.
2. Check GC CPU fraction
   jstat -gcutil <pid> 1000: if FGC or YGCT climb fast with low RPS → GC path. JVM flag: -Xlog:gc*, and sum GC worker CPU in the profiler.
3. Thread dump: RUNNABLE threads
       jcmd <pid> Thread.print
       # Count threads in RUNNABLE vs WAITING
       # Same stack on many RUNNABLE threads → hot loop or spin
4. CPU profiler (low overhead)
   async-profiler: -e cpu -d 60 -f /tmp/cpu.html, or JFR CPU samples in JDK Mission Control. Top frames = where to fix.
5. OS top / per-thread CPU
       top -H -p <pid>
       ps -Lp <pid> -o pid,tid,pcpu,comm --sort=-pcpu | head
   Map the TID to a thread name in jstack (nid=0x...). The TID from top is decimal; nid is the same value in hexadecimal.
6. Logging level and volume
   Check prod config: is the root logger at DEBUG? What does the logs/sec metric show? Is disk I/O wait high?
7. Scheduled tasks
   List @Scheduled beans and cron in K8s; runs overlap if job duration > interval.
8. Isolate with experiments
   Disable a non-critical scheduler (feature flag), disable DEBUG, or increase the probe interval on one canary pod, then compare CPU.
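When shell access to the pod is awkward, the per-thread view can be approximated from inside the JVM. The sketch below samples per-thread CPU time through the standard `ThreadMXBean` and also shows the decimal-to-hex conversion needed to match a TID from top against nid in a thread dump. `HotThreads`, `Sample`, `toNid`, and `sample` are hypothetical names. It only sees Java threads, so GC and JIT worker threads will not appear here and still need the OS view or a profiler.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HotThreads {

    /** One thread's CPU usage over the sampling window. */
    static final class Sample {
        final String name;
        final Thread.State state;
        final long cpuNanos;

        Sample(String name, Thread.State state, long cpuNanos) {
            this.name = name;
            this.state = state;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Decimal OS thread id (as printed by top -H) to the nid=0x... form of a thread dump. */
    static String toNid(long osTid) {
        return "0x" + Long.toHexString(osTid);
    }

    /** Up to topN Java threads ordered by CPU time consumed during the window, busiest first. */
    static List<Sample> sample(long windowMillis, int topN) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return new ArrayList<>(); // per-thread CPU timing unavailable on this JVM
        }
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id)); // -1 if the thread is no longer alive
        }
        try {
            Thread.sleep(windowMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the flag and report what we have
        }
        List<Sample> out = new ArrayList<>();
        for (long id : mx.getAllThreadIds()) {
            long start = before.getOrDefault(id, 0L); // threads born in the window start at 0
            long end = mx.getThreadCpuTime(id);
            ThreadInfo info = mx.getThreadInfo(id);
            if (info == null || start < 0 || end < 0) {
                continue;
            }
            out.add(new Sample(info.getThreadName(), info.getThreadState(), end - start));
        }
        out.sort((a, b) -> Long.compare(b.cpuNanos, a.cpuNanos));
        return new ArrayList<>(out.subList(0, Math.min(topN, out.size())));
    }

    public static void main(String[] args) {
        for (Sample s : sample(1_000, 5)) {
            System.out.printf("%-40s %-13s %8.1f ms%n", s.name, s.state, s.cpuNanos / 1e6);
        }
    }
}
```

A single thread that uses close to the whole window of CPU time is a likely polling loop or spin; its name can then be looked up in the thread dump for the stack.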
Map profiler output to action
| Hot frame (examples) | Likely fix |
|---|---|
| GarbageCollector.* | Reduce allocation, fix leak, tune GC, more heap |
| ch.qos.logback / Log4j | Level INFO; async appender; reduce log volume |
| Thread.sleep missing in loop | Add backoff; replace poll with blocking consumer |
| java.util.regex | Fix catastrophic backtracking; precompile patterns |
| sun.misc.Unsafe.park low but many RUNNABLE | Lock contention; shrink critical section |
| org.springframework.scheduling | Fix cron overlap; move job to worker tier |
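To make the java.util.regex row concrete, here is a minimal sketch of the two fixes it names: compile the pattern once, and remove the nested quantifier that causes backtracking. The class name `SafePatterns` and the pattern itself are illustrative assumptions, not taken from any real service. A pattern such as (a+)+b can take exponential time on a long run of a with no trailing b; the possessive form below fails fast on the same input.

```java
import java.util.regex.Pattern;

final class SafePatterns {

    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    // The possessive quantifier a++ never gives characters back, so a failed
    // match does not explore the backtracking paths that (a+)+b would.
    static final Pattern RUN_OF_A_THEN_B = Pattern.compile("a++b");

    static boolean matches(String input) {
        return RUN_OF_A_THEN_B.matcher(input).matches();
    }

    public static void main(String[] args) {
        String hostile = "a".repeat(40); // no trailing b: the bad case for (a+)+b
        System.out.println(matches("aaab"));  // true
        System.out.println(matches(hostile)); // false, returned immediately
    }
}
```

Possessive quantifiers change matching semantics, so a rewrite like this needs the same test inputs the original pattern was validated against.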
How — fix, verify, guard
Fix by root cause
- GC / memory: follow the memory leak and GC tuning guides; verify that the allocation rate drops.
- Polling: use a blocking Kafka consumer poll, wait/notify, or an event-driven design; if polling is required, sleep for a minimum interval with jitter.
- Logging: prod at INFO/WARN; async appenders; never log full payloads at DEBUG in a hot path.
- Schedulers: @DisallowConcurrentExecution (Quartz); skip a run if the previous one is still active; move heavy jobs off request pods.
- Health checks: keep liveness lightweight; readiness may hit the DB; don't run expensive checks every second.
- Crypto: cache verified tokens; use faster algorithms where appropriate; offload to a sidecar only if justified.
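Two of the fixes above, blocking instead of spinning and skipping overlapping scheduler runs, can be sketched with plain java.util.concurrent types. This is a simplified stand-in under stated assumptions: `BackgroundWorkFixes`, `drain`, and `SkipIfRunning` are hypothetical names, and an in-memory queue stands in for whatever broker or scheduler the service actually uses.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class BackgroundWorkFixes {

    /**
     * Handles items until the queue stays empty for idleTimeoutMillis.
     * poll(timeout) parks the thread while waiting, so an empty queue costs
     * no CPU, unlike a while(true) loop around a non-blocking check.
     */
    static int drain(BlockingQueue<String> queue, long idleTimeoutMillis) {
        int handled = 0;
        try {
            while (queue.poll(idleTimeoutMillis, TimeUnit.MILLISECONDS) != null) {
                handled++; // real message handling would go here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop draining, keep the flag
        }
        return handled;
    }

    /** Runs a job only if the previous run has finished; otherwise skips this tick. */
    static final class SkipIfRunning {
        private final AtomicBoolean running = new AtomicBoolean(false);

        boolean tryRun(Runnable job) {
            if (!running.compareAndSet(false, true)) {
                return false; // previous run still active
            }
            try {
                job.run();
                return true;
            } finally {
                running.set(false);
            }
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("a");
        queue.add("b");
        System.out.println("handled: " + drain(queue, 100));

        SkipIfRunning guard = new SkipIfRunning();
        System.out.println("ran: " + guard.tryRun(() -> { }));
    }
}
```

The guard only protects a single JVM; overlap across pods would need a shared lock or the scheduler's own clustering support.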
Verify the fix
- Record baseline CPU at current RPS (even if low).
- Deploy fix to canary; wait past JIT warm-up (5+ min).
- CPU at same RPS should drop materially (e.g. 90% → <30% on limit).
- Re-profile 60s to confirm hot method gone.
Kubernetes / autoscaling
- Do not scale on CPU alone while CPU is being wasted; fix the waste first.
- HPA: consider custom metrics (RPS, queue lag) alongside CPU after fix.
- Set requests/limits so throttling does not look like “mystery slowness.”
Prevention
- CI: forbid root DEBUG in prod config; lint log statements in hot loops.
- Dashboard: CPU vs RPS ratio; alert if CPU > 70% while RPS < 10% of daily peak.
- Profile CPU for 30s in staging load test before each major release.
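The dashboard alert from the list above reduces to a simple predicate. The sketch below encodes it with the thresholds given there (CPU above 70%, RPS below 10% of daily peak); `IdleCpuAlert` and `shouldAlert` are hypothetical names, and in practice the rule would more likely live in the monitoring system than in application code.

```java
final class IdleCpuAlert {

    static final double CPU_THRESHOLD = 0.70;      // fraction of the CPU limit
    static final double RPS_SHARE_OF_PEAK = 0.10;  // fraction of daily peak RPS

    /** True when CPU is high although traffic is far below the daily peak. */
    static boolean shouldAlert(double cpuFraction, double currentRps, double dailyPeakRps) {
        if (dailyPeakRps <= 0) {
            return false; // no baseline yet, so the ratio is meaningless
        }
        return cpuFraction > CPU_THRESHOLD && currentRps < RPS_SHARE_OF_PEAK * dailyPeakRps;
    }

    public static void main(String[] args) {
        System.out.println(shouldAlert(0.90, 5, 1_000));   // true: hot and nearly idle
        System.out.println(shouldAlert(0.90, 500, 1_000)); // false: traffic explains the CPU
    }
}
```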
Interview one-liner
“Low RPS doesn’t mean idle—I check GC time, thread dumps for RUNNABLE hot loops, and a CPU profiler. Common wins are GC from leaks, DEBUG logging, tight polling schedulers, and heavy health checks; I verify with CPU at the same traffic after the fix.”