High CPU with low traffic

Scenario

CPU on the service is ~90% but HTTP RPS is near idle. Autoscaling wants to add pods—or you are burning cost on hot containers—with no obvious load. What is consuming CPU if not user requests?

After reading, you should be able to:

Separate user-driven CPU from background work (GC, schedulers, agents).
Find hot threads and methods with profiler and thread dumps.
Recognize common patterns: GC thrash, polling loops, debug logging, spin locks.
Fix and verify with CPU metrics at flat RPS.

Why — CPU without requests still has work

CPU time is spent executing instructions on cores. User HTTP traffic is only one source. A JVM process can burn CPU on garbage collection, JIT compilation, scheduled jobs, tight polling loops, health checks, metrics export, or a buggy busy-wait—all while RPS looks low.

Common causes (check these first)

| Cause | Why RPS looks low | Clue |
| --- | --- | --- |
| GC overhead | GC work is not counted as requests | High `process_cpu` with hot GC threads; frequent or long GC pauses |
| Memory pressure / leak | The collector runs constantly | Heap or metaspace climbing; service slows after hours of uptime |
| Tight polling loop | Runs on a background thread | One thread at 100% in profiler; `while(true)` with a short sleep or none |
| Excessive logging | Logs per tick, not per request | DEBUG enabled in prod; sync appenders; huge stack traces |
| Scheduler storm | Triggered by @Scheduled / Quartz / cron, not by requests | CPU spike every minute; overlapping runs |
| Health check / mesh probe load | Probes are internal traffic, usually absent from ingress RPS | Heavy `/health` doing DB checks every 1s × many pods |
| Metrics or APM agent | Instrumentation overhead | CPU drops when the agent is disabled (canary test) |
| Spin / lock contention | Threads burn CPU waiting | Many RUNNABLE threads on the same stack frame |
| Regex / crypto hot path | Triggered by one bad message or job | Profiler shows `Pattern`, `SHA`, BCrypt |
| JIT compilation (short) | After deploy only | CPU high for the first 2–5 min, then normal; may be OK |

Verify “low traffic” correctly. Include internal callers, Kafka consumers, gRPC from other services, and health probes. Low external RPS does not mean low work.

Misleading metrics

CPU averaged across cores—one core at 100% can show 25% on a 4-core limit.
Looking only at ingress RPS—background consumers still run.
Reading the chart in the wrong time zone: a cron job firing at :00 UTC will not line up with local-time traffic peaks.
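
The averaging pitfall above is easy to check with arithmetic. The following is a minimal sketch, not a production metric: it turns a CPU-time delta into "cores used" and then into a percentage of the limit. The class and method names are made up for illustration, `com.sun.management.OperatingSystemMXBean` is a HotSpot/OpenJDK extension that other JVMs may not provide, and `availableProcessors()` is only assumed to reflect the container limit.

```java
import java.lang.management.ManagementFactory;

class CpuShare {
    /** Cores' worth of CPU consumed: CPU time used divided by wall time elapsed. */
    static double coresUsed(long cpuNanosDelta, long wallNanosDelta) {
        if (wallNanosDelta <= 0) {
            throw new IllegalArgumentException("wall time must be positive");
        }
        return (double) cpuNanosDelta / wallNanosDelta;
    }

    /** The same usage expressed as a percentage of the CPU limit (in cores). */
    static double percentOfLimit(double coresUsed, double limitCores) {
        return 100.0 * coresUsed / limitCores;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: running on HotSpot/OpenJDK, where this extension bean exists.
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        long cpu0 = os.getProcessCpuTime();   // nanoseconds, or -1 if unsupported
        long wall0 = System.nanoTime();
        Thread.sleep(1_000);
        double cores = coresUsed(os.getProcessCpuTime() - cpu0, System.nanoTime() - wall0);
        // Assumption: the JVM's processor count stands in for the container limit.
        double limit = Runtime.getRuntime().availableProcessors();
        System.out.printf("cores used: %.2f, percent of limit: %.1f%%%n",
            cores, percentOfLimit(cores, limit));
    }
}
```

One fully busy core is 1.0 cores used, which `percentOfLimit(1.0, 4.0)` reports as 25%. A single spinning thread can therefore hide behind a modest-looking average.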

What — find who burns CPU (in order)

Confirm CPU vs traffic on the same chart: rate(process_cpu_seconds_total[5m]) vs rate(http_server_requests_seconds_count[5m]). Note deploy time, cron alignment, and consumer lag.
Check GC CPU fraction: jstat -gcutil <pid> 1000. If FGC or YGCT climbs fast with low RPS, follow the GC path. Enable -Xlog:gc* and sum the GC worker threads' CPU in the profiler.
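
When attaching jstat is awkward, a similar signal can be sampled from inside the JVM. This is a rough sketch using the standard `GarbageCollectorMXBean` API; the class name and the 10-second window are arbitrary choices. Note that `getCollectionTime()` reports approximate elapsed collection time rather than CPU time, so treat the result as an indicator, not an exact CPU share.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcShare {
    /** Accumulated collection time across all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if this collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Fraction of a wall-clock window spent in collections, clamped to [0, 1]. */
    static double gcFraction(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            return 0.0;
        }
        return Math.max(0.0, Math.min(1.0, (double) gcMillisDelta / wallMillisDelta));
    }

    public static void main(String[] args) throws InterruptedException {
        long gc0 = totalGcMillis();
        long wall0 = System.currentTimeMillis();
        Thread.sleep(10_000); // sampling window; pick whatever suits the service
        double f = gcFraction(totalGcMillis() - gc0, System.currentTimeMillis() - wall0);
        System.out.printf("GC share of the last window: %.1f%%%n", 100 * f);
    }
}
```

A share that stays high while RPS is flat points at the GC path before any profiler is needed.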

Thread dump — RUNNABLE threads

jcmd <pid> Thread.print
# Count threads in RUNNABLE vs WAITING
# Same stack on many RUNNABLE lines → hot loop or spin
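
The counting described in those comments can also be done in-process. The sketch below groups RUNNABLE threads by their top stack frame using `Thread.getAllStackTraces()`; the class name is illustrative. Two caveats apply: a thread's state is read slightly after its stack was captured, and RUNNABLE also covers threads sitting in native I/O calls, so a frequent frame is a lead to investigate rather than proof of a hot loop.

```java
import java.util.Map;
import java.util.TreeMap;

class RunnableHotspots {
    /** Counts RUNNABLE threads per top-of-stack frame ("Class.method"). */
    static Map<String, Integer> runnableTopFrames(Map<Thread, StackTraceElement[]> traces) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : traces.entrySet()) {
            if (e.getKey().getState() != Thread.State.RUNNABLE) {
                continue; // WAITING, TIMED_WAITING and BLOCKED threads are not burning CPU
            }
            StackTraceElement[] stack = e.getValue();
            if (stack.length == 0) {
                continue; // some JVM-internal threads expose no Java frames
            }
            String top = stack[0].getClassName() + "." + stack[0].getMethodName();
            counts.merge(top, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        runnableTopFrames(Thread.getAllStackTraces()).entrySet().stream()
            .sorted((a, b) -> b.getValue() - a.getValue()) // most common frame first
            .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```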

CPU profiler (low overhead): async-profiler with -e cpu -d 60 -f /tmp/cpu.html, or JFR CPU sampling viewed in JDK Mission Control. The top frames show where to fix.

OS top / per-thread CPU

top -H -p <pid>
ps -Lp <pid> -o pid,tid,pcpu,comm --sort=-pcpu | head

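Map the TID to a thread name in the jstack output. The nid=0x... field there is hexadecimal, so convert the decimal TID from top or ps to hex first.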
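
As a sketch of both halves of that step: `toNid` formats a decimal OS thread id in the hexadecimal nid=0x... form, and `main` ranks Java threads by CPU consumed over a short window via the standard `ThreadMXBean`. The class name and the 5-second window are arbitrary. `ThreadMXBean` only sees Java threads, identified by Java thread ids rather than OS TIDs, so GC and JIT compiler threads that show up in top -H will not appear here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

class HotThreads {
    /** Formats a decimal OS thread id in the hex form used by nid=0x... fields. */
    static String toNid(long osTid) {
        return "nid=0x" + Long.toHexString(osTid);
    }

    /** CPU nanoseconds consumed so far, keyed by Java thread id. */
    static Map<Long, Long> cpuNanosByThread(ThreadMXBean mx) {
        Map<Long, Long> cpu = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            long ns = mx.getThreadCpuTime(id); // -1 if the thread has already exited
            if (ns >= 0) {
                cpu.put(id, ns);
            }
        }
        return cpu;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            System.out.println("per-thread CPU time is not available on this JVM");
            return;
        }
        Map<Long, Long> before = cpuNanosByThread(mx);
        Thread.sleep(5_000);
        Map<Long, Long> after = cpuNanosByThread(mx);
        after.entrySet().stream()
            .map(e -> Map.entry(e.getKey(), e.getValue() - before.getOrDefault(e.getKey(), 0L)))
            .sorted((a, b) -> Long.compare(b.getValue(), a.getValue())) // hottest first
            .limit(5)
            .forEach(e -> {
                ThreadInfo info = mx.getThreadInfo(e.getKey());
                String name = (info == null) ? "<exited>" : info.getThreadName();
                System.out.printf("%-40s %6d ms CPU%n", name, e.getValue() / 1_000_000);
            });
    }
}
```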

Logging level and volume: check the prod config. Is the root logger at DEBUG? What does the logs/sec metric show? Is disk I/O wait high?
Scheduled tasks: list @Scheduled beans and Kubernetes cron jobs; runs overlap if job duration > interval.
Isolate with experiments: on one canary pod, disable a non-critical scheduler (feature flag), disable DEBUG, or increase the probe interval, then compare CPU.
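
For the scheduled-task check, one way to both prevent and detect overlap is a small guard around the job. This is a sketch under assumptions: `NonOverlappingJob` is a made-up wrapper, not a Spring or Quartz API, and it assumes that skipping a run is acceptable for the job in question. The skipped-run counter doubles as the overlap signal.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

class NonOverlappingJob implements Runnable {
    private final Runnable delegate;
    private final AtomicBoolean running = new AtomicBoolean(false);
    private final AtomicLong skipped = new AtomicLong();

    NonOverlappingJob(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        // Only one caller can flip false -> true; everyone else skips this tick.
        if (!running.compareAndSet(false, true)) {
            skipped.incrementAndGet();
            return;
        }
        try {
            delegate.run();
        } finally {
            running.set(false); // release even if the job throws
        }
    }

    /** Number of triggers dropped because the previous run was still active. */
    long skippedRuns() {
        return skipped.get();
    }
}
```

Exporting `skippedRuns()` as a metric shows how often the job duration exceeds its interval, which is the condition worth fixing at the source.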

Map profiler output to action

| Hot frame (examples) | Likely fix |
| --- | --- |
| `GarbageCollector.*` | Reduce allocation, fix leak, tune GC, more heap |
| `ch.qos.logback` / Log4j | Level INFO; async appender; reduce log volume |
| `Thread.sleep` missing in loop | Add backoff; replace poll with blocking consumer |
| `java.util.regex` | Fix catastrophic backtracking; precompile patterns |
| `sun.misc.Unsafe.park` low but many RUNNABLE | Lock contention; shrink critical section |
| `org.springframework.scheduling` | Fix cron overlap; move job to worker tier |
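
To illustrate the `java.util.regex` row: a nested quantifier such as `(a+)+b` is the classic shape that can backtrack badly on a long run of `a` with no `b` after it. The sketch below uses a possessive quantifier, which never gives characters back, and compiles the pattern once. The pattern is a toy chosen for illustration, not one taken from any real service.

```java
import java.util.regex.Pattern;

class SafePattern {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    // "a++" is possessive: after consuming the run of 'a' it does not backtrack.
    private static final Pattern RUN_OF_A_THEN_B = Pattern.compile("a++b");

    /** True if the whole input is one or more 'a' followed by a single 'b'. */
    static boolean matches(String input) {
        return RUN_OF_A_THEN_B.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("aaab"));
        // A long non-matching input fails fast instead of exploring every split.
        System.out.println(matches("a".repeat(5_000) + "c"));
    }
}
```

Calling `Pattern.compile` inside a hot method recompiles on every call, so hoisting it into a constant is often the cheaper half of the fix.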

How — fix, verify, guard

Fix by root cause

GC / memory: follow the memory leak and GC tuning steps; verify that the allocation rate drops.
Polling: use a blocking Kafka consumer poll, wait/notify, or an event-driven design; if polling is required, use a minimum sleep with jitter.
Logging: prod at INFO/WARN; async appenders; never log the full payload at DEBUG in a hot path.
Schedulers: @DisallowConcurrentExecution (Quartz); skip if the previous run is still active; move heavy jobs off request pods.
Health checks: keep liveness lightweight; readiness may hit the DB; don't run expensive checks every second.
Crypto: cache verified tokens; use faster algorithms where appropriate; offload to a sidecar only if justified.
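
For the polling fix, the difference between spinning and blocking is small in code and large in CPU. The sketch below uses a plain `BlockingQueue` as a stand-in for whatever the service actually consumes from; the class name and the idle timeout are illustrative, and the timeout exists mainly so that the example terminates.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class BlockingWorker {
    // What to avoid: while (true) { String m = queue.poll(); if (m != null) handle(m); }
    // That loop returns immediately when the queue is empty and pins a core at 100%.

    /** Handles messages until the queue stays empty for idleTimeoutMs; returns the count. */
    static int drain(BlockingQueue<String> queue, long idleTimeoutMs) {
        int handled = 0;
        try {
            while (true) {
                // Timed poll parks the thread while the queue is empty, using no CPU.
                String msg = queue.poll(idleTimeoutMs, TimeUnit.MILLISECONDS);
                if (msg == null) {
                    return handled; // idle for the whole timeout
                }
                handled++; // placeholder for real message handling
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return handled;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("a");
        queue.add("b");
        System.out.println("handled " + drain(queue, 100) + " messages");
    }
}
```

A long-running worker would loop on `take()` or a timed `poll` until shutdown rather than returning on the first idle period.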

Verify the fix

Record baseline CPU at current RPS (even if low).
Deploy fix to canary; wait past JIT warm-up (5+ min).
CPU at the same RPS should drop materially (e.g. 90% → <30% of the CPU limit).
Re-profile 60s to confirm hot method gone.
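
The comparison in these steps can be written down as two checks: traffic was comparable, and CPU fell by a meaningful fraction. The sketch below is one possible way to encode that; the class name, the tolerance, and the minimum-drop threshold are all assumptions to tune per service.

```java
class FixVerifier {
    /** True if canary RPS is within the given relative tolerance of baseline RPS. */
    static boolean sameTraffic(double baselineRps, double canaryRps, double tolerance) {
        double reference = Math.max(baselineRps, 1e-9); // avoid dividing by zero at idle
        return Math.abs(canaryRps - baselineRps) / reference <= tolerance;
    }

    /** True if CPU fell by at least minRelativeDrop (0.5 means "halved or better"). */
    static boolean cpuDropped(double baselineCpuPct, double canaryCpuPct, double minRelativeDrop) {
        if (baselineCpuPct <= 0) {
            return false; // nothing to improve on
        }
        return (baselineCpuPct - canaryCpuPct) / baselineCpuPct >= minRelativeDrop;
    }

    public static void main(String[] args) {
        boolean comparable = sameTraffic(20.0, 21.0, 0.10);
        boolean improved = cpuDropped(90.0, 30.0, 0.50);
        System.out.println("comparable traffic: " + comparable + ", CPU improved: " + improved);
    }
}
```

Checking traffic first matters: a CPU drop that coincides with a drop in RPS says nothing about the fix.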

Kubernetes / autoscaling

Do not scale on CPU alone while CPU is being wasted; fix the waste first.
HPA: consider custom metrics (RPS, queue lag) alongside CPU after fix.
Set requests/limits so throttling does not look like “mystery slowness.”

Prevention

CI: forbid root DEBUG in prod config; lint log statements in hot loops.
Dashboard: CPU vs RPS ratio; alert if CPU > 70% while RPS < 10% of daily peak.
Profile CPU for 30s in staging load test before each major release.
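
The CI rule on root DEBUG could be enforced with a small check like the one below. It is a sketch that assumes a Logback-style XML config with a `<root level="...">` element; the class name is made up, and a level supplied through a variable such as `${LOG_LEVEL}` would need to be resolved before this check can say anything.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class RootLevelCheck {
    /** True if any <root> element is configured at DEBUG or a more verbose level. */
    static boolean rootIsDebugOrLower(String logbackXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(logbackXml.getBytes(StandardCharsets.UTF_8)));
            NodeList roots = doc.getElementsByTagName("root");
            for (int i = 0; i < roots.getLength(); i++) {
                String level = ((Element) roots.item(i)).getAttribute("level");
                if (level.equalsIgnoreCase("DEBUG")
                        || level.equalsIgnoreCase("TRACE")
                        || level.equalsIgnoreCase("ALL")) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            // An unparseable config should fail the build loudly rather than pass silently.
            throw new IllegalArgumentException("could not parse logging config", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration><root level=\"DEBUG\"/></configuration>";
        if (rootIsDebugOrLower(xml)) {
            System.out.println("FAIL: root logger is DEBUG or lower in the prod config");
        }
    }
}
```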

Interview one-liner

“Low RPS doesn’t mean idle—I check GC time, thread dumps for RUNNABLE hot loops, and a CPU profiler. Common wins are GC from leaks, DEBUG logging, tight polling schedulers, and heavy health checks; I verify with CPU at the same traffic after the fix.”
