High CPU with low traffic

Scenario

CPU on the service is ~90% but HTTP RPS is near idle. Autoscaling wants to add pods—or you are burning cost on hot containers—with no obvious load. What is consuming CPU if not user requests?

After reading, you should be able to:

Why — CPU without requests still has work

CPU time is spent executing instructions on cores. User HTTP traffic is only one source. A JVM process can burn CPU on garbage collection, JIT compilation, scheduled jobs, tight polling loops, health checks, metrics export, or a buggy busy-wait—all while RPS looks low.
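
A minimal sketch of the point above, assuming a HotSpot-style JVM where per-thread CPU timing is supported and enabled. It runs two waits of the same wall-clock length with no requests involved, one busy-wait and one blocking sleep, and compares the CPU time each consumed. The class and method names (`IdleCpuDemo`, `cpuNanosOf`, `spinFor`, `sleepFor`) are illustrative, not from any library.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class IdleCpuDemo {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Runs the task on the calling thread and returns the CPU time it consumed, in nanoseconds. */
    static long cpuNanosOf(Runnable task) {
        long before = THREADS.getCurrentThreadCpuTime();
        task.run();
        return THREADS.getCurrentThreadCpuTime() - before;
    }

    /** Busy-wait: re-checks the clock in a loop, so the core stays busy for the whole wait. */
    static void spinFor(long millis) {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            Thread.onSpinWait(); // a hint only; the thread still occupies the core
        }
    }

    /** Blocking wait: the thread is descheduled and uses almost no CPU. */
    static void sleepFor(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long spin = cpuNanosOf(() -> spinFor(200));
        long sleep = cpuNanosOf(() -> sleepFor(200));
        System.out.printf("busy-wait: %d ms CPU, sleep: %d ms CPU%n",
                spin / 1_000_000, sleep / 1_000_000);
    }
}
```

Both waits serve zero requests, yet the busy-wait should report CPU time close to its wall-clock duration while the sleep reports close to none.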

Common causes (check these first)

Cause | Why RPS looks low | Clue
GC overhead | GC is not “requests” | High process_cpu + GC threads hot; see GC pauses
Memory pressure / leak | Constant collection | Heap or metaspace climbing; leak, slow after hours
Tight polling loop | Background thread | One thread at 100% in profiler; while(true) with short sleep or no sleep
Excessive logging | Logs per tick, not per request | DEBUG enabled in prod; sync appenders; huge stack traces
Scheduler storm | @Scheduled / Quartz / cron | CPU spike every minute; overlapping runs
Health check / mesh probe load | Probes count as traffic internally | Heavy /health doing DB checks every 1s × many pods
Metrics or APM agent | Instrumentation overhead | CPU drop when agent disabled (canary test)
Spin / lock contention | Threads burn CPU waiting | Many threads in RUNNABLE on same stack frame
Regex / crypto hot path | Triggered by one bad message or job | Profiler shows Pattern, SHA, BCrypt
JIT compilation (short) | After deploy only | CPU high first 2–5 min then normal; may be OK
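
The tight polling loop cause is the easiest of these to picture in code. The sketch below is one possible shape of the fix, assuming the background work arrives on an in-memory `BlockingQueue`; the class name, the `consume` helper, and the 500 ms timeout are illustrative choices, not taken from the service in question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BlockingConsumerSketch {

    // Anti-pattern, shown for contrast only: spins at 100% CPU while the queue is empty.
    //   while (running) {
    //       String msg = queue.poll();      // returns null immediately when empty
    //       if (msg != null) handle(msg);
    //   }

    /**
     * Collects up to `expected` messages, parking the thread while the queue is empty.
     * The timeout only bounds how long the loop waits before re-checking its condition.
     */
    static List<String> consume(BlockingQueue<String> queue, int expected) {
        List<String> handled = new ArrayList<>();
        try {
            while (handled.size() < expected) {
                String msg = queue.poll(500, TimeUnit.MILLISECONDS); // blocks, no spinning
                if (msg != null) {
                    handled.add(msg);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and stop consuming
        }
        return handled;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                try {
                    Thread.sleep(100);       // simulate sparse traffic
                    queue.put("msg-" + i);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        producer.start();
        System.out.println(consume(queue, 3));
        producer.join();
    }
}
```

With the timed `poll`, the consumer thread shows up as TIMED_WAITING in a thread dump between messages instead of RUNNABLE.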

Verify “low traffic” correctly. Include internal callers, Kafka consumers, gRPC from other services, and health probes. Low external RPS does not mean low work.

Misleading metrics

What — find who burns CPU (in order)

  1. Confirm CPU vs traffic on the same chart: rate(process_cpu_seconds_total[5m]) vs rate(http_server_requests_seconds_count[5m]). Note deploy time, cron alignment, consumer lag.
  2. Check GC CPU fraction: jstat -gcutil <pid> 1000. If the GC time columns (YGCT, FGCT, GCT) climb fast with low RPS, follow the GC path. Enable GC logging with the JVM flag -Xlog:gc* and sum GC worker CPU in the profiler.
  3. Thread dump — RUNNABLE threads
    jcmd <pid> Thread.print
    # Count threads in RUNNABLE vs WAITING
    # Same stack on many RUNNABLE lines → hot loop or spin
  4. CPU profiler (low overhead): async-profiler with -e cpu -d 60 -f /tmp/cpu.html <pid>, or JFR execution samples viewed in JDK Mission Control. Top frames = where to fix.
  5. OS top / per-thread CPU
    top -H -p <pid>
    ps -Lp <pid> -o pid,tid,pcpu,comm --sort=-pcpu | head
    # Map TID to a thread name in jstack: nid=0x... is the TID in hex
    printf '0x%x\n' <tid>
  6. Logging level and volume: check the prod config. Is the root logger at DEBUG? What does the logs/sec metric show? Is disk I/O wait high?
  7. Scheduled tasks: list @Scheduled beans and cron in K8s; runs overlap if job duration > interval.
  8. Isolate with experiments: on one canary pod, disable a non-critical scheduler (feature flag), disable DEBUG, or increase the probe interval, then compare CPU.
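
Steps 2 and 3 can also be approximated from inside the JVM when attaching jstat or jcmd is inconvenient. The sketch below uses only standard `java.lang.management` and `Thread` APIs; the class and method names are illustrative. Two caveats apply: `getCollectionTime()` reports approximate elapsed collection time rather than CPU time, so with parallel GC workers the real CPU cost can be higher, and Java's RUNNABLE state also covers threads blocked in native I/O, so a frame with many RUNNABLE threads is a lead to confirm with a profiler rather than proof of a hot loop.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;

public class CpuTriage {

    /** Total accumulated collection time across all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Fraction of a wall-clock window spent in GC, given two totalGcMillis() samples. */
    static double gcFraction(long gcMillisStart, long gcMillisEnd, long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        return (gcMillisEnd - gcMillisStart) / (double) windowMillis;
    }

    /** Counts RUNNABLE threads by their top stack frame. */
    static Map<String, Integer> runnableByTopFrame() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = e.getValue();
            if (e.getKey().getState() == Thread.State.RUNNABLE && stack.length > 0) {
                counts.merge(stack[0].toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        long gcStart = totalGcMillis();
        Thread.sleep(1000); // sampling window
        System.out.printf("GC fraction over 1s: %.3f%n",
                gcFraction(gcStart, totalGcMillis(), 1000));
        runnableByTopFrame().forEach((frame, n) -> System.out.println(n + "  " + frame));
    }
}
```

A GC fraction that stays high while RPS is near idle points at the GC path; many threads sharing one top frame points at a spin or a hot loop.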

Map profiler output to action

Hot frame (examples) | Likely fix
GarbageCollector.* | Reduce allocation, fix leak, tune GC, more heap
ch.qos.logback / Log4j | Level INFO; async appender; reduce log volume
Thread.sleep missing in loop | Add backoff; replace poll with blocking consumer
java.util.regex | Fix catastrophic backtracking; precompile patterns
sun.misc.Unsafe.park low but many RUNNABLE | Lock contention; shrink critical section
org.springframework.scheduling | Fix cron overlap; move job to worker tier
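
For the java.util.regex row, a small sketch of what "fix catastrophic backtracking; precompile patterns" can look like. Both patterns are hypothetical examples chosen for illustration. The rewritten one is intended to accept the same inputs (word runs separated by single whitespace characters) without nested quantifiers, and the risky one is kept only as a string so it is never compiled or run here.

```java
import java.util.regex.Pattern;

public class RegexHotPath {

    // Risky: nested quantifiers let the engine try exponentially many ways to split
    // the input when it almost matches (for example a long word followed by '!').
    static final String RISKY = "^(\\w+\\s?)+$";

    // Rewritten so each character can be consumed in only one way.
    // Compiled once as a constant instead of on every call.
    static final Pattern SAFE = Pattern.compile("^\\w+(\\s\\w+)*\\s?$");

    static boolean isWordList(String input) {
        return SAFE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String hostile = "a".repeat(5_000) + "!";
        System.out.println(isWordList("alpha beta gamma")); // true
        System.out.println(isWordList(hostile));            // false, and returns quickly
    }
}
```

One bad message like `hostile` is enough to pin a thread on the risky pattern, which is why this cause can appear with almost no traffic.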

How — fix, verify, guard

Fix by root cause

Verify the fix

  1. Record baseline CPU at current RPS (even if low).
  2. Deploy fix to canary; wait past JIT warm-up (5+ min).
  3. CPU at the same RPS should drop materially (e.g. from 90% to under 30% of the CPU limit).
  4. Re-profile 60s to confirm hot method gone.
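
The before/after comparison in the steps above is simple arithmetic on a cumulative CPU-seconds counter such as process_cpu_seconds_total. The sketch below works it through with made-up readings; the 5-minute window, the 2-core limit, and the counter values are assumptions for illustration only.

```java
public class CpuOfLimit {

    /**
     * Average CPU use over a window as a fraction of the container CPU limit.
     * The two cpuSeconds arguments are readings of a cumulative CPU-seconds counter.
     */
    static double fractionOfLimit(double cpuSecondsStart, double cpuSecondsEnd,
                                  double windowSeconds, double limitCores) {
        if (windowSeconds <= 0 || limitCores <= 0) {
            throw new IllegalArgumentException("window and limit must be positive");
        }
        double coresUsed = (cpuSecondsEnd - cpuSecondsStart) / windowSeconds;
        return coresUsed / limitCores;
    }

    public static void main(String[] args) {
        // Hypothetical readings: 300 s window, 2-core limit.
        double before = fractionOfLimit(1_000, 1_540, 300, 2.0); // 540 CPU-s / 300 s = 1.8 cores
        double after = fractionOfLimit(9_000, 9_150, 300, 2.0);  // 150 CPU-s / 300 s = 0.5 cores
        System.out.printf("before: %.0f%%, after: %.0f%%%n", before * 100, after * 100);
    }
}
```

Comparing both numbers at the same RPS keeps a traffic change from being mistaken for a fix.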

Kubernetes / autoscaling

Prevention

Interview one-liner

“Low RPS doesn’t mean idle—I check GC time, thread dumps for RUNNABLE hot loops, and a CPU profiler. Common wins are GC from leaks, DEBUG logging, tight polling schedulers, and heavy health checks; I verify with CPU at the same traffic after the fix.”

Related scenarios