Application slow after a few hours

Scenario

After a deploy or pod start, the service is fast. Two to six hours later, latency creeps up: p95 doubles, timeouts appear, and CPU or memory looks “off,” but traffic is unchanged. A restart fixes it temporarily. What do you check first, before another blind restart?

After reading, you should be able to:

Why — degradation over uptime is a different bug class

“Slow from second one” points to config, a cold code path, or undersized resources. Slow only after hours almost always means state is accumulating (in memory, pools, queues, disks, or downstream connections) until the system spends more time waiting or collecting garbage than doing work.

Common mechanisms (ranked by how often they appear)

| Mechanism | What grows | Typical symptom |
|---|---|---|
| Heap / memory leak | Live objects, caches | Longer GC pauses, then OOM |
| DB connection leak | Borrowed connections never returned | Requests block on getConnection(); pool “active” = max |
| Thread pool queue buildup | Pending tasks | Accept fast, work completes slowly; rejections later |
| File descriptor leak | Open sockets/files | “Too many open files”; weird network errors |
| HTTP client / downstream pool exhaustion | Stale connections, no eviction | Calls to other services time out in waves |
| Metaspace / classloader leak | Loaded classes | Full GC pauses; metaspace at limit after redeploys |
| Disk or log volume full | Log files | Blocking log writes; pod evictions |
| Scheduled / background job pile-up | Batch queue depth | CPU spikes on the hour; DB load follows |
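
To make the "DB connection leak" row concrete, here is a minimal sketch of how borrowed-but-never-returned connections pin a pool at its maximum. ToyPool, Lease, and ConnectionLeakSketch are hypothetical names invented for illustration; this is a semaphore-backed stand-in, not HikariCP or any real JDBC driver.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy stand-in for a connection pool (hypothetical; not a real pool API). */
final class ToyPool {
    private final Semaphore permits;
    private final int max;

    ToyPool(int max) {
        this.max = max;
        this.permits = new Semaphore(max);
    }

    /** Waits up to timeoutMs for a free slot; returns null on timeout or interrupt. */
    Lease borrow(long timeoutMs) {
        try {
            return permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS) ? new Lease(this) : null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** Number of leases currently borrowed. */
    int active() {
        return max - permits.availablePermits();
    }

    /** A borrowed "connection"; close() returns the slot at most once. */
    static final class Lease implements AutoCloseable {
        private final ToyPool pool;
        private boolean closed;

        Lease(ToyPool pool) {
            this.pool = pool;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                pool.permits.release();
            }
        }
    }
}

final class ConnectionLeakSketch {
    /** Leaky path: borrow without closing. True if the pool ends up stuck at max. */
    static boolean leakExhaustsPool(int max) {
        ToyPool pool = new ToyPool(max);
        for (int i = 0; i < max; i++) {
            pool.borrow(10); // lease is dropped without close(): the leak
        }
        return pool.active() == max && pool.borrow(10) == null;
    }

    /** Safe path: try-with-resources closes every lease. Returns active count afterwards. */
    static int activeAfterSafeUse(int max, int requests) {
        ToyPool pool = new ToyPool(max);
        for (int i = 0; i < requests; i++) {
            try (ToyPool.Lease lease = pool.borrow(10)) {
                // real code would run its query here
            }
        }
        return pool.active();
    }

    public static void main(String[] args) {
        System.out.println("leak exhausts pool: " + leakExhaustsPool(2));
        System.out.println("active after safe use: " + activeAfterSafeUse(2, 50));
    }
}
```

In the leaky path the pool reports active = max and the next borrow times out, which matches the symptom in the table. In the safe path the active count returns to zero however many requests run.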

Restart masks the cause: If performance resets on pod restart but returns after fixed uptime, treat it as accumulation—capture metrics and dumps before the next restart.

What it is usually not

What — check first (in order)

Build a timeline from process start (T+0) to slow (T+N hours). Overlay deploy time, traffic, and cron schedules.

  1. Confirm “uptime vs latency” correlation. Plot p95 latency against pod age / process_uptime_seconds. Flat traffic plus latency that rises with uptime is a strong accumulation signal.
  2. JVM heap and GC (5 minutes). Is old gen after GC trending up? Are GC pause count and max pause increasing? If so, follow the leak path or the tuning path.
    jstat -gcutil <pid> 5000   # watch O (old gen), FGC, FGCT every 5s
  3. Thread pool and Tomcat / Jetty queues. Active threads = max? Queue size growing? Blocked thread count up? In a thread dump, look for most threads waiting on the same lock or pool.
  4. Database connection pool (HikariCP, etc.). Metrics: hikaricp.connections.active, pending, timeout total. Active stuck at maximum for minutes means a leak or slow queries holding connections.
  5. File descriptors and sockets.
    ls /proc/<pid>/fd | wc -l
    lsof -p <pid> | wc -l
    A steady climb over hours means a socket/file leak (HTTP clients, unclosed streams, DB cursors).
  6. Downstream dependency latency. In traces, did DB or HTTP client p95 grow with uptime? Is one service red in the service mesh graph?
  7. CPU profile snapshot at “slow” time. Are GC threads dominating? Is one method hot due to a full table scan? Compare the profile at T+30m vs T+4h.
  8. Disk, logs, and cron. Run df on the log volume; check whether @Scheduled jobs, Quartz, or K8s CronJobs align with the slowdown.
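
Steps 1, 2, and 5 all come down to asking whether a number climbs with uptime. The sketch below is one possible way to read two of those numbers from inside the JVM and to quantify a trend with a least-squares slope. UptimeTrendSketch and its method names are hypothetical; the MXBean calls are standard JDK management APIs, and the file descriptor count is only available on Unix-like JVMs (the sketch returns -1 elsewhere).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

final class UptimeTrendSketch {
    /** Seconds since the JVM started. */
    static double uptimeSeconds() {
        return ManagementFactory.getRuntimeMXBean().getUptime() / 1000.0;
    }

    /** Heap bytes still in use after the most recent collection of each heap pool. */
    static long heapUsedAfterLastGc() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not report it
            if (pool.getType() == MemoryType.HEAP && afterGc != null) {
                total += afterGc.getUsed();
            }
        }
        return total;
    }

    /** Open file descriptors for this process, or -1 when the platform bean is not available. */
    static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1;
    }

    /** Least-squares slope of y against x, in units of y per unit of x. */
    static double slope(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two or more paired samples");
        }
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        return sxx == 0 ? 0 : sxy / sxx;
    }

    public static void main(String[] args) {
        // Made-up illustration data: p95 latency (ms) sampled once per hour of uptime.
        double[] uptime = {0, 3600, 7200, 10800};
        double[] p95 = {100, 150, 200, 250};
        System.out.printf("p95 grows %.4f ms per second of uptime%n", slope(uptime, p95));
        System.out.println("heap after GC: " + heapUsedAfterLastGc() + " bytes");
        System.out.println("open FDs: " + openFileDescriptors());
    }
}
```

A slope near zero at flat traffic suggests there is no accumulation in that metric. A clearly positive slope for heap after GC or for the FD count points at the matching row of the mechanism table.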

Capture before restart (golden hour)

| Artifact | Command / source |
|---|---|
| Heap dump | jcmd <pid> GC.heap_dump /dumps/slow.hprof |
| Thread dump ×2 (30s apart) | jcmd <pid> Thread.print |
| GC log excerpt | Last 200 lines around slowdown |
| Pool metrics export | Prometheus range query 6h window |
| Slow query log | DB top queries by total time |
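
When a shell on the pod is not available, a thread dump can also be taken from inside the process. The following is a minimal sketch that uses the standard ThreadMXBean and then summarizes the dump by thread state and by contended lock. ThreadDumpSummarySketch and its method names are hypothetical, and this complements the jcmd output rather than replacing it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

final class ThreadDumpSummarySketch {
    /** Snapshot of all live threads, without monitor or synchronizer details. */
    static ThreadInfo[] capture() {
        return ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
    }

    /** How many threads are in each state (RUNNABLE, BLOCKED, WAITING, ...). */
    static Map<Thread.State, Integer> countByState(ThreadInfo[] dump) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : dump) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** How many threads are blocked or waiting on each lock object. */
    static Map<String, Integer> waitersByLock(ThreadInfo[] dump) {
        Map<String, Integer> waiters = new HashMap<>();
        for (ThreadInfo info : dump) {
            if (info != null && info.getLockName() != null) {
                waiters.merge(info.getLockName(), 1, Integer::sum);
            }
        }
        return waiters;
    }

    public static void main(String[] args) {
        ThreadInfo[] dump = capture();
        System.out.println("by state: " + countByState(dump));
        System.out.println("waiters by lock: " + waitersByLock(dump));
    }
}
```

Comparing two summaries taken about 30 seconds apart shows whether the same threads stay parked on the same lock, which is the pattern to look for before restarting.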

Thread dump patterns to recognize

How — fix by root cause and prevent recurrence

Memory leak / GC pressure
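
One frequent source of slow heap growth is a cache with no size bound. As a hedged illustration (not necessarily the fix for any particular service), the sketch below caps a cache with LinkedHashMap in access order so that the least recently used entry is evicted. BoundedCacheSketch is a hypothetical name, and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** LRU cache with a hard entry limit (illustrative; wrap or replace for concurrent use). */
final class BoundedCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    BoundedCacheSketch(int maxEntries) {
        super(16, 0.75f, true); // true = access order, so iteration starts at the LRU entry
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insert; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCacheSketch<Integer, String> cache = new BoundedCacheSketch<>(100);
        for (int i = 0; i < 1000; i++) {
            cache.put(i, "value-" + i);
        }
        System.out.println("entries kept: " + cache.size());
    }
}
```

With a plain HashMap the same loop would keep every entry for the life of the process, which is the kind of growth that shows up as rising old gen after GC.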

Connection leak (JDBC)

HTTP client / socket leak

Thread pool exhaustion
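
An executor with an unbounded queue accepts work indefinitely and hides the backlog until latency has already degraded. One common guard, sketched here under the assumption that rejecting early is acceptable for the workload, is a fixed-size pool with a bounded queue and an explicit rejection policy. BoundedExecutorSketch and its method names are hypothetical; the executor classes are from java.util.concurrent.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedExecutorSketch {
    /** Fixed-size pool whose queue holds at most queueCapacity pending tasks. */
    static ThreadPoolExecutor newBounded(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy()); // throw instead of queueing without limit
    }

    /** Submits tasks that block until released and returns how many were rejected. */
    static int countRejections(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = newBounded(threads, queueCapacity);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // simulate slow work holding a thread
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // pool and queue are both full
                }
            }
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + countRejections(2, 3, 10));
    }
}
```

With 2 threads and a queue of 3, only 5 of 10 slow tasks are accepted and the rest fail fast. Callers then see a clear overload signal instead of queue depth that grows silently for hours.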

Background job pile-up

Operational guardrails

When restart is acceptable

Restart to restore the SLA only after capturing dumps and metrics. Schedule the fix before the next deploy; do not rely on a cron-restart as a permanent solution unless you document it as technical debt.

Interview one-liner

“I plot latency against process uptime at flat traffic, then check heap after GC, connection pool active count, thread pool queue, and FD count over time, capturing heap and thread dumps before any restart. The root cause is usually a memory leak, a pool leak, or an unbounded queue, so the fix is rarely just scaling out.”
