Application slow after a few hours
Scenario
After deploy or pod start, the service is fast. Two to six hours later, latency creeps up—p95 doubles, timeouts appear, CPU or memory look “off,” but traffic is unchanged. Restart fixes it temporarily. What do you check first before another blind restart?
After reading, you should be able to:
- Separate gradual degradation causes (leak, pool exhaustion) from daily traffic patterns.
- Run an ordered triage across JVM, pools, threads, and dependencies.
- Correlate metrics over uptime, not just instantaneous dashboards.
- Know when to move on to the deep-dive guides for memory leaks, GC pauses, and DB pools.
Why — degradation over uptime is a different bug class
“Slow from second one” points to config, cold code path, or undersized resources. Slow only after hours almost always means state accumulates—in memory, pools, queues, disks, or downstream connections—until the system spends more time waiting or collecting garbage than doing work.
Common mechanisms (ranked by how often they appear)
| Mechanism | What grows | Typical symptom |
|---|---|---|
| Heap / memory leak | Live objects, caches | Longer GC pauses, then OOM |
| DB connection leak | Borrowed connections never returned | Requests block on getConnection(); pool “active” = max |
| Thread pool queue buildup | Pending tasks | Accept fast, work completes slowly; rejections later |
| File descriptor leak | Open sockets/files | “Too many open files”; weird network errors |
| HTTP client / downstream pool exhaustion | Stale connections, no eviction | Calls to other services time out in waves |
| Metaspace / classloader leak | Loaded classes | Full GC pauses; metaspace at limit after redeploys |
| Disk or log volume full | Log files | Blocking log writes; pod evictions |
| Scheduled / background job pile-up | Batch queue depth | CPU spikes on the hour; DB load follows |
Restart masks the cause: If performance resets on pod restart but returns after fixed uptime, treat it as accumulation—capture metrics and dumps before the next restart.
What it is usually not
- Peak traffic at the end of the business day (correlate with RPS: if traffic rises together with latency, the problem is capacity, not a leak).
- One slow dependency affecting all requests (that would be slow from minute one, unless the dependency itself degrades independently).
- A need for more pods, at least not before checking per-pod health (scaling out leaky pods only delays the failure).
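The capacity-versus-accumulation distinction above can be checked numerically once the series are exported. The following is a minimal sketch, assuming three equally sampled series (p95 latency, RPS, uptime). The class name, the `likelyCause` helper, and the 0.8 threshold are illustrative assumptions, not values from this guide.

```java
final class LatencyCorrelation {
    /** Pearson correlation of two equally long series; 0.0 when either series is constant. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length >= 2");
        }
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        double denom = Math.sqrt(sxx * syy);
        return denom == 0 ? 0.0 : sxy / denom;
    }

    /** Rough label only; the 0.8 cut-off is an arbitrary assumption for illustration. */
    static String likelyCause(double[] latencyMs, double[] rps, double[] uptimeHours) {
        double withTraffic = pearson(latencyMs, rps);
        double withUptime = pearson(latencyMs, uptimeHours);
        if (withUptime > 0.8 && withUptime > withTraffic) return "accumulation";
        if (withTraffic > 0.8) return "capacity";
        return "inconclusive";
    }
}
```

A label of "accumulation" is only a prompt to continue with the triage below; correlation on a handful of samples proves nothing by itself.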
What — check first (in order)
Build a timeline from process start (T+0) to slow (T+N hours). Overlay deploy time, traffic, and cron schedules.
- Confirm “uptime vs latency” correlation
  Plot p95 latency vs pod age / process_uptime_seconds. Flat traffic + rising latency vs uptime = strong accumulation signal.
- JVM heap and GC (5 minutes)
  Old gen after GC trending up? GC pause count and max increasing? → leak path or tuning path.

      jstat -gcutil <pid> 5000   # watch O (old gen), FGC, FGCT every 5s

- Thread pool and Tomcat / Jetty queues
  Active threads = max? Growing queue size? Blocked thread count up? Thread dump: most threads waiting on same lock or pool.
- Database connection pool (HikariCP, etc.)
  Metrics: hikaricp.connections.active, pending, timeout total. Active stuck at maximum for minutes → leak or slow queries holding connections.
- File descriptors and sockets

      ls /proc/<pid>/fd | wc -l
      lsof -p <pid> | wc -l

  Steady climb over hours → socket/file leak (HTTP clients, unclosed streams, DB cursors).
- Downstream dependency latency
  Traces: did DB or HTTP client p95 grow with uptime? One service red in service mesh graph.
- CPU profile snapshot at “slow” time
  GC threads dominating? One method hot due to full table scan? Compare profile at T+30m vs T+4h.
- Disk, logs, and cron
  df on log volume; check @Scheduled jobs, Quartz, K8s CronJobs aligned with slowdown.
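Several of the checks above (heap after GC, GC time, thread count, open file descriptors) can also be sampled from inside the process and logged against uptime. The following is a minimal sketch using the JDK's standard management beans. The `UptimeSnapshot` class and its field names are hypothetical. The "heap used after GC" figure is an approximation that sums whatever post-collection usage each heap pool reports.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

final class UptimeSnapshot {
    final long uptimeSeconds;
    final long heapUsedAfterGcBytes; // summed over heap pools that report post-GC usage
    final long gcCount;
    final long gcTimeMillis;
    final int liveThreads;
    final long openFileDescriptors;  // -1 when the Unix-specific bean is unavailable

    private UptimeSnapshot(long uptimeSeconds, long heapUsedAfterGcBytes, long gcCount,
                           long gcTimeMillis, int liveThreads, long openFileDescriptors) {
        this.uptimeSeconds = uptimeSeconds;
        this.heapUsedAfterGcBytes = heapUsedAfterGcBytes;
        this.gcCount = gcCount;
        this.gcTimeMillis = gcTimeMillis;
        this.liveThreads = liveThreads;
        this.openFileDescriptors = openFileDescriptors;
    }

    static UptimeSnapshot take() {
        long afterGc = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getCollectionUsage(); // null if the pool does not support it
            if (pool.getType() == MemoryType.HEAP && usage != null) {
                afterGc += usage.getUsed();
            }
        }
        long count = 0, time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount()); // -1 means "undefined"
            time += Math.max(0, gc.getCollectionTime());
        }
        long fds = -1;
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            fds = ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return new UptimeSnapshot(
                ManagementFactory.getRuntimeMXBean().getUptime() / 1000,
                afterGc, count, time,
                ManagementFactory.getThreadMXBean().getThreadCount(),
                fds);
    }

    @Override
    public String toString() {
        return "uptime=" + uptimeSeconds + "s heapAfterGc=" + heapUsedAfterGcBytes
                + "B gcCount=" + gcCount + " gcTime=" + gcTimeMillis + "ms threads="
                + liveThreads + " fds=" + openFileDescriptors;
    }
}
```

Logging `UptimeSnapshot.take()` every few minutes gives a per-pod series that can be lined up with latency. It complements `jstat` and `/proc/<pid>/fd` but does not replace them.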
Capture before restart (golden hour)
| Artifact | Command / source |
|---|---|
| Heap dump | jcmd <pid> GC.heap_dump /dumps/slow.hprof |
| Thread dump ×2 (30s apart) | jcmd <pid> Thread.print |
| GC log excerpt | Last 200 lines around slowdown |
| Pool metrics export | Prometheus range query 6h window |
| Slow query log | DB top queries by total time |
Thread dump patterns to recognize
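- All pool workers busy in application frames (none parked in java.util.concurrent.ThreadPoolExecutor.getTask, which is the idle state) while callers await the pool: pool saturated.
- Many in HikariDataSource.getConnection: DB pool wait.
- Many in sun.misc.Unsafe.park (jdk.internal.misc.Unsafe.park on newer JDKs): threads parked on a java.util.concurrent lock or queue; read the frames underneath to see which one. A long GC pause does not show up this way, so confirm it from the GC log.
- Same custom lock: lock contention bug worsening as parallelism increases.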
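Spotting these patterns by eye is slow when a dump has hundreds of threads. The following is a rough sketch that groups live threads by their first frame outside the JDK, so a pile-up in one place stands out. The `StackHistogram` class is hypothetical, and the package-prefix filter is a simplifying assumption.

```java
import java.util.Map;
import java.util.TreeMap;

final class StackHistogram {
    /** First frame outside JDK packages, or the top frame when the stack is JDK-only. */
    static String firstAppFrame(StackTraceElement[] stack) {
        for (StackTraceElement frame : stack) {
            String cls = frame.getClassName();
            boolean jdk = cls.startsWith("java.") || cls.startsWith("javax.")
                    || cls.startsWith("jdk.") || cls.startsWith("sun.");
            if (!jdk) {
                return cls + "." + frame.getMethodName();
            }
        }
        if (stack.length == 0) {
            return "<no frames>";
        }
        return stack[0].getClassName() + "." + stack[0].getMethodName();
    }

    /** Counts live threads per frame label, using an in-process snapshot of all stacks. */
    static Map<String, Integer> histogram() {
        Map<String, Integer> counts = new TreeMap<>();
        for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
            counts.merge(firstAppFrame(stack), 1, Integer::sum);
        }
        return counts;
    }
}
```

Taking two histograms 30 seconds apart and comparing them mirrors the "thread dump ×2" advice: a label whose count stays high in both is a likely wait point.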
How — fix by root cause and prevent recurrence
Memory leak / GC pressure
- Follow the memory leak compare-dump workflow; fix the retaining reference.
- After fix, 4–8 hour soak test at flat load; old gen should plateau.
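"Should plateau" can be made concrete by fitting a trend line to old gen after GC samples taken at equal intervals during the soak test. The following is a minimal sketch. The `SoakTrend` class and the growth threshold passed to `plateaued` are assumptions to tune per service.

```java
final class SoakTrend {
    /** Least-squares slope of equally spaced samples, in units per sample interval. */
    static double slopePerSample(double[] y) {
        int n = y.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least 2 samples");
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (double v : y) {
            meanY += v;
        }
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (i - meanX) * (y[i] - meanY);
            sxx += (i - meanX) * (i - meanX);
        }
        return sxy / sxx;
    }

    /** True when old gen after GC grows no faster than the allowed rate. */
    static boolean plateaued(double[] oldGenAfterGcMb, double maxGrowthMbPerSample) {
        return slopePerSample(oldGenAfterGcMb) <= maxGrowthMbPerSample;
    }
}
```

Dropping the warm-up samples before fitting avoids mistaking cache fill for a leak.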
Connection leak (JDBC)
- Always use try-with-resources or framework templates (@Transactional) so connections close.
- Enable leak detection in staging: Hikari leakDetectionThreshold=2000 (ms).
- Log stack trace when connection held too long (dev/stage only).
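To see why try-with-resources matters, here is a self-contained sketch that swaps a real JDBC pool for a tiny semaphore-backed stand-in. `LeasePool`, `Lease`, and both query methods are hypothetical names. With a real `DataSource` the same shape applies to `Connection`, `PreparedStatement`, and `ResultSet`.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

final class LeasePool {
    private final Semaphore permits;

    LeasePool(int size) {
        this.permits = new Semaphore(size);
    }

    /** One borrowed slot; close() hands it back exactly once. */
    final class Lease implements AutoCloseable {
        private boolean returned;

        @Override
        public void close() {
            if (!returned) {
                returned = true;
                permits.release();
            }
        }
    }

    Lease borrow(long timeoutMillis) throws InterruptedException {
        if (!permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("pool exhausted after " + timeoutMillis + " ms");
        }
        return new Lease();
    }

    int idle() {
        return permits.availablePermits();
    }

    /** Leaky: the exception skips close(), so the slot is never returned. */
    static void leakyQuery(LeasePool pool) throws Exception {
        Lease lease = pool.borrow(50);
        failingWork();
        lease.close(); // not reached when failingWork() throws
    }

    /** Safe: try-with-resources closes the lease on every exit path. */
    static void safeQuery(LeasePool pool) throws Exception {
        try (Lease lease = pool.borrow(50)) {
            failingWork();
        }
    }

    private static void failingWork() throws Exception {
        throw new Exception("simulated query failure");
    }
}
```

Each failure in the leaky variant permanently removes one slot. That matches the "active = max" symptom: the pool empties after enough errors, however low the traffic.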
HTTP client / socket leak
- Single shared HttpClient with connection pool; set idle eviction and timeouts.
- Close response bodies: try (Response r = client.newCall(req).execute()) { ... }.
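The snippet above appears to follow OkHttp's API. As a dependency-free illustration of the "one shared client, explicit timeouts" idea, here is a sketch using the JDK's own `java.net.http` client instead. The `SharedHttp` class and the 2 s / 5 s values are assumptions, not recommendations from this guide.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class SharedHttp {
    /** One client for the whole process, so connections are pooled and reused. */
    static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2)) // assumed value; bounds connection setup
            .build();

    /** Builds a GET with a per-request timeout so a slow downstream cannot hold a caller forever. */
    static HttpRequest get(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))    // assumed value; bounds the wait for the response
                .GET()
                .build();
    }
}
```

If a streaming body handler such as `HttpResponse.BodyHandlers.ofInputStream()` is used with this client, the caller still has to close the stream, for the same reason the OkHttp-style example closes `Response`.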
Thread pool exhaustion
- Bounded queues + CallerRunsPolicy or reject with 503; never unbounded LinkedBlockingQueue on request path without monitoring.
- Separate pools for HTTP workers vs batch jobs.
- Alert on queue depth and pool active = max for > 1 min.
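A bounded pool with `CallerRunsPolicy` can be sketched with the standard `ThreadPoolExecutor`. The `BoundedPools` class, its sizes, and the `saturated` helper are illustrative assumptions. The small demo method shows the back-pressure effect: once the thread and the queue slot are taken, the next task runs on the submitting thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPools {
    /** Fixed-size pool with a bounded queue; overflow runs on the caller instead of piling up. */
    static ThreadPoolExecutor requestPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** The condition worth alerting on: every worker busy and no queue space left. */
    static boolean saturated(ThreadPoolExecutor pool) {
        return pool.getActiveCount() >= pool.getMaximumPoolSize()
                && pool.getQueue().remainingCapacity() == 0;
    }

    /** Demo: with 1 thread and 1 queue slot, the third task runs on the submitting thread. */
    static boolean overflowRunsOnCaller() throws InterruptedException {
        ThreadPoolExecutor pool = requestPool(1, 1);
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            pool.execute(() -> {                 // occupies the only worker
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            started.await();
            pool.execute(() -> { });             // fills the only queue slot
            boolean full = saturated(pool);
            Thread[] ranOn = new Thread[1];
            pool.execute(() -> ranOn[0] = Thread.currentThread()); // rejected, so the caller runs it
            return full && ranOn[0] == Thread.currentThread();
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }
}
```

Caller-runs slows the producer down, which is usually what you want on a request path. If the caller is an I/O event-loop thread, rejecting with a 503 is the safer choice.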
Background job pile-up
- Rate-limit schedulers; skip if previous run still active (@DisallowConcurrentExecution).
- Move heavy jobs to worker tier with its own scaling.
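Outside Quartz, the "skip if previous run still active" rule can be approximated with a small guard around the job. This is a sketch only. `SkipIfRunning` is a hypothetical wrapper, not a library class, and it guards within a single JVM rather than across pods.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

final class SkipIfRunning implements Runnable {
    private final AtomicBoolean running = new AtomicBoolean(false);
    private final AtomicLong skipped = new AtomicLong();
    private final Runnable job;

    SkipIfRunning(Runnable job) {
        this.job = job;
    }

    @Override
    public void run() {
        // Only one caller wins the flag; any overlapping trigger returns immediately.
        if (!running.compareAndSet(false, true)) {
            skipped.incrementAndGet();
            return;
        }
        try {
            job.run();
        } finally {
            running.set(false); // always release, even if the job throws
        }
    }

    /** Number of triggers dropped because a run was already in progress. */
    long skippedCount() {
        return skipped.get();
    }
}
```

Exporting `skippedCount()` as a metric shows when a job regularly outlasts its schedule. That is an early sign of the pile-up described above.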
Operational guardrails
- Dashboard: latency p95 vs pod uptime (derivation or recording rule).
- Alert: latency +30% vs 1h ago while RPS within 10% → “degradation” page.
- Periodic soak test in CI/staging matching prod heap and pool sizes.
- Runbook: “slow after hours” → checklist links to this page; no restart until dumps captured.
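The degradation alert above reduces to a two-part predicate. Here is a minimal sketch of that rule as a pure function, assuming the four inputs come from your metrics store. The `DegradationRule` class and its parameter names are hypothetical.

```java
final class DegradationRule {
    /**
     * True when p95 latency is at least 30% above its value one hour ago
     * while RPS stayed within 10% of its value one hour ago.
     */
    static boolean degraded(double p95Now, double p95HourAgo, double rpsNow, double rpsHourAgo) {
        if (p95HourAgo <= 0 || rpsHourAgo <= 0) {
            return false; // no usable baseline, so do not page
        }
        boolean latencyUp = p95Now >= 1.30 * p95HourAgo;
        boolean trafficFlat = Math.abs(rpsNow - rpsHourAgo) <= 0.10 * rpsHourAgo;
        return latencyUp && trafficFlat;
    }
}
```

In practice the same condition would normally live in the alerting system. Writing it out as code mainly helps with agreeing on the thresholds and unit-testing the edge cases.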
When restart is acceptable
Restart to restore the SLA only after capturing dumps and metrics. Schedule the fix before the next deploy; do not rely on a cron-driven restart as a permanent solution unless you document it as technical debt.
Interview one-liner
“I plot latency against process uptime at flat traffic, then check heap after GC, connection pool active count, thread pool queue, and FD count over time—capturing heap and thread dumps before restart. Fixes are usually leak, pool leak, or unbounded queue—not just scale out.”