Application slow after a few hours

Scenario

After a deploy or pod start, the service is fast. Two to six hours later, latency creeps up: p95 doubles, timeouts appear, and CPU or memory look “off,” but traffic is unchanged. A restart fixes it temporarily. What do you check first, before another blind restart?

After reading, you should be able to:

Separate gradual degradation causes (leak, pool exhaustion) from daily traffic patterns.
Run an ordered triage across JVM, pools, threads, and dependencies.
Correlate metrics over uptime, not just instantaneous dashboards.
Know when to move on to the memory leak, GC pause, and DB pool deep-dive guides.

Why — degradation over uptime is a different bug class

“Slow from second one” points to config, cold code path, or undersized resources. Slow only after hours almost always means state accumulates—in memory, pools, queues, disks, or downstream connections—until the system spends more time waiting or collecting garbage than doing work.

Common mechanisms (ranked by how often they appear)

Mechanism	What grows	Typical symptom
Heap / memory leak	Live objects, caches	Longer GC pauses, then OOM
DB connection leak	Borrowed connections never returned	Requests block on `getConnection()`; pool “active” = max
Thread pool queue buildup	Pending tasks	Accept fast, work completes slowly; rejections later
File descriptor leak	Open sockets/files	“Too many open files”; weird network errors
HTTP client / downstream pool exhaustion	Stale connections, no eviction	Calls to other services time out in waves
Metaspace / classloader leak	Loaded classes	Full GC pauses; metaspace at limit after redeploys
Disk or log volume full	Log files	Blocking log writes; pod evictions
Scheduled / background job pile-up	Batch queue depth	CPU spikes on the hour; DB load follows

Restart masks the cause: If performance resets on pod restart but returns after fixed uptime, treat it as accumulation—capture metrics and dumps before the next restart.

What it is usually not

Peak traffic at the end of the business day. Correlate RPS: if traffic rises together with latency, the problem is capacity, not a leak.
One slow dependency affecting all requests. That would be slow from minute one, unless the dependency itself degrades independently.
A need for more pods. Check per-pod health first; scaling out leaky pods only delays the failure.

What — check first (in order)

Build a timeline from process start (T+0) to slow (T+N hours). Overlay deploy time, traffic, and cron schedules.

Confirm “uptime vs latency” correlation: Plot p95 latency vs pod age / process_uptime_seconds. Flat traffic + rising latency vs uptime = strong accumulation signal.
JVM heap and GC (5 minutes): Is old gen after GC trending up? Are GC pause count and max pause increasing? If so, take the leak path or the GC tuning path.
```
jstat -gcutil <pid> 5000   # watch O (old gen), FGC, FGCT every 5s
```
Thread pool and Tomcat / Jetty queues: Active threads = max? Growing queue size? Blocked thread count up? In a thread dump, check whether most threads are waiting on the same lock or pool.
Database connection pool (HikariCP, etc.): Watch the metrics hikaricp.connections.active, pending, and timeout total. Active stuck at the maximum for minutes means a leak or slow queries holding connections.
File descriptors and sockets:
```
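ls /proc/<pid>/fd | wc -l   # open descriptors for the process (Linux)
lsof -p <pid> | wc -l       # also counts a header line and mapped files; use it for the trend, not the exact number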
```
A steady climb over hours points to a socket or file leak (HTTP clients, unclosed streams, DB cursors).
Downstream dependency latency: In traces, did DB or HTTP client p95 grow with uptime? Is one service red in the service mesh graph?
CPU profile snapshot at “slow” time: Are GC threads dominating? Is one method hot, for example because of a full table scan? Compare profiles at T+30m and T+4h.
Disk, logs, and cron: Run df on the log volume; check whether @Scheduled jobs, Quartz, or K8s CronJobs line up with the slowdown.
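
Several of the checks above (heap after GC, GC counts, thread count, descriptor count) can be sampled from inside the JVM and logged against uptime. The following is a minimal sketch using only JDK management beans; the class name `UptimeSnapshot` and the log format are illustrative assumptions, and summing the heap pools is only a rough stand-in for the live set because pool names and behaviour vary by collector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.util.Locale;

public final class UptimeSnapshot {

    /** Builds one "state vs uptime" line, suitable for logging once a minute. */
    static String sample() {
        long uptimeSec = ManagementFactory.getRuntimeMXBean().getUptime() / 1000;

        // Usage measured right after the most recent collection of each heap pool.
        long heapAfterGcBytes = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not support it
            if (pool.getType() == MemoryType.HEAP && afterGc != null) {
                heapAfterGcBytes += afterGc.getUsed();
            }
        }

        long gcCount = 0;
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcCount += Math.max(0, gc.getCollectionCount()); // -1 means "undefined"
            gcMillis += Math.max(0, gc.getCollectionTime());
        }

        int threads = ManagementFactory.getThreadMXBean().getThreadCount();

        // Open file descriptors are only exposed on Unix-like JVMs; -1 means "not available".
        long openFds = -1;
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            openFds = ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }

        return String.format(Locale.ROOT,
                "uptime_s=%d heap_after_gc_bytes=%d gc_count=%d gc_ms=%d threads=%d open_fds=%d",
                uptimeSec, heapAfterGcBytes, gcCount, gcMillis, threads, openFds);
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```

Logging this line on a fixed interval gives a per-pod series that can be plotted against uptime even when the metrics stack is missing one of these signals.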

Capture before restart (golden hour)

Artifact	Command / source
Heap dump	`jcmd <pid> GC.heap_dump /dumps/slow.hprof`
Thread dump ×2 (30s apart)	`jcmd <pid> Thread.print`
GC log excerpt	Last 200 lines around slowdown
Pool metrics export	Prometheus range query 6h window
Slow query log	DB top queries by total time

Thread dump patterns to recognize

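Many threads in java.util.concurrent.ThreadPoolExecutor.getTask: idle workers waiting for tasks, so that pool has spare capacity. A saturated pool shows the opposite: no workers in getTask, all of them inside application, lock, or I/O frames.
Many in HikariDataSource.getConnection: threads waiting for a DB connection from the pool.
Many in Unsafe.park (sun.misc.Unsafe.park, or jdk.internal.misc.Unsafe.park on newer JDKs): threads parked on a java.util.concurrent lock, condition, or queue. Read the caller frames just below park in the dump to see which one. Parking alone does not indicate a GC pause; confirm long pauses from the GC log or jstat.
Many blocked on the same custom lock: a lock contention bug that worsens as parallelism increases.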
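
When a dump has hundreds of threads, a histogram of where they are waiting makes these patterns easier to spot. The sketch below is one possible approach using the JDK's `ThreadMXBean`; the class name `TopFrameHistogram` and the choice to key each thread on its first frame outside the JDK are assumptions made for illustration, not a standard tool.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public final class TopFrameHistogram {

    private static boolean isJdkFrame(StackTraceElement frame) {
        String c = frame.getClassName();
        return c.startsWith("java.") || c.startsWith("javax.")
                || c.startsWith("jdk.") || c.startsWith("sun.");
    }

    /** "STATE at Class.method", using the first non-JDK frame, else the top frame. */
    static String keyFor(ThreadInfo info) {
        StackTraceElement[] stack = info.getStackTrace();
        if (stack.length == 0) {
            return info.getThreadState() + " at <no frames>";
        }
        StackTraceElement chosen = stack[0];
        for (StackTraceElement frame : stack) {
            if (!isJdkFrame(frame)) {
                chosen = frame;
                break;
            }
        }
        return info.getThreadState() + " at " + chosen.getClassName() + "." + chosen.getMethodName();
    }

    /** Counts live threads per key, most common first. */
    static Map<String, Integer> summarize() {
        Map<String, Integer> counts = new TreeMap<>();
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(keyFor(info), 1, Integer::sum);
            }
        }
        Map<String, Integer> sorted = new LinkedHashMap<>();
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .forEach(e -> sorted.put(e.getKey(), e.getValue()));
        return sorted;
    }

    public static void main(String[] args) {
        summarize().forEach((key, n) -> System.out.println(n + "  " + key));
    }
}
```

Threads whose whole stack is JDK code (idle pool workers, for example) fall back to their top frame, so a large count at a park frame usually just means idle capacity.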

How — fix by root cause and prevent recurrence

Memory leak / GC pressure

Follow the memory leak compare-dump workflow and fix the retaining reference.
After the fix, run a 4–8 hour soak test at flat load; old gen after GC should plateau.
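
“Should plateau” can be turned into a pass/fail check for the soak test by fitting a line through the post-GC old gen samples. This is a minimal sketch under stated assumptions: the class name `SoakTrend`, the sample units (hours and MB), and the 1% per hour tolerance used in the usage note are all illustrative choices, not values from a standard.

```java
public final class SoakTrend {

    /** Least-squares slope of y over x, in y-units per x-unit. */
    static double slope(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double num = 0;
        double den = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - meanX) * (y[i] - meanY);
            den += (x[i] - meanX) * (x[i] - meanX);
        }
        if (den == 0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }
        return num / den;
    }

    /** True when growth per hour stays within the tolerated fraction of the mean level. */
    static boolean hasPlateaued(double[] hours, double[] oldGenAfterGcMb, double maxGrowthFractionPerHour) {
        double mean = 0;
        for (double v : oldGenAfterGcMb) {
            mean += v;
        }
        mean /= oldGenAfterGcMb.length;
        return slope(hours, oldGenAfterGcMb) <= mean * maxGrowthFractionPerHour;
    }
}
```

For example, samples of 400, 450, 500, 550, 600 MB over hours 0 to 4 grow by 50 MB per hour and fail a 1% per hour tolerance, while samples that hover around 400 MB pass.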

Connection leak (JDBC)

Always use try-with-resources or framework-managed access (templates, @Transactional) so connections are returned to the pool.
Enable leak detection in staging: Hikari leakDetectionThreshold=2000 (ms).
Log the stack trace when a connection is held too long (dev/stage only).
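
The try-with-resources pattern for plain JDBC can be sketched as follows. The class name `OrderCounts`, the `orders` table, and the query are hypothetical and exist only for illustration; the point is that the connection, statement, and result set are all closed on every path, including when an exception is thrown.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public final class OrderCounts {

    // Hypothetical query and table name, for illustration only.
    private static final String SQL = "SELECT COUNT(*) FROM orders WHERE customer_id = ?";

    /** Returns the count, or 0 if the query yields no row. */
    static long countOrders(DataSource dataSource, long customerId) throws SQLException {
        // Resources are closed in reverse order when the block exits,
        // so the connection goes back to the pool even if the query fails.
        try (Connection conn = dataSource.getConnection();
             PreparedStatement stmt = conn.prepareStatement(SQL)) {
            stmt.setLong(1, customerId);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }
}
```

With a pooled `DataSource`, closing the connection returns it to the pool rather than closing the socket, which is why a missing close shows up as “active = max” hours later.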

HTTP client / socket leak

Use a single shared HttpClient with a connection pool; set idle eviction and timeouts.
Close response bodies, for example with OkHttp: try (Response r = client.newCall(req).execute()) { ... }.
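
As a dependency-free illustration of the shared-client rule, the sketch below uses the JDK's built-in `java.net.http.HttpClient` as a stand-in for whichever client the service actually uses. The class name `Downstream` and the 2 s / 5 s timeout values are assumptions chosen for the example, not recommendations.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public final class Downstream {

    // One client per process: the client owns the connection pool, so building
    // a new client per request defeats connection reuse.
    static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2)) // illustrative value
            .build();

    static HttpRequest request(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5)) // illustrative value; bounds the wait for a response
                .GET()
                .build();
    }

    /** Reads the whole body as a string, so no stream is left open for the caller to close. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpResponse<String> response = CLIENT.send(request(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

If a streaming body handler such as `BodyHandlers.ofInputStream()` is used instead, the caller has to close the stream, which is the JDK-client equivalent of closing the response body.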

Thread pool exhaustion

Use bounded queues with CallerRunsPolicy, or reject with 503; never put an unbounded LinkedBlockingQueue on the request path without monitoring.
Use separate pools for HTTP workers and batch jobs.
Alert on queue depth and on pool active = max for > 1 min.
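
A bounded pool that rejects instead of queueing without limit might look like the sketch below. The class name `BoundedWorkers` and the helper `trySubmit` are hypothetical; the sketch uses `AbortPolicy` so the caller can map a rejection to a 503, and swapping in `ThreadPoolExecutor.CallerRunsPolicy` gives the back-pressure variant instead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class BoundedWorkers {

    /** Fixed-size pool with a bounded queue; extra work is rejected, not buffered. */
    static ThreadPoolExecutor newPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy()); // throws RejectedExecutionException when full
    }

    /** Returns true if the task was accepted; false means the caller should answer 503. */
    static boolean trySubmit(ThreadPoolExecutor pool, Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }
}
```

For the alerting bullet, `pool.getQueue().size()` and `pool.getActiveCount()` are the values to export; note that the active count is documented as approximate.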

Background job pile-up

Rate-limit schedulers; skip a run if the previous one is still active (@DisallowConcurrentExecution in Quartz).
Move heavy jobs to a worker tier with its own scaling.
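
Outside Quartz, the same “skip if the previous run is still active” rule can be approximated with a small wrapper. This is a sketch, not a library class: `SkipIfRunning` and `skippedCount` are hypothetical names, and the guard only protects against overlap within one JVM, not across pods.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public final class SkipIfRunning implements Runnable {

    private final AtomicBoolean running = new AtomicBoolean(false);
    private final AtomicLong skipped = new AtomicLong();
    private final Runnable job;

    public SkipIfRunning(Runnable job) {
        this.job = job;
    }

    @Override
    public void run() {
        // Only one caller can flip false -> true; everyone else skips this tick.
        if (!running.compareAndSet(false, true)) {
            skipped.incrementAndGet();
            return;
        }
        try {
            job.run();
        } finally {
            running.set(false); // release the guard even if the job throws
        }
    }

    /** Number of ticks skipped because a previous run was still active. */
    public long skippedCount() {
        return skipped.get();
    }
}
```

Wrapping the task before handing it to a scheduler keeps slow runs from stacking up, and a rising `skippedCount()` is itself a useful signal that the job no longer fits its interval.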

Operational guardrails

Dashboard: latency p95 vs pod uptime (derived query or recording rule).
Alert: latency +30% vs 1h ago while RPS within 10% → “degradation” page.
Periodic soak test in CI/staging matching prod heap and pool sizes.
Runbook: “slow after hours” → checklist links to this page; no restart until dumps captured.
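
The alert condition above reduces to a small predicate, sketched here to make the thresholds explicit. The class and method names are illustrative, and in practice this logic would normally live in the alerting system's rule language rather than in application code.

```java
public final class DegradationCheck {

    private static final double LATENCY_INCREASE = 0.30; // +30% vs 1h ago
    private static final double TRAFFIC_TOLERANCE = 0.10; // RPS within 10%

    /** True when p95 rose by at least 30% while traffic stayed within 10% of its level an hour ago. */
    static boolean isDegraded(double p95NowMs, double p95HourAgoMs, double rpsNow, double rpsHourAgo) {
        if (p95HourAgoMs <= 0 || rpsHourAgo <= 0) {
            return false; // no usable baseline yet, for example right after pod start
        }
        boolean latencyUp = p95NowMs >= p95HourAgoMs * (1.0 + LATENCY_INCREASE);
        boolean trafficFlat = Math.abs(rpsNow - rpsHourAgo) <= rpsHourAgo * TRAFFIC_TOLERANCE;
        return latencyUp && trafficFlat;
    }
}
```

A p95 that goes from 200 ms to 280 ms at roughly 1000 RPS triggers the check; the same latency jump with RPS rising from 1000 to 1500 does not, because that pattern points to capacity rather than accumulation.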

When restart is acceptable

Restart to restore the SLA after capturing dumps and metrics. Schedule the fix before the next deploy; do not rely on a cron-restart as a permanent solution unless you document it as technical debt.

Interview one-liner

“I plot latency against process uptime at flat traffic, then check heap after GC, connection pool active count, thread pool queue, and FD count over time—capturing heap and thread dumps before restart. Fixes are usually leak, pool leak, or unbounded queue—not just scale out.”

Related scenarios