DB connection pool exhausted and requests hang
Scenario
API latency spikes; thread dumps show dozens of threads in HikariPool.getConnection(). Metrics: hikaricp.connections.active pegged at maximum, connections.pending > 0. The database may still look healthy—apps are waiting for a free connection, not necessarily a slow query. You must distinguish pool sizing, leaks, long-held connections, and fleet-wide max_connections limits.
After reading, you should be able to:
- Tell pool exhaustion from slow SQL and from HTTP thread pool exhaustion.
- Read Hikari metrics and thread dumps for connection wait vs query time.
- Size pools against DB
max_connectionsand concurrent request demand. - Fix leaks, long transactions, and misaligned thread/pool counts.
Why — connections are a finite lease
A JDBC connection pool (HikariCP is the common Java default) maintains a fixed set of open TCP sessions to the database.
Each request that needs the DB borrows a connection and must return it when done.
If all connections are in use—running queries, sitting in a long transaction, or leaked—new callers block until
connectionTimeout or indefinitely, tying up HTTP worker threads.
Root causes
| Cause | What you see |
|---|---|
| Pool too small | Active = max under normal RPS; pending > 0 |
| Slow queries | Connections held seconds; fix SQL — DB slow guide |
| Long transactions | @Transactional spans HTTP + external calls |
| Connection leak | Active climbs over hours; leak detection fires |
| Too many pods × pool size | DB rejects new connects; “too many clients” |
| Threads ≫ connections | 200 Tomcat threads, 10 connections—all block |
| N+1 query storm | Many short holds per request — N+1 guide |
Do not set maximumPoolSize = 200 per pod. Ten pods × 200 = 2000 connections may exceed PostgreSQL max_connections and hurt DB performance for everyone.
What — confirm exhaustion and find the holder
-
Hikari metrics (Micrometer / JMX)
hikaricp.connections.active≈hikaricp.connections.maxhikaricp.connections.pending> 0hikaricp.connections.timeoutincreasing (threads gave up)
-
Thread dump
"http-nio-8080-exec-14" WAITING at com.zaxxer.hikari.pool.HikariPool.getConnection(...)
Many such threads → pool wait, not necessarily slow DB CPU. - Split wait vs execution in traces — JDBC span long with sub-span “get connection” vs “execute” (depends on instrumentation).
-
Database side
-- PostgreSQL: who holds connections? SELECT pid, usename, application_name, state, wait_event_type, now() - xact_start AS xact_age, query FROM pg_stat_activity WHERE datname = current_database() ORDER BY xact_start NULLS LAST;Manyidle in transaction→ long TX in app;activewith slow query → optimize SQL. -
Fleet math
total_conns ≈ replicas × maximumPoolSize + admin + batch jobs total_conns < DB max_connections − headroom (10–20%)
-
Leak detection (staging first)
spring.datasource.hikari.leak-detection-threshold=2000
Logs stack trace if connection held > 2s outside expected work.
How — fix, size, and prevent
1. Return connections faster (fix root cause first)
- Optimize slow queries; add indexes — DB slow.
- Keep transactions short—no HTTP calls inside
@Transactional. - Fix N+1 to reduce round trips per request.
- Use
readOnly = truefor read paths where appropriate.
2. Fix leaks
- Always use Spring Data / JdbcTemplate / try-with-resources—never store
Connectionin a field. - Custom JDBC:
try (Connection c = dataSource.getConnection()) { … } - Audit manual
getConnection()and streaming ResultSets that bypass pool return.
3. Size the pool (practical starting point)
# Per instance — not per thread # Often: cores * 2 + effective_spindle_count (legacy rule) # Simpler OLTP: 10–30 per pod, tune with load test maximumPoolSize = 20 minimumIdle = 5 connectionTimeout = 5000 # ms — fail fast, don't hang forever maxLifetime = 1800000 # 30 min — recycle before DB/firewall kills idleTimeout = 600000
Align with HTTP threads: if maximumPoolSize=20, do not run 200 blocking request threads all hitting the DB—use async, batching, or lower maxThreads.
4. Fleet-wide budget
# Example: Postgres max_connections = 300 # 10 app pods → max 25 per pod (250) + 50 headroom for migrations/admin
Use PgBouncer (transaction pooling) when you need many app instances but fewer real DB sessions.
5. Bulkheads
Separate datasource/pool for heavy reporting vs API so a batch job cannot drain the API pool.
Verify
- Under peak load: active connections < 80% max; pending ≈ 0.
- Thread dumps: no pile-up on
getConnection. - DB
numbackendsstable over 24h; no “too many connections” errors. - p99 API latency improved after fix (not only pool metrics).
Interview one-liner
“I check Hikari active vs max and pending, thread dumps on getConnection, and pg_stat_activity for idle-in-transaction. I fix slow SQL and transaction scope first, fix leaks, then size pool per pod so replicas stay under DB max_connections—without making maximumPoolSize equal to thread count.”