DB connection pool exhausted and requests hang

Scenario

API latency spikes; thread dumps show dozens of threads in HikariPool.getConnection(). Metrics: hikaricp.connections.active pegged at maximum, connections.pending > 0. The database may still look healthy—apps are waiting for a free connection, not necessarily a slow query. You must distinguish pool sizing, leaks, long-held connections, and fleet-wide max_connections limits.

After reading, you should be able to:

Tell pool exhaustion from slow SQL and from HTTP thread pool exhaustion.
Read Hikari metrics and thread dumps for connection wait vs query time.
Size pools against DB max_connections and concurrent request demand.
Fix leaks, long transactions, and misaligned thread/pool counts.

Why — connections are a finite lease

A JDBC connection pool (HikariCP is the common Java default) maintains a fixed set of open TCP sessions to the database. Each request that needs the DB borrows a connection and must return it when done. If all connections are in use—running queries, sitting in a long transaction, or leaked—new callers block until connectionTimeout or indefinitely, tying up HTTP worker threads.

Root causes

Cause	What you see
Pool too small	Active = max under normal RPS; pending > 0
Slow queries	Connections held seconds; fix SQL — DB slow guide
Long transactions	@Transactional spans HTTP + external calls
Connection leak	Active climbs over hours; leak detection fires
Too many pods × pool size	DB rejects new connects; “too many clients”
Threads ≫ connections	200 Tomcat threads, 10 connections—all block
N+1 query storm	Many short holds per request — N+1 guide

Do not set maximumPoolSize = 200 per pod. Ten pods × 200 = 2000 connections may exceed PostgreSQL max_connections and hurt DB performance for everyone.

What — confirm exhaustion and find the holder

Hikari metrics (Micrometer / JMX)
- hikaricp.connections.active ≈ hikaricp.connections.max
- hikaricp.connections.pending > 0
- hikaricp.connections.timeout increasing (threads gave up)

Thread dump

"http-nio-8080-exec-14" WAITING
  at com.zaxxer.hikari.pool.HikariPool.getConnection(...)

Many such threads → pool wait, not necessarily slow DB CPU.

Split wait vs execution in traces — JDBC span long with sub-span “get connection” vs “execute” (depends on instrumentation).

Database side

-- PostgreSQL: who holds connections?
SELECT pid, usename, application_name, state, wait_event_type,
       now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE datname = current_database()
ORDER BY xact_start NULLS LAST;

Many idle in transaction → long TX in app; active with slow query → optimize SQL.

Fleet math

total_conns ≈ replicas × maximumPoolSize + admin + batch jobs
total_conns < DB max_connections − headroom (10–20%)

Leak detection (staging first)
```
spring.datasource.hikari.leak-detection-threshold=2000
```
Logs stack trace if connection held > 2s outside expected work.

How — fix, size, and prevent

1. Return connections faster (fix root cause first)

Optimize slow queries; add indexes — DB slow.
Keep transactions short—no HTTP calls inside @Transactional.
Fix N+1 to reduce round trips per request.
Use readOnly = true for read paths where appropriate.

2. Fix leaks

Always use Spring Data / JdbcTemplate / try-with-resources—never store Connection in a field.
Custom JDBC: try (Connection c = dataSource.getConnection()) { … }
Audit manual getConnection() and streaming ResultSets that bypass pool return.

3. Size the pool (practical starting point)

# Per instance — not per thread
# Often: cores * 2 + effective_spindle_count (legacy rule)
# Simpler OLTP: 10–30 per pod, tune with load test

maximumPoolSize = 20
minimumIdle = 5
connectionTimeout = 5000      # ms — fail fast, don't hang forever
maxLifetime = 1800000         # 30 min — recycle before DB/firewall kills
idleTimeout = 600000

Align with HTTP threads: if maximumPoolSize=20, do not run 200 blocking request threads all hitting the DB—use async, batching, or lower maxThreads.

4. Fleet-wide budget

# Example: Postgres max_connections = 300
# 10 app pods → max 25 per pod (250) + 50 headroom for migrations/admin

Use PgBouncer (transaction pooling) when you need many app instances but fewer real DB sessions.

5. Bulkheads

Separate datasource/pool for heavy reporting vs API so a batch job cannot drain the API pool.

Verify

Under peak load: active connections < 80% max; pending ≈ 0.
Thread dumps: no pile-up on getConnection.
DB numbackends stable over 24h; no “too many connections” errors.
p99 API latency improved after fix (not only pool metrics).

Interview one-liner

“I check Hikari active vs max and pending, thread dumps on getConnection, and pg_stat_activity for idle-in-transaction. I fix slow SQL and transaction scope first, fix leaks, then size pool per pod so replicas stay under DB max_connections—without making maximumPoolSize equal to thread count.”

Why — connections are a finite lease

Root causes

What — confirm exhaustion and find the holder

How — fix, size, and prevent

1. Return connections faster (fix root cause first)

2. Fix leaks

3. Size the pool (practical starting point)

4. Fleet-wide budget

5. Bulkheads

Verify

Interview one-liner

Related scenarios