DB connection pool exhausted and requests hang

Scenario

API latency spikes; thread dumps show dozens of threads in HikariPool.getConnection(). Metrics: hikaricp.connections.active pegged at maximum, connections.pending > 0. The database may still look healthy—apps are waiting for a free connection, not necessarily a slow query. You must distinguish pool sizing, leaks, long-held connections, and fleet-wide max_connections limits.

After reading, you should be able to:

Why — connections are a finite lease

A JDBC connection pool (HikariCP is the common Java default) maintains a fixed set of open TCP sessions to the database. Each request that needs the DB borrows a connection and must return it when done. If all connections are in use—running queries, sitting in a long transaction, or leaked—new callers block until connectionTimeout or indefinitely, tying up HTTP worker threads.

Root causes

CauseWhat you see
Pool too smallActive = max under normal RPS; pending > 0
Slow queriesConnections held seconds; fix SQL — DB slow guide
Long transactions@Transactional spans HTTP + external calls
Connection leakActive climbs over hours; leak detection fires
Too many pods × pool sizeDB rejects new connects; “too many clients”
Threads ≫ connections200 Tomcat threads, 10 connections—all block
N+1 query stormMany short holds per request — N+1 guide

Do not set maximumPoolSize = 200 per pod. Ten pods × 200 = 2000 connections may exceed PostgreSQL max_connections and hurt DB performance for everyone.

What — confirm exhaustion and find the holder

  1. Hikari metrics (Micrometer / JMX)
    • hikaricp.connections.activehikaricp.connections.max
    • hikaricp.connections.pending > 0
    • hikaricp.connections.timeout increasing (threads gave up)
  2. Thread dump
    "http-nio-8080-exec-14" WAITING
      at com.zaxxer.hikari.pool.HikariPool.getConnection(...)
    Many such threads → pool wait, not necessarily slow DB CPU.
  3. Split wait vs execution in traces — JDBC span long with sub-span “get connection” vs “execute” (depends on instrumentation).
  4. Database side
    -- PostgreSQL: who holds connections?
    SELECT pid, usename, application_name, state, wait_event_type,
           now() - xact_start AS xact_age, query
    FROM pg_stat_activity
    WHERE datname = current_database()
    ORDER BY xact_start NULLS LAST;
    Many idle in transaction → long TX in app; active with slow query → optimize SQL.
  5. Fleet math
    total_conns ≈ replicas × maximumPoolSize + admin + batch jobs
    total_conns < DB max_connections − headroom (10–20%)
  6. Leak detection (staging first)
    spring.datasource.hikari.leak-detection-threshold=2000
    Logs stack trace if connection held > 2s outside expected work.

How — fix, size, and prevent

1. Return connections faster (fix root cause first)

2. Fix leaks

3. Size the pool (practical starting point)

# Per instance — not per thread
# Often: cores * 2 + effective_spindle_count (legacy rule)
# Simpler OLTP: 10–30 per pod, tune with load test

maximumPoolSize = 20
minimumIdle = 5
connectionTimeout = 5000      # ms — fail fast, don't hang forever
maxLifetime = 1800000         # 30 min — recycle before DB/firewall kills
idleTimeout = 600000

Align with HTTP threads: if maximumPoolSize=20, do not run 200 blocking request threads all hitting the DB—use async, batching, or lower maxThreads.

4. Fleet-wide budget

# Example: Postgres max_connections = 300
# 10 app pods → max 25 per pod (250) + 50 headroom for migrations/admin

Use PgBouncer (transaction pooling) when you need many app instances but fewer real DB sessions.

5. Bulkheads

Separate datasource/pool for heavy reporting vs API so a batch job cannot drain the API pool.

Verify

  1. Under peak load: active connections < 80% max; pending ≈ 0.
  2. Thread dumps: no pile-up on getConnection.
  3. DB numbackends stable over 24h; no “too many connections” errors.
  4. p99 API latency improved after fix (not only pool metrics).

Interview one-liner

“I check Hikari active vs max and pending, thread dumps on getConnection, and pg_stat_activity for idle-in-transaction. I fix slow SQL and transaction scope first, fix leaks, then size pool per pod so replicas stay under DB max_connections—without making maximumPoolSize equal to thread count.”

Related scenarios