Database calls are slowing down the service

Scenario

API p99 jumped from 200ms to 2s. Traces show most time in JDBC spans. CPU on app servers is moderate—the database (or how the app uses it) is the bottleneck. You need to find which queries, why they are slow, and apply fixes that survive peak traffic without guessing indexes.

After reading, you should be able to:

Why — the DB amplifies application mistakes

A single HTTP request can execute one efficient query or hundreds of slow ones. The database does work proportional to rows examined, locks held, and round trips. Under load, a missing index or N+1 pattern turns into pool wait and thread pool exhaustion upstream.

Common causes

CauseSymptom
Full table / index scanLatency grows with table size
N+1 queriesQuery count ∝ list size in one API call
Over-fetchingSELECT *, huge lists, no LIMIT
Lock waitSpikes at peak; DB “waiting” metrics — see row locks
Stale statisticsPlanner chooses bad plan after bulk load
Connection overheadMany short queries; pool too small — pool guide
Replica lag readsRetries, inconsistent UX — replica lag
Disk / IOPS limitDB CPU low but latency high; storage saturated

What — find the expensive queries first

  1. Confirm DB is on the critical path — trace waterfall: JDBC span > 50% of request time; metric jdbc.query.duration p99 up.
  2. Application-level query count — log or metric queries-per-request; 1 vs 200 for same endpoint → N+1 suspicion.
  3. Slow query log (database) — PostgreSQL log_min_duration_statement; MySQL slow log; RDS Performance Insights.
  4. Aggregate by normalized SQL
    -- PostgreSQL: pg_stat_statements
    SELECT calls, mean_exec_time, total_exec_time, query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 20;
  5. Capture one slow statement + EXPLAIN
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT … FROM orders WHERE user_id = ? AND status = 'OPEN';
    Look for Seq Scan, high rows, nested loops with huge inner rows.
  6. Check indexes exist for filter/join columns — composite index order matches WHERE + ORDER BY leading columns.
  7. ORM visibility — enable Hibernate SQL stats / datasource-proxy / p6spy in staging; map SQL to repository method.
  8. Rule out pool wait vs query time — Hikari “connection acquired” wait vs statement execution in trace spans.

EXPLAIN red flags (PostgreSQL-oriented)

How — optimizations (prioritized)

1. Reduce work per request (often biggest win)

2. Index deliberately

-- Example: filter by user_id + status, sort by created_at
CREATE INDEX CONCURRENTLY idx_orders_user_status_created
  ON orders (user_id, status, created_at DESC);

3. Caching and read path

4. Database maintenance

5. When the DB itself is undersized

If EXPLAIN is already optimal and latency tracks IOPS/CPU on RDS—scale instance, storage type, or connection pooling (PgBouncer), not more app threads.

JPA fetch join sketch

@Query("SELECT o FROM Order o JOIN FETCH o.items WHERE o.userId = :uid")
List<Order> findWithItems(@Param("uid") Long uid);

Verify every change

  1. EXPLAIN before/after on production-like data volume (staging).
  2. Load test: p99 API and total_exec_time for that query down.
  3. Queries-per-request metric dropped for the endpoint.
  4. No regression on write path (INSERT p99, lock waits).

Interview one-liner

“I use traces and pg_stat_statements to find the top queries, EXPLAIN ANALYZE to see scans and joins, then fix N+1 and over-fetching first, add indexes that match the filter order, and verify p99 and per-request query count under load—not just explain in dev.”

Related scenarios