Database calls are slowing down the service
Scenario
API p99 jumped from 200ms to 2s. Traces show most time in JDBC spans. CPU on app servers is moderate—the database (or how the app uses it) is the bottleneck. You need to find which queries, why they are slow, and apply fixes that survive peak traffic without guessing indexes.
After reading, you should be able to:
- Find top slow queries from APM, slow logs, and
pg_stat_statements/ Performance Schema. - Read
EXPLAINfor seq scans, bad joins, and missing indexes. - Apply fixes: indexes, batching, pagination, fetch joins, caching—in the right order.
- Verify with before/after metrics, not only local micro-benchmarks.
Why — the DB amplifies application mistakes
A single HTTP request can execute one efficient query or hundreds of slow ones. The database does work proportional to rows examined, locks held, and round trips. Under load, a missing index or N+1 pattern turns into pool wait and thread pool exhaustion upstream.
Common causes
| Cause | Symptom |
|---|---|
| Full table / index scan | Latency grows with table size |
| N+1 queries | Query count ∝ list size in one API call |
| Over-fetching | SELECT *, huge lists, no LIMIT |
| Lock wait | Spikes at peak; DB “waiting” metrics — see row locks |
| Stale statistics | Planner chooses bad plan after bulk load |
| Connection overhead | Many short queries; pool too small — pool guide |
| Replica lag reads | Retries, inconsistent UX — replica lag |
| Disk / IOPS limit | DB CPU low but latency high; storage saturated |
What — find the expensive queries first
-
Confirm DB is on the critical path
— trace waterfall: JDBC span > 50% of request time; metric
jdbc.query.durationp99 up. - Application-level query count — log or metric queries-per-request; 1 vs 200 for same endpoint → N+1 suspicion.
-
Slow query log (database)
— PostgreSQL
log_min_duration_statement; MySQL slow log; RDS Performance Insights. -
Aggregate by normalized SQL
-- PostgreSQL: pg_stat_statements SELECT calls, mean_exec_time, total_exec_time, query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;
-
Capture one slow statement + EXPLAIN
EXPLAIN (ANALYZE, BUFFERS) SELECT … FROM orders WHERE user_id = ? AND status = 'OPEN';
Look forSeq Scan, highrows, nested loops with huge inner rows. -
Check indexes exist for filter/join columns
— composite index order matches
WHERE+ORDER BYleading columns. - ORM visibility — enable Hibernate SQL stats / datasource-proxy / p6spy in staging; map SQL to repository method.
- Rule out pool wait vs query time — Hikari “connection acquired” wait vs statement execution in trace spans.
EXPLAIN red flags (PostgreSQL-oriented)
Seq Scanon large table with selectiveWHERE→ index candidate.Nested Loopwith millions of inner rows → join order or missing index.Sorton huge rowset → index that provides order, or sort in app with LIMIT.Buffers: readvery high → cold cache or table bigger than memory.
How — optimizations (prioritized)
1. Reduce work per request (often biggest win)
- Fix N+1 —
JOIN FETCH, entity graph, batch@BatchSize, DTO projection in one query. - Pagination — keyset pagination for feeds; never unbounded
findAll(). - Select only needed columns — DTO/query projection vs full entity graph.
- Batch writes — JDBC batch, bulk insert, not per-row commit in a loop.
2. Index deliberately
-- Example: filter by user_id + status, sort by created_at CREATE INDEX CONCURRENTLY idx_orders_user_status_created ON orders (user_id, status, created_at DESC);
- Use
CONCURRENTLYin prod PostgreSQL to avoid long write locks. - Measure: same EXPLAIN should show
Index Scan; p99 drops in metrics. - Watch write amplification and storage—too many indexes slow INSERT/UPDATE.
3. Caching and read path
- Cache read-heavy, tolerate-TTL data (Redis/Caffeine) with explicit invalidation.
- Route heavy reports to read replica or OLAP—not primary OLTP.
4. Database maintenance
ANALYZE/ update statistics after large data changes.- Partition very large tables by time or tenant.
- Archive cold rows so hot indexes stay small.
5. When the DB itself is undersized
If EXPLAIN is already optimal and latency tracks IOPS/CPU on RDS—scale instance, storage type, or connection pooling (PgBouncer), not more app threads.
JPA fetch join sketch
@Query("SELECT o FROM Order o JOIN FETCH o.items WHERE o.userId = :uid")
List<Order> findWithItems(@Param("uid") Long uid);
Verify every change
- EXPLAIN before/after on production-like data volume (staging).
- Load test: p99 API and
total_exec_timefor that query down. - Queries-per-request metric dropped for the endpoint.
- No regression on write path (INSERT p99, lock waits).
Interview one-liner
“I use traces and pg_stat_statements to find the top queries, EXPLAIN ANALYZE to see scans and joins, then fix N+1 and over-fetching first, add indexes that match the filter order, and verify p99 and per-request query count under load—not just explain in dev.”