Database calls are slowing down the service

Scenario

API p99 jumped from 200ms to 2s. Traces show most time in JDBC spans. CPU on app servers is moderate—the database (or how the app uses it) is the bottleneck. You need to find which queries, why they are slow, and apply fixes that survive peak traffic without guessing indexes.

After reading, you should be able to:

Find top slow queries from APM, slow logs, and pg_stat_statements / Performance Schema.
Read EXPLAIN for seq scans, bad joins, and missing indexes.
Apply fixes: indexes, batching, pagination, fetch joins, caching—in the right order.
Verify with before/after metrics, not only local micro-benchmarks.

Why — the DB amplifies application mistakes

A single HTTP request can execute one efficient query or hundreds of slow ones. The database does work proportional to rows examined, locks held, and round trips. Under load, a missing index or N+1 pattern turns into pool wait and thread pool exhaustion upstream.

Common causes

Cause	Symptom
Full table / index scan	Latency grows with table size
N+1 queries	Query count ∝ list size in one API call
Over-fetching	`SELECT *`, huge lists, no LIMIT
Lock wait	Spikes at peak; DB “waiting” metrics — see row locks
Stale statistics	Planner chooses bad plan after bulk load
Connection overhead	Many short queries; pool too small — pool guide
Replica lag reads	Retries, inconsistent UX — replica lag
Disk / IOPS limit	DB CPU low but latency high; storage saturated

What — find the expensive queries first

Confirm DB is on the critical path — trace waterfall: JDBC span > 50% of request time; metric jdbc.query.duration p99 up.
Application-level query count — log or metric queries-per-request; 1 vs 200 for same endpoint → N+1 suspicion.
Slow query log (database) — PostgreSQL log_min_duration_statement; MySQL slow log; RDS Performance Insights.

Aggregate by normalized SQL

-- PostgreSQL: pg_stat_statements
SELECT calls, mean_exec_time, total_exec_time, query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Capture one slow statement + EXPLAIN
```
EXPLAIN (ANALYZE, BUFFERS)
SELECT … FROM orders WHERE user_id = ? AND status = 'OPEN';
```
Look for Seq Scan, high rows, nested loops with huge inner rows.
Check indexes exist for filter/join columns — composite index order matches WHERE + ORDER BY leading columns.
ORM visibility — enable Hibernate SQL stats / datasource-proxy / p6spy in staging; map SQL to repository method.
Rule out pool wait vs query time — Hikari “connection acquired” wait vs statement execution in trace spans.

EXPLAIN red flags (PostgreSQL-oriented)

Seq Scan on large table with selective WHERE → index candidate.
Nested Loop with millions of inner rows → join order or missing index.
Sort on huge rowset → index that provides order, or sort in app with LIMIT.
Buffers: read very high → cold cache or table bigger than memory.

How — optimizations (prioritized)

1. Reduce work per request (often biggest win)

Fix N+1 — JOIN FETCH, entity graph, batch @BatchSize, DTO projection in one query.
Pagination — keyset pagination for feeds; never unbounded findAll().
Select only needed columns — DTO/query projection vs full entity graph.
Batch writes — JDBC batch, bulk insert, not per-row commit in a loop.

2. Index deliberately

-- Example: filter by user_id + status, sort by created_at
CREATE INDEX CONCURRENTLY idx_orders_user_status_created
  ON orders (user_id, status, created_at DESC);

Use CONCURRENTLY in prod PostgreSQL to avoid long write locks.
Measure: same EXPLAIN should show Index Scan; p99 drops in metrics.
Watch write amplification and storage—too many indexes slow INSERT/UPDATE.

3. Caching and read path

Cache read-heavy, tolerate-TTL data (Redis/Caffeine) with explicit invalidation.
Route heavy reports to read replica or OLAP—not primary OLTP.

4. Database maintenance

ANALYZE / update statistics after large data changes.
Partition very large tables by time or tenant.
Archive cold rows so hot indexes stay small.

5. When the DB itself is undersized

If EXPLAIN is already optimal and latency tracks IOPS/CPU on RDS—scale instance, storage type, or connection pooling (PgBouncer), not more app threads.

JPA fetch join sketch

@Query("SELECT o FROM Order o JOIN FETCH o.items WHERE o.userId = :uid")
List<Order> findWithItems(@Param("uid") Long uid);

Verify every change

EXPLAIN before/after on production-like data volume (staging).
Load test: p99 API and total_exec_time for that query down.
Queries-per-request metric dropped for the endpoint.
No regression on write path (INSERT p99, lock waits).

Interview one-liner

“I use traces and pg_stat_statements to find the top queries, EXPLAIN ANALYZE to see scans and joins, then fix N+1 and over-fetching first, add indexes that match the filter order, and verify p99 and per-request query count under load—not just explain in dev.”