Spikes in DB row lock wait time at peak traffic

Scenario

At peak, API latency jumps and DB metrics show lock wait time or waiting sessions piling up. CPU may still look fine—threads are blocked on UPDATE wallet SET balance … or SELECT … FOR UPDATE on the same rows. A recent deploy, batch job, or new promo increased contention on a hot key. You need to see who blocks whom and shorten the contention window.

After reading, you should be able to:

Why — writers (and some readers) serialize on rows

Relational databases use row-level locks so concurrent transactions do not corrupt data. When transaction A holds a lock on row R, transaction B that needs the same lock waits (or times out). Peak traffic + same account/order/inventory row → a queue forms; p99 latency spikes even if each SQL plan is fast.

Common triggers

TriggerExample
Hot row updatesGlobal counter, wallet balance, ticket inventory
SELECT FOR UPDATEPessimistic locking in service layer
Long transactionsExternal HTTP inside TX — holds locks — pool guide
Missing index on UPDATE WHERELocks more rows than intended (scan)
Batch job + OLTPNightly update collides with peak API
Isolation levelMySQL REPEATABLE READ gap/next-key locks; SERIALIZABLE
DB deadlockCycle of row locks; one victim rolled back — deadlock guide

What — who blocks whom

  1. Confirm lock waits (metrics) — PostgreSQL: db.locks, wait_event_type=Lock; MySQL: Innodb_row_lock_time, Performance Insights “db.wait_event”.
  2. PostgreSQL: blocking tree
    SELECT blocked.pid AS blocked_pid,
           blocked.query AS blocked_query,
           blocking.pid AS blocking_pid,
           blocking.query AS blocking_query,
           now() - blocked.xact_start AS blocked_xact_age
    FROM pg_stat_activity blocked
    JOIN pg_locks bl ON bl.pid = blocked.pid AND NOT bl.granted
    JOIN pg_locks kl ON kl.locktype = bl.locktype
      AND kl.database IS NOT DISTINCT FROM bl.database
      AND kl.relation IS NOT DISTINCT FROM bl.relation
      AND kl.page IS NOT DISTINCT FROM bl.page
      AND kl.tuple IS NOT DISTINCT FROM bl.tuple
      AND kl.virtualxid IS NOT DISTINCT FROM bl.virtualxid
      AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid
      AND kl.classid IS NOT DISTINCT FROM bl.classid
      AND kl.objid IS NOT DISTINCT FROM bl.objid
      AND kl.objsubid IS NOT DISTINCT FROM bl.objsubid
      AND kl.pid != bl.pid
    JOIN pg_stat_activity blocking ON blocking.pid = kl.pid
    WHERE kl.granted;
  3. MySQL InnoDBSHOW ENGINE INNODB STATUS (latest TRANSACTIONS); Performance Schema data_locks / data_lock_waits (8.0+).
  4. Identify hot table/key — same WHERE id = ? or same primary key in blocked queries.
  5. What changed? — deploy added FOR UPDATE, new batch, promo on one SKU, traffic shift.
  6. App traces — long JDBC transaction span; many concurrent updates to same entity id.
  7. Not lock contention — if waits are IO or queries are seq scans — slow query guide.

Symptoms vs causes

ObservationLikely cause
Spike only at peak on one tableHot row / SKU / account
Spike when batch startsJob lock scope too wide
After deployNew pessimistic lock or longer TX — deploy regression
Deadlock errors in logsLock order; retry with backoff

How — reduce contention

1. Shorten lock duration (first)

2. Optimistic locking (many OLTP paths)

UPDATE account
SET balance = balance - :amt, version = version + 1
WHERE id = :id AND version = :expectedVersion;
-- 0 rows → retry or 409 Conflict

Avoids long-held FOR UPDATE when conflicts are rare.

3. Hot row / counter sharding

Split one logical counter into N physical rows (shard_id = hash(userId) % N); aggregate on read or periodic reconcile—trades write parallelism for complexity.

4. Serialize per business key in app

Partition queue: all updates for accountId=123 on one consumer—reduces DB lock fights at cost of throughput per account.

5. Work-queue patterns (PostgreSQL)

SELECT * FROM jobs
WHERE status = 'READY'
ORDER BY id
FOR UPDATE SKIP LOCKED
LIMIT 10;

Workers skip rows already locked—reduces wait chains for job tables.

6. Indexes and isolation

7. Operational

Verify

  1. Lock wait metric down at same peak RPS.
  2. Blocking query returns empty or short-lived during load test.
  3. p99 API latency improved; deadlock rate acceptable.

Interview one-liner

“I use DB views of blocking sessions to find the hot row and long transactions, shorten transactions and avoid unnecessary FOR UPDATE, then apply optimistic locking or sharding for hot counters—and I separate batch from peak OLTP.”

Related scenarios