Spikes in DB row lock wait time at peak traffic

Scenario

At peak, API latency jumps and DB metrics show lock wait time or waiting sessions piling up. CPU may still look fine—threads are blocked on UPDATE wallet SET balance … or SELECT … FOR UPDATE on the same rows. A recent deploy, batch job, or new promo increased contention on a hot key. You need to see who blocks whom and shorten the contention window.

After reading, you should be able to:

Distinguish row lock waits from slow scans and pool exhaustion.
Query blocking sessions in PostgreSQL and MySQL/InnoDB.
Mitigate with shorter transactions, optimistic locking, and hot-row sharding.
Relate DB deadlocks to app retry and lock ordering — JVM deadlock guide.

Why — writers (and some readers) serialize on rows

Relational databases use row-level locks so concurrent transactions do not corrupt data. When transaction A holds a lock on row R, transaction B that needs the same lock waits (or times out). Peak traffic + same account/order/inventory row → a queue forms; p99 latency spikes even if each SQL plan is fast.

Common triggers

Trigger	Example
Hot row updates	Global counter, wallet balance, ticket inventory
`SELECT FOR UPDATE`	Pessimistic locking in service layer
Long transactions	External HTTP inside TX — holds locks — pool guide
Missing index on UPDATE WHERE	Locks more rows than intended (scan)
Batch job + OLTP	Nightly update collides with peak API
Isolation level	MySQL `REPEATABLE READ` gap/next-key locks; `SERIALIZABLE`
DB deadlock	Cycle of row locks; one victim rolled back — deadlock guide

What — who blocks whom

Confirm lock waits (metrics) — PostgreSQL: db.locks, wait_event_type=Lock; MySQL: Innodb_row_lock_time, Performance Insights “db.wait_event”.

PostgreSQL: blocking tree

SELECT blocked.pid AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query,
       now() - blocked.xact_start AS blocked_xact_age
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid AND NOT bl.granted
JOIN pg_locks kl ON kl.locktype = bl.locktype
  AND kl.database IS NOT DISTINCT FROM bl.database
  AND kl.relation IS NOT DISTINCT FROM bl.relation
  AND kl.page IS NOT DISTINCT FROM bl.page
  AND kl.tuple IS NOT DISTINCT FROM bl.tuple
  AND kl.virtualxid IS NOT DISTINCT FROM bl.virtualxid
  AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid
  AND kl.classid IS NOT DISTINCT FROM bl.classid
  AND kl.objid IS NOT DISTINCT FROM bl.objid
  AND kl.objsubid IS NOT DISTINCT FROM bl.objsubid
  AND kl.pid != bl.pid
JOIN pg_stat_activity blocking ON blocking.pid = kl.pid
WHERE kl.granted;

MySQL InnoDB — SHOW ENGINE INNODB STATUS (latest TRANSACTIONS); Performance Schema data_locks / data_lock_waits (8.0+).
Identify hot table/key — same WHERE id = ? or same primary key in blocked queries.
What changed? — deploy added FOR UPDATE, new batch, promo on one SKU, traffic shift.
App traces — long JDBC transaction span; many concurrent updates to same entity id.
Not lock contention — if waits are IO or queries are seq scans — slow query guide.

Symptoms vs causes

Observation	Likely cause
Spike only at peak on one table	Hot row / SKU / account
Spike when batch starts	Job lock scope too wide
After deploy	New pessimistic lock or longer TX — deploy regression
Deadlock errors in logs	Lock order; retry with backoff

How — reduce contention

1. Shorten lock duration (first)

Remove remote calls from @Transactional methods.
Commit before publishing to Kafka or calling email service.
Split read-only work outside write transaction.

2. Optimistic locking (many OLTP paths)

UPDATE account
SET balance = balance - :amt, version = version + 1
WHERE id = :id AND version = :expectedVersion;
-- 0 rows → retry or 409 Conflict

Avoids long-held FOR UPDATE when conflicts are rare.

3. Hot row / counter sharding

Split one logical counter into N physical rows (shard_id = hash(userId) % N); aggregate on read or periodic reconcile—trades write parallelism for complexity.

4. Serialize per business key in app

Partition queue: all updates for accountId=123 on one consumer—reduces DB lock fights at cost of throughput per account.

5. Work-queue patterns (PostgreSQL)

SELECT * FROM jobs
WHERE status = 'READY'
ORDER BY id
FOR UPDATE SKIP LOCKED
LIMIT 10;

Workers skip rows already locked—reduces wait chains for job tables.

6. Indexes and isolation

Ensure UPDATE/DELETE WHERE uses index—locks only target rows.
Review if READ COMMITTED is acceptable vs REPEATABLE READ (MySQL gap locks).

7. Operational

Run heavy batch off peak or on replica dedicated to batch.
Idempotent retry on deadlock/serialization failure with capped backoff.

Verify

Lock wait metric down at same peak RPS.
Blocking query returns empty or short-lived during load test.
p99 API latency improved; deadlock rate acceptable.

Interview one-liner

“I use DB views of blocking sessions to find the hot row and long transactions, shorten transactions and avoid unnecessary FOR UPDATE, then apply optimistic locking or sharding for hot counters—and I separate batch from peak OLTP.”