Spikes in DB row lock wait time at peak traffic
Scenario
At peak, API latency jumps and DB metrics show lock wait time or waiting sessions piling up. CPU may still look fine—threads are blocked on UPDATE wallet SET balance … or SELECT … FOR UPDATE on the same rows. A recent deploy, batch job, or new promo increased contention on a hot key. You need to see who blocks whom and shorten the contention window.
After reading, you should be able to:
- Distinguish row lock waits from slow scans and pool exhaustion.
- Query blocking sessions in PostgreSQL and MySQL/InnoDB.
- Mitigate with shorter transactions, optimistic locking, and hot-row sharding.
- Relate DB deadlocks to app retry and lock ordering — JVM deadlock guide.
Why — writers (and some readers) serialize on rows
Relational databases use row-level locks so concurrent transactions do not corrupt data. When transaction A holds a lock on row R, transaction B that needs the same lock waits (or times out). Peak traffic + same account/order/inventory row → a queue forms; p99 latency spikes even if each SQL plan is fast.
Common triggers
| Trigger | Example |
|---|---|
| Hot row updates | Global counter, wallet balance, ticket inventory |
SELECT FOR UPDATE | Pessimistic locking in service layer |
| Long transactions | External HTTP inside TX — holds locks — pool guide |
| Missing index on UPDATE WHERE | Locks more rows than intended (scan) |
| Batch job + OLTP | Nightly update collides with peak API |
| Isolation level | MySQL REPEATABLE READ gap/next-key locks; SERIALIZABLE |
| DB deadlock | Cycle of row locks; one victim rolled back — deadlock guide |
What — who blocks whom
-
Confirm lock waits (metrics)
— PostgreSQL:
db.locks,wait_event_type=Lock; MySQL:Innodb_row_lock_time, Performance Insights “db.wait_event”. -
PostgreSQL: blocking tree
SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query, blocking.pid AS blocking_pid, blocking.query AS blocking_query, now() - blocked.xact_start AS blocked_xact_age FROM pg_stat_activity blocked JOIN pg_locks bl ON bl.pid = blocked.pid AND NOT bl.granted JOIN pg_locks kl ON kl.locktype = bl.locktype AND kl.database IS NOT DISTINCT FROM bl.database AND kl.relation IS NOT DISTINCT FROM bl.relation AND kl.page IS NOT DISTINCT FROM bl.page AND kl.tuple IS NOT DISTINCT FROM bl.tuple AND kl.virtualxid IS NOT DISTINCT FROM bl.virtualxid AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid AND kl.classid IS NOT DISTINCT FROM bl.classid AND kl.objid IS NOT DISTINCT FROM bl.objid AND kl.objsubid IS NOT DISTINCT FROM bl.objsubid AND kl.pid != bl.pid JOIN pg_stat_activity blocking ON blocking.pid = kl.pid WHERE kl.granted; -
MySQL InnoDB
—
SHOW ENGINE INNODB STATUS(latest TRANSACTIONS); Performance Schemadata_locks/data_lock_waits(8.0+). -
Identify hot table/key
— same
WHERE id = ?or same primary key in blocked queries. -
What changed?
— deploy added
FOR UPDATE, new batch, promo on one SKU, traffic shift. - App traces — long JDBC transaction span; many concurrent updates to same entity id.
-
Not lock contention
— if waits are
IOor queries are seq scans — slow query guide.
Symptoms vs causes
| Observation | Likely cause |
|---|---|
| Spike only at peak on one table | Hot row / SKU / account |
| Spike when batch starts | Job lock scope too wide |
| After deploy | New pessimistic lock or longer TX — deploy regression |
| Deadlock errors in logs | Lock order; retry with backoff |
How — reduce contention
1. Shorten lock duration (first)
- Remove remote calls from
@Transactionalmethods. - Commit before publishing to Kafka or calling email service.
- Split read-only work outside write transaction.
2. Optimistic locking (many OLTP paths)
UPDATE account SET balance = balance - :amt, version = version + 1 WHERE id = :id AND version = :expectedVersion; -- 0 rows → retry or 409 Conflict
Avoids long-held FOR UPDATE when conflicts are rare.
3. Hot row / counter sharding
Split one logical counter into N physical rows (shard_id = hash(userId) % N); aggregate on read or periodic reconcile—trades write parallelism for complexity.
4. Serialize per business key in app
Partition queue: all updates for accountId=123 on one consumer—reduces DB lock fights at cost of throughput per account.
5. Work-queue patterns (PostgreSQL)
SELECT * FROM jobs WHERE status = 'READY' ORDER BY id FOR UPDATE SKIP LOCKED LIMIT 10;
Workers skip rows already locked—reduces wait chains for job tables.
6. Indexes and isolation
- Ensure
UPDATE/DELETEWHEREuses index—locks only target rows. - Review if
READ COMMITTEDis acceptable vsREPEATABLE READ(MySQL gap locks).
7. Operational
- Run heavy batch off peak or on replica dedicated to batch.
- Idempotent retry on deadlock/serialization failure with capped backoff.
Verify
- Lock wait metric down at same peak RPS.
- Blocking query returns empty or short-lived during load test.
- p99 API latency improved; deadlock rate acceptable.
Interview one-liner
“I use DB views of blocking sessions to find the hot row and long transactions, shorten transactions and avoid unnecessary FOR UPDATE, then apply optimistic locking or sharding for hot counters—and I separate batch from peak OLTP.”