Stale reads after an update (read replica lag)
Scenario
A user saves a profile, refreshes the page, and sees old data. Support reports that “it fixed itself after a minute.” Writes go to the primary; reads use a replica that trails by seconds (or minutes) under load. The SQL is correct; the app’s consistency expectations are not. You need to measure lag and choose a read-your-writes strategy.
After reading, you should be able to:
- Explain asynchronous replication and the read-your-writes problem.
- Measure replica lag on PostgreSQL and MySQL/Aurora.
- Route critical reads to primary after writes (or use versioned routing).
- Separate “eventual OK” reads from “must be fresh” paths in API design.
Why — replicas are eventually consistent
A read replica applies the primary’s WAL/binlog asynchronously. Replication lag is normal: it grows with burst writes, large transactions, network delay, replica CPU saturation, or maintenance. If the app reads from a replica immediately after writing to the primary, it may not see the commit yet. Users call it a bug; operators call it eventual consistency. The symptoms overlap with inconsistent logs and a wrong cache, so confirm which read path was used.
When lag hurts
| Flow | User impact |
|---|---|
| POST update → GET detail from replica | Old values shown |
| Create order → list orders from replica | Missing new order |
| Idempotency check on replica | Duplicate writes |
| Auth/session flags on replica | Random 401/403 |
When lag is acceptable
- Analytics, search facets, dashboards (minutes stale OK).
- Public catalog with TTL cache and documented delay.
- Heavy read-only reports run on a dedicated replica (not the user’s own just-written row).
What — measure lag and prove the path
1. Replication lag metrics
   - PostgreSQL: pg_stat_replication columns replay_lag / write_lag; on RDS, the ReplicaLag metric.
   - MySQL: Seconds_Behind_Source (interpret with care); on Aurora, the AuroraReplicaLag metric.
   - Correlate with incidents: a lag spike at the same time as “stale read” tickets; a write-heavy deploy or batch job on the primary.
2. Log the datasource target in the app: a structured field db_role=primary|replica on each query (via the routing framework or AOP).
3. Reproduce
       -- After write on primary (app)
       SELECT … FROM orders WHERE id = ?; -- on replica session → row missing?
4. Rule out cache: stale Redis after a write without invalidation (see the cache stale guide).
5. Rule out an app race: two tabs, same user; it is not replication if both reads hit the primary.
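The db_role field from the logging step can be illustrated with a minimal sketch. The names DbRole and QueryLog are hypothetical and not part of any framework; a real application would more likely attach the field through its logging library's structured context than build strings by hand.

```java
import java.util.Locale;

// Hypothetical names for illustration only.
enum DbRole { PRIMARY, REPLICA }

final class QueryLog {
    private QueryLog() {}

    /** Builds one key=value log line recording which datasource served the query. */
    static String line(DbRole role, String queryName, long elapsedMillis) {
        return "db_role=" + role.name().toLowerCase(Locale.ROOT)
                + " query=" + queryName
                + " elapsed_ms=" + elapsedMillis;
    }
}
```

With this field on every query, a stale-read ticket can be matched against the log to see whether the read that returned old data actually went to a replica.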
PostgreSQL lag check
-- Run on the primary; returns one row per connected standby.
SELECT application_name, state, sync_state,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
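The query above runs on the primary. As a complement, lag can also be estimated from the replica side and compared with an SLO. The sketch below assumes a plain JDBC connection to a PostgreSQL replica; ReplicaLagProbe and withinSlo are hypothetical names, while now() and pg_last_xact_replay_timestamp() are standard PostgreSQL functions. One caveat: when the primary is idle, this value keeps growing even though the replica is fully caught up, so treat it as an upper bound.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.time.Duration;

final class ReplicaLagProbe {
    private ReplicaLagProbe() {}

    // Seconds since the last replayed transaction; 0 when nothing has been replayed yet.
    static final String REPLICA_LAG_SQL =
            "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0)";

    /** Runs the lag query on a connection that points at the replica. */
    static Duration measure(Connection replica) throws SQLException {
        try (Statement st = replica.createStatement();
             ResultSet rs = st.executeQuery(REPLICA_LAG_SQL)) {
            if (!rs.next()) {
                throw new SQLException("lag query returned no row");
            }
            return Duration.ofMillis(Math.round(rs.getDouble(1) * 1000));
        }
    }

    /** True when the measured lag is at or below the agreed SLO. */
    static boolean withinSlo(Duration lag, Duration slo) {
        return lag.compareTo(slo) <= 0;
    }
}
```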
How — application strategies
1. Read-your-writes (most common fix)
After a write in a request or session, route subsequent reads to primary for a short window or for that user’s entities.
- Request-scoped (same HTTP request): write then read → always primary.
- Session cookie flag: “recent write” for 5–30s → primary for GETs.
- Version token: the client sends X-Read-After: <version>; the server uses the primary until the replica catches up to that version (advanced).
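The version-token option can be sketched as a small routing decision. This is an illustration under assumptions: the version is taken to be a monotonically increasing number that the server returns after a write, and the server is assumed to know the highest version the replica has applied. VersionRouter and its method are hypothetical names.

```java
final class VersionRouter {
    private VersionRouter() {}

    enum Target { PRIMARY, REPLICA }

    /**
     * Chooses a datasource from the X-Read-After header value and the
     * highest version the replica is known to have applied.
     */
    static Target choose(String readAfterHeader, long replicaAppliedVersion) {
        if (readAfterHeader == null || readAfterHeader.trim().isEmpty()) {
            return Target.REPLICA; // client has no freshness requirement
        }
        long required;
        try {
            required = Long.parseLong(readAfterHeader.trim());
        } catch (NumberFormatException e) {
            return Target.PRIMARY; // unreadable token: fail towards freshness
        }
        return replicaAppliedVersion >= required ? Target.REPLICA : Target.PRIMARY;
    }
}
```

On PostgreSQL, one possible source for such versions is the WAL position: pg_current_wal_lsn() on the primary after the write and pg_last_wal_replay_lsn() on the replica. That pairing is a suggestion, not something the strategy requires.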
2. Routing in Spring (conceptual)
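// Conceptual read path: pick the datasource per call.
@Transactional(readOnly = true)
public OrderDto getOrder(Long id) {
    if (ReadContext.usePrimary()) {
        return primaryJdbc.query(...);   // recent write: the replica may be stale
    }
    return replicaJdbc.query(...);       // no recent write: the replica is fine
}

// After update (inside the write path):
ReadContext.setPrimaryForMillis(10_000); // route this caller's reads to primary for 10 s
return getOrder(id);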
Libraries: Spring’s AbstractRoutingDataSource, Apache ShardingSphere, or the AWS JDBC Driver (failover plus separate reader/writer URLs) combined with a custom router.
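The ReadContext helper used in the conceptual snippet is not defined in the text. One possible shape, assuming a thread-bound deadline and an injectable clock (setClockForTests is a hypothetical test hook), is sketched here:

```java
import java.util.function.LongSupplier;

final class ReadContext {
    private ReadContext() {}

    // Epoch millis until which the current thread should read from the primary.
    private static final ThreadLocal<Long> PRIMARY_UNTIL = ThreadLocal.withInitial(() -> 0L);
    private static volatile LongSupplier clock = System::currentTimeMillis;

    /** Call after a write: pin this thread's reads to the primary for the given window. */
    static void setPrimaryForMillis(long millis) {
        PRIMARY_UNTIL.set(clock.getAsLong() + millis);
    }

    /** True while the pin window is still open. */
    static boolean usePrimary() {
        return clock.getAsLong() < PRIMARY_UNTIL.get();
    }

    /** Call at the end of each request so pooled threads do not leak the flag. */
    static void clear() {
        PRIMARY_UNTIL.remove();
    }

    /** Hypothetical test hook for a deterministic clock. */
    static void setClockForTests(LongSupplier testClock) {
        clock = testClock;
    }
}
```

A ThreadLocal only follows one thread, so on its own it covers the request-scoped case. For a window that spans requests, the deadline would have to live in the session or a cookie and be re-applied at the start of each request. With Spring, a subclass of AbstractRoutingDataSource could consult usePrimary() inside determineCurrentLookupKey() instead of branching in every service method.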
3. Classify endpoints in API design
| Tier | Route | Example |
|---|---|---|
| Strong | Primary always | Checkout, balance, permissions |
| Read-your-writes | Primary after own write | Profile edit → view profile |
| Eventual | Replica OK | Leaderboard, recommendations |
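The tiers in the table can be expressed as a small lookup that the routing layer consults. The endpoint names below are hypothetical stand-ins for the table's examples, and defaulting unclassified endpoints to the strong tier is a design assumption of this sketch: an unlisted endpoint costs primary capacity rather than correctness.

```java
import java.util.Map;

enum ConsistencyTier { STRONG, READ_YOUR_WRITES, EVENTUAL }

enum ReadTarget { PRIMARY, REPLICA }

final class TierRouter {
    private TierRouter() {}

    // Hypothetical endpoints mirroring the table's examples.
    private static final Map<String, ConsistencyTier> TIERS = Map.of(
            "GET /checkout", ConsistencyTier.STRONG,
            "GET /profile", ConsistencyTier.READ_YOUR_WRITES,
            "GET /leaderboard", ConsistencyTier.EVENTUAL);

    /** Unclassified endpoints fall back to STRONG. */
    static ConsistencyTier tierOf(String endpoint) {
        return TIERS.getOrDefault(endpoint, ConsistencyTier.STRONG);
    }

    /** Maps an endpoint plus the caller's recent-write state to a datasource. */
    static ReadTarget route(String endpoint, boolean callerWroteRecently) {
        switch (tierOf(endpoint)) {
            case EVENTUAL:
                return ReadTarget.REPLICA;
            case READ_YOUR_WRITES:
                return callerWroteRecently ? ReadTarget.PRIMARY : ReadTarget.REPLICA;
            case STRONG:
            default:
                return ReadTarget.PRIMARY;
        }
    }
}
```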
4. Infrastructure options
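- Synchronous replication (for example a PostgreSQL synchronous standby with synchronous_commit = remote_apply): higher write latency, but an acknowledged commit is already visible on that standby. Aurora Global Database replicates asynchronously across Regions, so it does not remove lag by itself.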
- More replica capacity — reduces lag if replica was CPU-bound applying WAL.
- Split batch writes: smaller transactions leave less replication backlog and also reduce lock contention.
5. Product / API honesty
If some reads are eventually consistent, document it and avoid UI that implies instant consistency (e.g. “Saved!” then immediate navigation to replica-backed list without primary routing).
Verify
- Integration test: write on primary → read via app path → sees new data while lag is artificially high in test container.
- Metrics: replica lag p99 within SLO; “stale read” support tickets drop.
- Logs show primary used for classified endpoints after POST.
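The integration-test idea can be prototyped without a database by modelling the replica as a copy that only catches up on demand. Everything in this sketch is hypothetical (LaggingStore, ProfileService, the single-user recent-write flag); it shows the shape of the assertion, not a replacement for a test against real replicas.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a primary plus one replica that applies changes only on catchUp(). */
final class LaggingStore {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();
    private final Deque<String[]> backlog = new ArrayDeque<>();

    void write(String key, String value) {
        primary.put(key, value);
        backlog.addLast(new String[] {key, value}); // queued, not yet on the replica
    }

    String readPrimary(String key) { return primary.get(key); }

    String readReplica(String key) { return replica.get(key); }

    /** Simulates the replica replaying everything it has received. */
    void catchUp() {
        while (!backlog.isEmpty()) {
            String[] change = backlog.pollFirst();
            replica.put(change[0], change[1]);
        }
    }
}

/** App read path with read-your-writes routing (single user, for brevity). */
final class ProfileService {
    private final LaggingStore store;
    private boolean wroteRecently;

    ProfileService(LaggingStore store) { this.store = store; }

    void save(String userId, String name) {
        store.write(userId, name);
        wroteRecently = true;
    }

    String view(String userId) {
        return wroteRecently ? store.readPrimary(userId) : store.readReplica(userId);
    }
}
```

A test then saves through ProfileService, confirms that readReplica still returns the old value, and asserts that view returns the new one. Against a real PostgreSQL replica in a test container, a similar effect can be produced by pausing replay with pg_wal_replay_pause() and resuming it with pg_wal_replay_resume().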
Interview one-liner
“Replicas apply changes asynchronously and can lag, so read-after-write must hit the primary or wait. I measure replay lag, log whether queries use primary or replica, and route session-critical reads to primary after writes, reserving replicas for eventually consistent workloads.”