Stale reads after an update (read replica lag)

Scenario

A user saves a profile, refreshes the page, and sees old data. Support sees “it fixed itself after a minute.” Writes go to the primary; reads use a replica that trails by seconds (or minutes) under load. The app is not wrong about SQL—it is wrong about consistency expectations. You need to measure lag and choose a read-your-writes strategy.

After reading, you should be able to:

Explain asynchronous replication and the read-your-writes problem.
Measure replica lag on PostgreSQL and MySQL/Aurora.
Route critical reads to primary after writes (or use versioned routing).
Separate “eventual OK” reads from “must be fresh” paths in API design.

Why — replicas are eventually consistent

A read replica applies the primary’s WAL/binlog asynchronously. Some replication lag is normal; it grows with burst writes, large transactions, network delay, replica CPU saturation, or maintenance. If the app reads from a replica immediately after writing to the primary, it may not see the commit yet. Users call it a bug; operators call it eventual consistency. The symptom looks the same as inconsistent logs or a stale cache, so confirm which read path was used before blaming replication.

When lag hurts

Flow	User impact
POST update → GET detail from replica	Old values shown
Create order → list orders from replica	Missing new order
Idempotency check on replica	Duplicate writes
Auth/session flags on replica	Random 401/403

When lag is acceptable

Analytics, search facets, dashboards (minutes stale OK).
Public catalog with TTL cache and documented delay.
Heavy reporting reads served from a dedicated replica (as long as they do not need the user’s own just-written row).

What — measure lag and prove the path

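Replication lag metrics
- PostgreSQL: write_lag, flush_lag, and replay_lag in pg_stat_replication (queried on the primary). On the replica, now() - pg_last_xact_replay_timestamp() gives an approximate delay, which overstates lag when the primary is idle. On RDS, the CloudWatch metric is ReplicaLag.
- MySQL: Seconds_Behind_Source (interpret with care: it is NULL when replication is not running). On Aurora, the CloudWatch metric is AuroraReplicaLag.
Correlate with incidents: check whether lag spikes line up with “stale read” tickets, and whether a write-heavy deploy or batch job was running on the primary at the time.
Log the datasource target in the app: add a structured field db_role=primary|replica to each query log entry (via the routing framework or AOP).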
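The db_role field can be sketched as a plain key=value log line. This is only an illustration: the class QueryLog, the DbRole enum, and the op and duration_ms fields are hypothetical names, and a real application would more likely attach db_role through its logging framework from the routing layer.

```java
import java.util.Locale;

/** Hypothetical helper: formats one log line per query, tagged with the datasource role. */
final class QueryLog {
    enum DbRole { PRIMARY, REPLICA }

    /** Builds a key=value line; db_role is the field used to prove which path served the read. */
    static String format(DbRole role, String operation, long durationMs) {
        return String.format(Locale.ROOT, "db_role=%s op=%s duration_ms=%d",
                role.name().toLowerCase(Locale.ROOT), operation, durationMs);
    }

    public static void main(String[] args) {
        // Example line for a read that was served by a replica.
        System.out.println(format(DbRole.REPLICA, "orders.findById", 12));
    }
}
```

With a field like this in place, a stale-read ticket can be matched against log lines for the same request to see whether the read actually went to a replica.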

Reproduce

-- After write on primary (app)
SELECT … FROM orders WHERE id = ?;  -- on replica session → row missing?

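Rule out cache: a stale Redis entry left behind by a write without invalidation produces the same symptom (see the cache stale guide).
Rule out app race: two tabs open for the same user can overwrite each other; if both requests hit the primary, the cause is not replication.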

PostgreSQL lag check

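-- Run on the primary: one row per connected standby.
-- The *_lag columns are available in PostgreSQL 10 and later.
SELECT application_name, state, sync_state,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;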

How — application strategies

1. Read-your-writes (most common fix)

After a write in a request or session, route subsequent reads to primary for a short window or for that user’s entities.

Request-scoped — same HTTP request: write then read → always primary.
Session cookie flag — “recent write” for 5–30s → primary for GETs.
Version token — client sends X-Read-After: <version>; server uses primary until replica catches up (advanced).
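For the version-token option, one possible choice of version on PostgreSQL is the WAL position: the server could return pg_current_wal_lsn() from the primary after the write, and later compare it with pg_last_wal_replay_lsn() on the replica. The sketch below shows only the comparison step. Using an LSN as the token is an assumption (the text above does not say what the version is), and VersionTokenRouter, parseLsn, and route are hypothetical names.

```java
/** Sketch: choose primary or replica from an LSN-style version token (hypothetical names). */
final class VersionTokenRouter {
    enum DbRole { PRIMARY, REPLICA }

    /** Parses PostgreSQL's textual LSN form, e.g. "16/B374D848", into a 64-bit position. */
    static long parseLsn(String lsn) {
        String[] parts = lsn.trim().split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("not an LSN: " + lsn);
        }
        long high = Long.parseLong(parts[0], 16); // upper 32 bits
        long low = Long.parseLong(parts[1], 16);  // lower 32 bits
        return (high << 32) | low;
    }

    /**
     * requiredLsn: token from the X-Read-After header (null or empty if the client sent none).
     * replicaReplayLsn: the replica's current replay position (null if unknown).
     */
    static DbRole route(String requiredLsn, String replicaReplayLsn) {
        if (requiredLsn == null || requiredLsn.isEmpty()) {
            return DbRole.REPLICA; // no freshness requirement
        }
        if (replicaReplayLsn == null) {
            return DbRole.PRIMARY; // cannot prove the replica is fresh enough
        }
        // Unsigned comparison, because the shifted upper half may set the sign bit.
        boolean caughtUp =
                Long.compareUnsigned(parseLsn(replicaReplayLsn), parseLsn(requiredLsn)) >= 0;
        return caughtUp ? DbRole.REPLICA : DbRole.PRIMARY;
    }
}
```

Unlike a fixed time window, this approach falls back to the replica as soon as it has replayed the caller's write, at the cost of one extra lookup of the replica's position.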

2. Routing in Spring (conceptual)

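// Conceptual sketch: ReadContext, primaryJdbc, and replicaJdbc are application-defined.
@Transactional(readOnly = true)
public OrderDto getOrder(Long id) {
  // A recent write by this caller forces the read onto the primary.
  if (ReadContext.usePrimary()) {
    return primaryJdbc.query(...);
  }
  return replicaJdbc.query(...);
}

// After update: pin this caller's reads to the primary for 10 seconds.
ReadContext.setPrimaryForMillis(10_000);
return getOrder(id);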

Libraries: Spring AbstractRoutingDataSource, ShardingSphere, AWS JDBC Driver failover + reader/writer URLs with custom router.
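The ReadContext helper used in the snippet above is not defined in this guide. One way it might look is a per-thread deadline, sketched below with only the JDK. This is an assumption about its design, not a reference implementation. It covers the request-scoped case only: a window that has to survive across requests would need to live in the session or a cookie, and on pooled server threads clear() must run at the end of every request so the flag does not leak to another user.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical ReadContext: a per-thread "use primary until" deadline. */
final class ReadContext {
    // Deadline in System.nanoTime() units; absent means no recent write on this thread.
    private static final ThreadLocal<Long> PRIMARY_UNTIL_NANOS = new ThreadLocal<>();

    private ReadContext() {}

    /** Call right after a write: reads on this thread go to the primary for the given window. */
    static void setPrimaryForMillis(long millis) {
        PRIMARY_UNTIL_NANOS.set(System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(millis));
    }

    /** True while the window is open; an expired deadline is cleaned up on first check. */
    static boolean usePrimary() {
        Long deadline = PRIMARY_UNTIL_NANOS.get();
        if (deadline == null) {
            return false;
        }
        if (deadline - System.nanoTime() > 0) { // subtraction keeps the comparison overflow-safe
            return true;
        }
        PRIMARY_UNTIL_NANOS.remove();
        return false;
    }

    /** Call at the end of each request (for example in a filter) on pooled threads. */
    static void clear() {
        PRIMARY_UNTIL_NANOS.remove();
    }
}
```

With Spring's AbstractRoutingDataSource, the same check could be made inside determineCurrentLookupKey(), so that individual repository methods do not need their own if/else.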

3. Classify endpoints in API design

Tier	Route	Example
Strong	Primary always	Checkout, balance, permissions
Read-your-writes	Primary after own write	Profile edit → view profile
Eventual	Replica OK	Leaderboard, recommendations
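The three tiers in the table can be written down as a small routing function, so that each endpoint declares its tier instead of choosing a datasource ad hoc. The names below (ConsistencyRouting, Tier, DbRole, route) are hypothetical, and how callerWroteRecently is determined (request scope, session flag, or version token) is left open.

```java
/** Sketch: map an endpoint's consistency tier to a datasource role (hypothetical names). */
final class ConsistencyRouting {
    enum Tier { STRONG, READ_YOUR_WRITES, EVENTUAL }
    enum DbRole { PRIMARY, REPLICA }

    static DbRole route(Tier tier, boolean callerWroteRecently) {
        switch (tier) {
            case STRONG:
                return DbRole.PRIMARY; // checkout, balance, permissions
            case READ_YOUR_WRITES:
                // Primary only while the caller's own write may not have replicated yet.
                return callerWroteRecently ? DbRole.PRIMARY : DbRole.REPLICA;
            case EVENTUAL:
                return DbRole.REPLICA; // leaderboard, recommendations
            default:
                throw new IllegalArgumentException("unknown tier: " + tier);
        }
    }
}
```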

4. Infrastructure options

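Synchronous replica (PostgreSQL synchronous standby): higher write latency, lower lag. Reads on the standby are only guaranteed to see a commit when synchronous_commit = remote_apply. Aurora Global Database replicates asynchronously across regions, so it does not remove lag.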
More replica capacity — reduces lag if replica was CPU-bound applying WAL.
Split batch writes into smaller transactions: less replication backlog, and less lock contention on the primary.

5. Product / API honesty

If some reads are eventually consistent, document it and avoid UI that implies instant consistency (e.g. “Saved!” then immediate navigation to replica-backed list without primary routing).

Verify

Integration test: write on primary → read via app path → sees new data while lag is artificially high in test container.
Metrics: replica lag p99 within SLO; “stale read” support tickets drop.
Logs show primary used for classified endpoints after POST.
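The shape of the integration test can be shown with an in-memory stand-in before wiring up real containers. The class below is a simulation only: LaggingReplicaSim and its methods are hypothetical, and no database is involved. Against a real PostgreSQL standby in a test container, one option for creating artificial lag may be the standby setting recovery_min_apply_delay.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** In-memory stand-in for a primary plus one lagging replica (simulation, no real database). */
final class LaggingReplicaSim {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();
    private final Queue<String[]> backlog = new ArrayDeque<>(); // changes not yet replayed

    /** Commits on the primary immediately; the replica sees it only after replayBacklog(). */
    void write(String key, String value) {
        primary.put(key, value);
        backlog.add(new String[] {key, value});
    }

    /** Applies all pending changes in commit order, i.e. the replica catches up. */
    void replayBacklog() {
        String[] change;
        while ((change = backlog.poll()) != null) {
            replica.put(change[0], change[1]);
        }
    }

    /** Reads through the chosen path, as the application's router would. */
    String read(String key, boolean usePrimary) {
        return (usePrimary ? primary : replica).get(key);
    }

    public static void main(String[] args) {
        LaggingReplicaSim db = new LaggingReplicaSim();
        db.write("profile:1", "old");
        db.replayBacklog();
        db.write("profile:1", "new");                    // replica is now lagging
        System.out.println(db.read("profile:1", false)); // stale read via replica
        System.out.println(db.read("profile:1", true));  // fresh read via primary
    }
}
```

A test built on this idea asserts that the application's read path returns the new value while the replica is still behind, which only passes if the read was routed to the primary.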

Interview one-liner

“Replicas lag asynchronously, so read-after-write must hit the primary or wait. I measure replay lag, log whether queries use primary or replica, and route session-critical reads to primary after writes—reserving replicas for eventually consistent workloads.”