Stale reads after an update (read replica lag)

Scenario

A user saves a profile, refreshes the page, and sees old data. Support reports that “it fixed itself after a minute.” Writes go to the primary; reads use a replica that trails by seconds (or minutes) under load. The SQL is correct; the application's consistency assumptions are not. You need to measure lag and choose a read-your-writes strategy.

After reading, you should be able to:

Why — replicas are eventually consistent

A read replica applies the primary’s WAL/binlog asynchronously. Replication lag is normal and grows with burst writes, large transactions, network latency, replica CPU pressure, or maintenance. If the app reads from a replica immediately after writing to the primary, it may not see the commit yet. Users call it a bug; operators call it eventual consistency. The symptom overlaps with stale caches and inconsistent logs, so confirm which read path served the request.
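To make the mechanism concrete, here is a toy in-memory model of asynchronous replication. It is only a sketch: the class `LaggingReplicaDemo` and its methods are hypothetical names invented for illustration, and the list of pending entries merely stands in for the WAL/binlog.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: the primary appends every committed write to a log, and the
// replica replays that log only when applyPending() is called (the "lag").
final class LaggingReplicaDemo {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();
    private final List<String[]> log = new ArrayList<>(); // stand-in for WAL/binlog
    private int applied = 0; // number of log entries the replica has replayed

    void writeToPrimary(String key, String value) {
        primary.put(key, value);
        log.add(new String[] {key, value});
    }

    String readFromPrimary(String key) {
        return primary.get(key);
    }

    String readFromReplica(String key) {
        return replica.get(key);
    }

    // Replays every log entry the replica has not applied yet.
    void applyPending() {
        while (applied < log.size()) {
            String[] entry = log.get(applied++);
            replica.put(entry[0], entry[1]);
        }
    }

    // Entries committed on the primary but not yet visible on the replica.
    int pendingEntries() {
        return log.size() - applied;
    }
}
```

Writing a new value and then calling `readFromReplica` before `applyPending` returns the previous value, which is the stale read the user reports.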

When lag hurts

Flow                                     User impact
POST update → GET detail from replica    Old values shown
Create order → list orders from replica  Missing new order
Idempotency check on replica             Duplicate writes
Auth/session flags on replica            Random 401/403

When lag is acceptable

What — measure lag and prove the path

  1. Replication lag metrics
    • PostgreSQL: pg_stat_replication replay_lag / write_lag; RDS ReplicaLag.
    • MySQL: Seconds_Behind_Source (interpret with care); Aurora AuroraReplicaLag.
  2. Correlate with incidents — lag spike at same time as “stale read” tickets; heavy write deploy or batch job on primary.
  3. Log datasource target in app — structured field db_role=primary|replica on each query (routing framework or AOP).
  4. Reproduce
    -- After write on primary (app)
    SELECT … FROM orders WHERE id = ?;  -- on replica session → row missing?
  5. Rule out cache: a stale Redis entry after a write without invalidation produces the same symptom (see the cache stale guide).
  6. Rule out app race: two tabs, same user. If both requests hit the primary, the cause is not replication.
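Step 3 above (logging the datasource target) could be sketched as a thin wrapper around each query. This is an illustration only: `DbRole`, `RoleTaggedExecutor`, and the log line layout are assumptions, and the `List<String>` sink stands in for whatever structured logger the application already uses.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Supplier;

// Which database role served the query (hypothetical type for this sketch).
enum DbRole { PRIMARY, REPLICA }

// Runs a query and records a structured line containing db_role=primary|replica.
final class RoleTaggedExecutor {
    private final List<String> sink; // stand-in for a structured logger

    RoleTaggedExecutor(List<String> sink) {
        this.sink = sink;
    }

    <T> T run(DbRole role, String queryName, Supplier<T> query) {
        long start = System.nanoTime();
        try {
            return query.get();
        } finally {
            // Logged even when the query throws, so failed reads are attributable too.
            long micros = (System.nanoTime() - start) / 1_000;
            sink.add(String.format(Locale.ROOT, "query=%s db_role=%s duration_us=%d",
                    queryName, role.name().toLowerCase(Locale.ROOT), micros));
        }
    }
}
```

With a field like this on every query, a “stale read” ticket can be matched to the exact requests that were served by a replica.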

PostgreSQL lag check (run on the primary)

SELECT application_name, state, sync_state,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

How — application strategies

1. Read-your-writes (most common fix)

After a write in a request or session, route subsequent reads to primary for a short window or for that user’s entities.

2. Routing in Spring (conceptual)

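// ReadContext is assumed to be an application-defined helper, not a Spring class.
@Transactional(readOnly = true)
public OrderDto getOrder(Long id) {
  // Read from the primary while the read-your-writes window is open.
  if (ReadContext.usePrimary()) {
    return primaryJdbc.query(...);
  }
  // Otherwise a possibly lagging replica is acceptable.
  return replicaJdbc.query(...);
}

// After update: pin this caller's reads to the primary for 10 seconds.
ReadContext.setPrimaryForMillis(10_000);
return getOrder(id);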

Libraries: Spring AbstractRoutingDataSource, ShardingSphere, AWS JDBC Driver failover + reader/writer URLs with custom router.
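The ReadContext helper in the routing example is not defined in the text. One possible shape for the state behind it is sketched below, under stated assumptions: the pin is keyed by a user ID (a thread-local alone would not survive the page refresh, which arrives as a separate request), the clock is injected so the window can be tested, and the name `PrimaryPinRegistry` is hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongSupplier;

// Tracks, per user, until when reads should be routed to the primary.
final class PrimaryPinRegistry {
    private final ConcurrentMap<String, Long> pinnedUntilMillis = new ConcurrentHashMap<>();
    private final LongSupplier clock; // e.g. System::currentTimeMillis in production

    PrimaryPinRegistry(LongSupplier clock) {
        this.clock = clock;
    }

    // Call after a successful write for this user.
    void pinToPrimary(String userId, long windowMillis) {
        long deadline = clock.getAsLong() + windowMillis;
        // Keep the later deadline if an earlier write already pinned the user.
        pinnedUntilMillis.merge(userId, deadline, Long::max);
    }

    // True while the user's window is open; expired entries are dropped.
    boolean usePrimary(String userId) {
        Long deadline = pinnedUntilMillis.get(userId);
        if (deadline == null) {
            return false;
        }
        if (clock.getAsLong() < deadline) {
            return true;
        }
        pinnedUntilMillis.remove(userId, deadline);
        return false;
    }
}
```

An in-memory map only works when every request for a user reaches the same application instance. With several instances, the deadline would more plausibly live in the session store or a signed cookie.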

3. Classify endpoints in API design

Tier              Route                    Example
Strong            Primary always           Checkout, balance, permissions
Read-your-writes  Primary after own write  Profile edit → view profile
Eventual          Replica OK               Leaderboard, recommendations
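The tier table can be expressed directly as a routing decision. The following is a minimal sketch, not a prescribed API: `ConsistencyTier`, `ReadTarget`, and `TierRouter` are hypothetical names, and how `callerWroteRecently` is determined is left to the application.

```java
// The three tiers from the table above.
enum ConsistencyTier { STRONG, READ_YOUR_WRITES, EVENTUAL }

// Where a read is sent.
enum ReadTarget { PRIMARY, REPLICA }

final class TierRouter {
    private TierRouter() {}

    // Maps an endpoint's tier (plus whether the caller just wrote) to a read target.
    static ReadTarget route(ConsistencyTier tier, boolean callerWroteRecently) {
        switch (tier) {
            case STRONG:
                return ReadTarget.PRIMARY; // never tolerate lag
            case READ_YOUR_WRITES:
                return callerWroteRecently ? ReadTarget.PRIMARY : ReadTarget.REPLICA;
            case EVENTUAL:
                return ReadTarget.REPLICA; // lag is acceptable
            default:
                throw new IllegalArgumentException("Unknown tier: " + tier);
        }
    }
}
```

Declaring the tier per endpoint keeps the decision visible in API design reviews instead of hiding it in datasource configuration.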

4. Infrastructure options

5. Product / API honesty

If some reads are eventually consistent, document it and avoid UI that implies instant consistency (e.g. “Saved!” then immediate navigation to replica-backed list without primary routing).

Verify

  1. Integration test: with replica lag held artificially high in a test container, write on the primary, read via the app path, and assert that the new data is returned.
  2. Metrics: replica lag p99 within SLO; “stale read” support tickets drop.
  3. Logs show primary used for classified endpoints after POST.
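The p99 check in step 2 can be computed from sampled lag values. The sketch below assumes lag samples in seconds collected from whichever metric source is in use, uses the nearest-rank percentile definition (other definitions give slightly different values), and introduces the hypothetical helper `LagSlo`.

```java
import java.util.Arrays;

final class LagSlo {
    private LagSlo() {}

    // Nearest-rank percentile: the smallest sample such that at least
    // pct percent of all samples are less than or equal to it.
    static double percentile(double[] samples, double pct) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (pct <= 0 || pct > 100) {
            throw new IllegalArgumentException("pct must be in (0, 100]");
        }
        double[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    // True when the p99 of the sampled replica lag is within the SLO.
    static boolean withinSlo(double[] lagSeconds, double sloSeconds) {
        return percentile(lagSeconds, 99.0) <= sloSeconds;
    }
}
```

Checking the p99 rather than the average matters here because stale-read tickets come from the lag spikes, not from the typical case.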

Interview one-liner

“Replicas lag asynchronously, so read-after-write must hit the primary or wait. I measure replay lag, log whether queries use primary or replica, and route session-critical reads to primary after writes—reserving replicas for eventually consistent workloads.”
