Hot key expires and the database is overwhelmed

Scenario

A popular cache key (homepage config, product drop, feature flag blob) expires at once. Thousands of requests miss together, each loads the DB, and latency explodes—connection pools saturate. Cache hit rate drops from 99% to 0% for one minute. This is a cache stampede (thundering herd on cache miss).

After reading, you should be able to:

Recognize stampede from miss spikes, one hot key, and DB QPS correlated with TTL expiry.
Apply single-flight / per-key locking so one loader repopulates the cache.
Use TTL jitter, stale-while-revalidate, and prewarm to spread or hide expiry.
Avoid mass invalidation patterns that trigger stampedes — stale cache guide.

Why — many misses, one expensive load

Cache-aside on miss: if (!cache.has(key)) { data = db.load(); cache.put(key, data); }. If N concurrent threads miss the same key, N identical DB queries run until the first put completes. For a hot key at viral traffic, N can be thousands—DB becomes the bottleneck even though one row would suffice.

Common triggers

Fixed TTL — thousands of keys or one hot key expire at the same second.
Deploy / flush — FLUSHDB or global cache clear before restart.
Mass eviction after update without single-flight — fixing staleness triggers herd.
Cold start — new pods with empty local cache under load.
Thundering herd on popular entity — celebrity product, breaking news config key.

What — detect a stampede

Metrics spike together — cache miss rate ↑, DB QPS ↑, API p99 ↑, hikaricp.connections.pending ↑ — same minute.
One key dominates — Redis SLOWLOG / custom metric: top miss key config:global or product:12345.
Align with TTL or ops event — expiry at :00, deploy time, manual flush.
Duplicate queries — same SQL in DB logs hundreds of times per second with identical binds.
Not general overload — if all keys miss, suspect flush; if one key, stampede pattern.

How — prevent and mitigate

1. Single-flight (request coalescing) — primary fix

Only one thread loads DB per key; others wait or get in-flight result.

// Caffeine: built-in for LoadingCache
LoadingCache<String, Config> cache = Caffeine.newBuilder()
  .expireAfterWrite(Duration.ofMinutes(10))
  .build(key -> configRepo.load(key));  // one loader per key

// Manual: per-key lock
private final ConcurrentHashMap<String, Object> locks = new ConcurrentHashMap<>();
Object lock = locks.computeIfAbsent(key, k -> new Object());
synchronized (lock) {
  try {
    if (!redis.has(key)) {
      redis.set(key, db.load(key), ttl);
    }
  } finally {
    locks.remove(key, lock);
  }
}

2. Distributed lock (Redis) for cross-pod loads

if (redis.set("lock:load:" + key, "1", NX, EX=5)) {
  try {
    redis.set(key, db.load(key), ttl);
  } finally {
    redis.del("lock:load:" + key);
  }
} else {
  // brief wait + retry get, or return stale placeholder
}

3. TTL jitter — spread expiry

int jitterSeconds = ThreadLocalRandom.current().nextInt(0, 120);
redis.set(key, value, baseTtl + jitterSeconds);

Prevents millions of keys expiring in the same second after a deploy.

4. Stale-while-revalidate

Serve slightly stale value immediately; one async worker refreshes in background. User sees old data briefly—not DB meltdown.

5. Probabilistic early refresh

Before hard expiry, each read has small chance to refresh key—spreads load over time (used in some CDNs and memcached patterns).

6. Prewarm and never hard-expire hot keys

Background job refreshes hot keys before TTL.
“Logical expiry” in value with refresh window; physical key stays until replaced.

7. Protect the database

Rate limit reload path per key.
Circuit breaker if loader fails—return stale or degraded response — bulkheads.
Do not FLUSHALL in prod without traffic drain.

Anti-patterns

Global cache clear on every minor config change.
Zero TTL on hot keys “to avoid staleness” without single-flight.
Each pod loading independently with no shared Redis lock for global keys.

Verify

Load test: expire hot key under 1k RPS → DB QPS stays near single-digit loaders, not 1k.
Miss storm metric: max concurrent loaders per key = 1.
p99 stable through artificial expiry drill.

Interview one-liner

“A stampede is many concurrent cache misses on the same key each hitting the DB—I fix it with single-flight or a Redis load lock, add TTL jitter, and use stale-while-revalidate or prewarm for hot keys so expiry never becomes a synchronized miss.”