Design a distributed cache (Redis-like)
A distributed in-memory cache sits between your application and databases—serving GET/SET in sub-millisecond time at massive QPS. Products like Redis, Memcached, and managed ElastiCache solve the same problem: partition data across nodes, replicate for availability, evict when memory is full, and expose simple data structures (strings, hashes, sorted sets) with optional TTL.
In interviews, this question tests sharding (consistent hashing), replication lag, cache-aside vs write-through, hot keys, and what breaks during node failure or cluster rebalancing. This guide matches the depth of the other classic articles here, including failure points and failure modes.
Design prompt
Design a distributed in-memory key-value cache that applications use to reduce database load.
Support horizontal scale, replication, TTL expiration, and high availability when nodes fail or the cluster grows.
What you should be able to do after reading:
- Shard keys with consistent hashing and explain virtual nodes.
- Compare cache patterns: cache-aside, read-through, write-through, write-back.
- Design primary-replica replication and failover tradeoffs.
- Handle hot keys, eviction, and TTL expiration at scale.
- Map failure points and failure modes (stampede, split brain, rebalance loss).
1. Requirements gathering
1.1 Functional requirements
- Basic KV —
GET,SET,DELon string values. - TTL — expire keys after N seconds.
- Data structures (optional depth) — hash, list, set, sorted set for leaderboards.
- Atomic operations —
INCR, compare-and-set, Lua scripts on single key. - Batch —
MGET/MSETfor multiple keys (may cross shards). - Cluster admin — add/remove nodes; rebalance keyspace.
Usually out of scope unless asked: full Redis persistence SLAs (AOF fsync every write), pub/sub at global scale, multi-key ACID transactions across shards, building a SQL query engine on top.
1.2 Non-functional requirements
- Latency — p99 < 1 ms within same AZ; < 5 ms cross-AZ read.
- Throughput — millions of ops/sec per cluster.
- Availability — survive single node failure; target 99.95%+ for cache tier.
- Scalability — add shards to increase memory and QPS linearly (mostly).
- Durability — cache is best-effort; DB is source of truth (cache-aside). Optional snapshot for warm restart.
- Consistency — eventual between primary and replicas; no cross-shard transactions in MVP.
- Memory bound — eviction when full; never OOM the host uncontrolled.
Assumptions for capacity math: 100-node cluster; 64 GB RAM per node (50 GB usable); 500 bytes avg value + key overhead → ~100M keys per node; 10M aggregate ops/sec; 3 replicas per shard for HA (1 primary + 2 replicas); 0.1% of keys are hot (10× traffic).
2. Capacity estimation
2.1 Memory
Nodes = 100 Usable RAM per node = 50 GB Total cluster capacity ≈ 5 TB Avg entry size = 500 B (key + value + metadata) Keys per node ≈ 50 GB / 500 B ≈ 100 million Total keys ≈ 10 billion (upper bound if perfectly balanced)
2.2 Throughput per node
Cluster target = 10,000,000 ops/sec Per node (100 shards) ≈ 100,000 ops/sec Single-threaded Redis-like core ≈ 100k–200k simple ops/sec per core → May need multi-threaded I/O or multiple shards per physical node
2.3 Network
10M ops/sec × 1 KB avg payload ≈ 10 GB/sec cluster egress+ingress Hot keys amplify single-host NIC limits
2.4 Replication overhead
Write amplification: each SET replicated to 2 replicas → 3× write bytes Read scaling: serve 80% reads from replicas → primary handles writes + 20% reads
3. High-level design
- Client library / proxy — routes key to correct shard; topology-aware.
- Shard node — in-memory hashtable; event loop; optional persistence thread.
- Replication — primary streams command log to replicas.
- Cluster manager — membership, health checks, failover, migration plan.
- Gossip / control plane — spread topology; epoch numbers to avoid stale routing.
flowchart TB
APP[Application servers]
CL[Smart client / proxy]
subgraph cluster [Cache cluster]
S1[Shard 1 primary]
R1[Shard 1 replica]
S2[Shard 2 primary]
R2[Shard 2 replica]
SN[Shard N primary]
end
CM[Cluster manager]
APP --> CL
CL --> S1
CL --> S2
CL --> SN
S1 --> R1
S2 --> R2
CM --> S1
CM --> S2
CM --> SN
Sharding: consistent hashing
hash_ring = 0 .. 2^32-1 Each physical node = 100 virtual nodes on ring key_slot = hash(key) mod 2^32 owner = first vnode clockwise from key_slot
Adding a node moves only K/n keys (approx)—not full reshuffle like hash(key) % N.
sequenceDiagram
participant C as Client
participant P as Primary shard
participant R as Replica
C->>P: SET user:42 profile_blob TTL 3600
P->>P: apply in memory
P->>R: replicate SET
P-->>C: OK
C->>R: GET user:42 (read from replica)
R-->>C: profile_blob
4. Data design on a shard
4.1 In-memory structure
dict[key] -> {
value: bytes,
expires_at: uint64 | null,
type: string | hash | zset,
lru_clock: uint32
}
4.2 TTL handling
- Lazy expire — check TTL on
GET; delete if expired. - Active expire — sample 20 keys every 100 ms; delete expired (Redis-style).
- Never block the event loop scanning all keys.
4.3 Eviction policies (when memory max reached)
| Policy | Behavior | Use when |
|---|---|---|
| allkeys-lru | Evict least recently used any key | General cache |
| volatile-lru | Evict LRU among keys with TTL | Mix of permanent + TTL |
| allkeys-lfu | Evict least frequently used | Hot key retention better |
| noeviction | Return errors on SET when full | Rate limiting counters |
5. API design (client protocol)
5.1 Core commands
GET key → value | NULL SET key value [EX seconds] [NX] DEL key → 1 | 0 INCR key → integer (atomic on single shard) MGET k1 k2 k3 → parallel to shards, merge EXPIRE key ttl
5.2 Cluster-aware redirect
MOVED 3999 10.0.1.15:6379 # permanent slot owner change ASK 3999 10.0.1.20:6379 # temporary during migration
Client updates slot map on MOVED; retries once on ASK.
5.3 Admin
CLUSTER NODES CLUSTER ADDSLOTS / MIGRATE key dest-node FAILOVER shard-3
6. Diving deep into key components
6.1 Cache usage patterns (application side)
| Pattern | Read | Write | Pros / cons |
|---|---|---|---|
| Cache-aside | App reads cache; on miss, read DB and populate | App writes DB then invalidates cache | Simple; stale window possible |
| Read-through | Cache loads from DB on miss | App writes through cache layer | Centralized logic |
| Write-through | — | Write cache + DB synchronously | Consistent; slower writes |
| Write-back | — | Write cache; async flush to DB | Fast; risk loss on crash |
6.2 Replication
- Primary-replica — all writes to primary; append replication log to replicas.
- Async replication — low latency; replicas may lag milliseconds to seconds.
- Sync quorum (optional) — wait for 1 replica ACK before OK; higher durability, higher latency.
- Failover — cluster manager promotes replica; bump
epoch; clients refresh map.
6.3 Hot keys
One viral key (celebrity profile, global config) overloads a single shard.
- Local application cache — 100 ms TTL in app memory.
- Key replication —
key#1,key#2,key#3on different shards; random read. - Read replicas — more replicas for that shard only.
6.4 Cache stampede (thundering herd)
Popular key expires → 10k threads miss → all hit DB.
- Probabilistic early expiration — refresh before TTL ends.
- Mutex / single-flight — only one loader per key; others wait or get stale value.
- Never expire hot keys — background refresh job.
6.5 Slot migration (scale out)
- Assign empty slots to new node.
- For each migrating slot: set keys to
IMPORTINGon dest,MIGRATINGon source. - Move keys in batches; concurrent requests get
ASKredirect. - Finish slot; update cluster map; broadcast
MOVED.
6.6 Optional persistence (warm restart)
- RDB snapshot — periodic fork + write disk; fast recovery; may lose last minutes.
- AOF log — append each write; configurable fsync; heavier disk.
- For pure cache tier, persistence optional—rebuild from DB on cold start.
6.7 Single-threaded vs multi-threaded
Redis uses one thread for commands (simple, no locks) + I/O threads in newer versions. Alternative: sharded multi-process on one machine (Memcached style) to use all cores.
7. Failure points
Failure points are where faults cause stale data, lost keys, outages, or backend overload. Cache failures should degrade gracefully—the database must survive cache loss.
| # | Failure point | What breaks | Detection | Mitigation design |
|---|---|---|---|---|
| FP1 | Primary crash before replicate | Recent writes lost | Clients see old value after failover | Sync repl for critical keys; app treats cache as optional |
| FP2 | Split brain (two primaries) | Divergent writes same slot | Inconsistent reads per key | Fencing tokens; quorum failover; epoch in cluster map |
| FP3 | Replica lag | Read your writes fails | User sees stale session after update | Read from primary after write; monotonic token |
| FP4 | Stale client slot map | Requests to wrong node | MOVED storm; elevated misses | MOVED/ASK handling; periodic map refresh |
| FP5 | Migration in progress | Key not found during move | Temporary 404 on GET | ASK redirects; dual-read source+dest until done |
| FP6 | Hot key on one shard | CPU/NIC saturated | p99 latency one shard only | Key fan-out; local cache; dedicated replica scale |
| FP7 | TTL expiry + cache miss | DB overload | DB RPS spike at :00 | Single-flight; jitter TTL; proactive refresh |
| FP8 | Eviction of hot key | Key gone despite demand | Miss rate spike one key | LFU policy; protect pinned keys; raise memory |
| FP9 | Memory limit / OOM | Process killed | Node flapping | maxmemory + eviction; monitor; no unbounded keys |
| FP10 | Invalidation race (cache-aside) | Del then old SET wins | Stale data served hours | TTL always; version in value; delete-after-write ordering |
flowchart LR
APP[App] -->|FP7 FP10| CL[Client]
CL -->|FP4 FP5| P[Primary]
P -->|FP1 FP3| R[Replica]
P -->|FP6| P
P -->|FP9| P
CM[Cluster manager] -->|FP2 FP5| P
8. Failure modes
8.1 Cache stampede
Symptom: DB latency spikes when hot key expires.
Cause: FP7 — thundering herd on miss.
Safe response: Single-flight loader; stagger TTL with jitter.
8.2 Stale read after write
Symptom: User updates profile; still sees old name.
Cause: FP3 — read replica lag; FP10 — invalidation race.
Safe response: Read-from-primary after write; version field in cached blob.
8.3 Lost writes on failover
Symptom: Session exists on one server, gone after failover.
Cause: FP1 — async replication.
Safe response: Session in DB; or sync repl; client retry.
8.4 Split brain divergent values
Symptom: Same key returns different values from different clients.
Cause: FP2 — network partition; dual primary.
Safe response: Quorum-based failover; fence old primary; short TTL on all keys.
8.5 Cluster-wide cold start
Symptom: Cache empty after restart; DB melts.
Cause: No persistence; mass restart.
Safe response: Warm cache from snapshot; gradual traffic ramp; circuit breaker on DB.
8.6 Hot key shard meltdown
Symptom: One shard at 100% CPU; rest idle.
Cause: FP6 — skewed access.
Safe response: Replicate key to multiple shards; app-side local cache.
8.7 Migration data loss
Symptom: Keys missing after adding node.
Cause: FP5 — incomplete migration; wrong slot state.
Safe response: Verify key count per slot; rollback migration; ASK until complete.
8.8 MOVED loop / wrong routing
Symptom: Clients spin on redirects.
Cause: FP4 — buggy client map.
Safe response: Max redirect count; force topology refresh from cluster manager.
8.9 OOM kill cascade
Symptom: Nodes die repeatedly under memory pressure.
Cause: FP9 — no eviction; large values.
Safe response: maxmemory policy; alert at 80%; reject large SETs.
8.10 Cache penetration (bogus keys)
Symptom: Attack queries random IDs; DB always hit.
Cause: Not a cache bug—missing negative cache.
Safe response: Cache empty results briefly; bloom filter in app.
| Failure mode | Primary failure points | Impact | Core mitigation |
|---|---|---|---|
| Cache stampede | FP7 | DB overload | Single-flight + jitter TTL |
| Stale read | FP3, FP10 | Bad UX | Read primary / versioning |
| Lost write on failover | FP1 | Data loss feel | Sync repl or don’t trust cache |
| Split brain | FP2 | Inconsistency | Quorum failover |
| Cold start | Restart | DB spike | Warm RDB; ramp traffic |
| Hot key | FP6 | Latency | Key fan-out |
| Migration loss | FP5 | Missing keys | Safe migration protocol |
| Routing loop | FP4 | Outage | Client map refresh |
| OOM cascade | FP9 | Shard down | Eviction policy |
| Cache penetration | App pattern | DB load | Negative cache |
9. Scalability, availability, and security
9.1 Scalability
- Add shards to scale memory and write QPS.
- Add read replicas per hot shard.
- Pipelining and connection pooling on clients.
- Avoid cross-slot multi-key transactions in MVP.
9.2 Availability
- Multi-AZ placement; replica in different AZ than primary.
- Automatic failover with health checks (heartbeat < 1 s detect, < 30 s promote).
- Application: treat cache miss as normal—never require cache for liveness.
9.3 Security
- TLS in transit; ACL per application user.
- Disable dangerous commands in prod (
FLUSHALL). - Do not store secrets unencrypted; cache is not a vault.
10. Tradeoffs recap
| Decision | Common choice | Why |
|---|---|---|
| Sharding | Consistent hashing + vnodes | Minimal key movement on scale |
| Replication | Async primary-replica | Latency vs durability |
| App pattern | Cache-aside | Simple; DB is truth |
| Eviction | allkeys-lru / allkeys-lfu | Protect hot items (LFU) |
| Persistence | Optional RDB for warm start | Cache not primary store |
11. How to present this in 45 minutes
- 5 min — requirements; clarify cache vs primary database.
- 7 min — capacity: memory, ops/sec per shard, replication factor.
- 10 min — consistent hashing; primary-replica; client routing diagram.
- 8 min — cache-aside, hot keys, stampede, eviction/TTL.
- 10 min — failure points + failure modes (split brain, stampede, migration).
- 5 min — slot migration, tradeoffs vs Memcached/Redis managed.
The one line to remember
A distributed cache is sharded in-memory state with consistent hashing, async replication for read scale and failover, and application patterns that treat it as optional acceleration—while you plan for hot keys, TTL stampedes, and topology changes as first-class failure domains.