Design a distributed cache (Redis-like)

A distributed in-memory cache sits between your application and databases—serving GET/SET in sub-millisecond time at massive QPS. Products like Redis, Memcached, and managed ElastiCache solve the same problem: partition data across nodes, replicate for availability, evict when memory is full, and expose simple data structures (strings, hashes, sorted sets) with optional TTL.

In interviews, this question tests sharding (consistent hashing), replication lag, cache-aside vs write-through, hot keys, and what breaks during node failure or cluster rebalancing. This guide matches the depth of the other classic articles here, including failure points and failure modes.

Design prompt

Design a distributed in-memory key-value cache that applications use to reduce database load.

Support horizontal scale, replication, TTL expiration, and high availability when nodes fail or the cluster grows.

What you should be able to do after reading:

Shard keys with consistent hashing and explain virtual nodes.
Compare cache patterns: cache-aside, read-through, write-through, write-back.
Design primary-replica replication and failover tradeoffs.
Handle hot keys, eviction, and TTL expiration at scale.
Map failure points and failure modes (stampede, split brain, rebalance loss).

1. Requirements gathering

1.1 Functional requirements

Basic KV — GET, SET, DEL on string values.
TTL — expire keys after N seconds.
Data structures (optional depth) — hash, list, set, sorted set for leaderboards.
Atomic operations — INCR, compare-and-set, Lua scripts on single key.
Batch — MGET/MSET for multiple keys (may cross shards).
Cluster admin — add/remove nodes; rebalance keyspace.

Usually out of scope unless asked: full Redis persistence SLAs (AOF fsync every write), pub/sub at global scale, multi-key ACID transactions across shards, building a SQL query engine on top.

1.2 Non-functional requirements

Latency — p99 < 1 ms within same AZ; < 5 ms cross-AZ read.
Throughput — millions of ops/sec per cluster.
Availability — survive single node failure; target 99.95%+ for cache tier.
Scalability — add shards to increase memory and QPS linearly (mostly).
Durability — cache is best-effort; DB is source of truth (cache-aside). Optional snapshot for warm restart.
Consistency — eventual between primary and replicas; no cross-shard transactions in MVP.
Memory bound — eviction when full; never OOM the host uncontrolled.

Assumptions for capacity math: 100-node cluster; 64 GB RAM per node (50 GB usable); 500 bytes avg value + key overhead → ~100M keys per node; 10M aggregate ops/sec; 3 replicas per shard for HA (1 primary + 2 replicas); 0.1% of keys are hot (10× traffic).

2. Capacity estimation

2.1 Memory

Nodes = 100
Usable RAM per node = 50 GB
Total cluster capacity ≈ 5 TB

Avg entry size = 500 B (key + value + metadata)
Keys per node ≈ 50 GB / 500 B ≈ 100 million
Total keys ≈ 10 billion (upper bound if perfectly balanced)

2.2 Throughput per node

Cluster target = 10,000,000 ops/sec
Per node (100 shards) ≈ 100,000 ops/sec

Single-threaded Redis-like core ≈ 100k–200k simple ops/sec per core
→ May need multi-threaded I/O or multiple shards per physical node

2.3 Network

10M ops/sec × 1 KB avg payload ≈ 10 GB/sec cluster egress+ingress
Hot keys amplify single-host NIC limits

2.4 Replication overhead

Write amplification: each SET replicated to 2 replicas → 3× write bytes
Read scaling: serve 80% reads from replicas → primary handles writes + 20% reads

3. High-level design

Client library / proxy — routes key to correct shard; topology-aware.
Shard node — in-memory hashtable; event loop; optional persistence thread.
Replication — primary streams command log to replicas.
Cluster manager — membership, health checks, failover, migration plan.
Gossip / control plane — spread topology; epoch numbers to avoid stale routing.

flowchart TB
  APP[Application servers]
  CL[Smart client / proxy]
  subgraph cluster [Cache cluster]
    S1[Shard 1 primary]
    R1[Shard 1 replica]
    S2[Shard 2 primary]
    R2[Shard 2 replica]
    SN[Shard N primary]
  end
  CM[Cluster manager]
  APP --> CL
  CL --> S1
  CL --> S2
  CL --> SN
  S1 --> R1
  S2 --> R2
  CM --> S1
  CM --> S2
  CM --> SN

Sharding: consistent hashing

hash_ring = 0 .. 2^32-1
Each physical node = 100 virtual nodes on ring
key_slot = hash(key) mod 2^32
owner = first vnode clockwise from key_slot

Adding a node moves only K/n keys (approx)—not full reshuffle like hash(key) % N.

sequenceDiagram
  participant C as Client
  participant P as Primary shard
  participant R as Replica
  C->>P: SET user:42 profile_blob TTL 3600
  P->>P: apply in memory
  P->>R: replicate SET
  P-->>C: OK
  C->>R: GET user:42 (read from replica)
  R-->>C: profile_blob

4. Data design on a shard

4.1 In-memory structure

dict[key] -> {
  value: bytes,
  expires_at: uint64 | null,
  type: string | hash | zset,
  lru_clock: uint32
}

4.2 TTL handling

Lazy expire — check TTL on GET; delete if expired.
Active expire — sample 20 keys every 100 ms; delete expired (Redis-style).
Never block the event loop scanning all keys.

4.3 Eviction policies (when memory max reached)

Policy	Behavior	Use when
allkeys-lru	Evict least recently used any key	General cache
volatile-lru	Evict LRU among keys with TTL	Mix of permanent + TTL
allkeys-lfu	Evict least frequently used	Hot key retention better
noeviction	Return errors on SET when full	Rate limiting counters

5. API design (client protocol)

5.1 Core commands

GET key          → value | NULL
SET key value [EX seconds] [NX]
DEL key          → 1 | 0
INCR key         → integer (atomic on single shard)
MGET k1 k2 k3    → parallel to shards, merge
EXPIRE key ttl

5.2 Cluster-aware redirect

MOVED 3999 10.0.1.15:6379   # permanent slot owner change
ASK  3999 10.0.1.20:6379   # temporary during migration

Client updates slot map on MOVED; retries once on ASK.

5.3 Admin

CLUSTER NODES
CLUSTER ADDSLOTS / MIGRATE key dest-node
FAILOVER shard-3

6. Diving deep into key components

6.1 Cache usage patterns (application side)

Pattern	Read	Write	Pros / cons
Cache-aside	App reads cache; on miss, read DB and populate	App writes DB then invalidates cache	Simple; stale window possible
Read-through	Cache loads from DB on miss	App writes through cache layer	Centralized logic
Write-through	—	Write cache + DB synchronously	Consistent; slower writes
Write-back	—	Write cache; async flush to DB	Fast; risk loss on crash

6.2 Replication

Primary-replica — all writes to primary; append replication log to replicas.
Async replication — low latency; replicas may lag milliseconds to seconds.
Sync quorum (optional) — wait for 1 replica ACK before OK; higher durability, higher latency.
Failover — cluster manager promotes replica; bump epoch; clients refresh map.

6.3 Hot keys

One viral key (celebrity profile, global config) overloads a single shard.

Local application cache — 100 ms TTL in app memory.
Key replication — key#1, key#2, key#3 on different shards; random read.
Read replicas — more replicas for that shard only.

6.4 Cache stampede (thundering herd)

Popular key expires → 10k threads miss → all hit DB.

Probabilistic early expiration — refresh before TTL ends.
Mutex / single-flight — only one loader per key; others wait or get stale value.
Never expire hot keys — background refresh job.

6.5 Slot migration (scale out)

Assign empty slots to new node.
For each migrating slot: set keys to IMPORTING on dest, MIGRATING on source.
Move keys in batches; concurrent requests get ASK redirect.
Finish slot; update cluster map; broadcast MOVED.

6.6 Optional persistence (warm restart)

RDB snapshot — periodic fork + write disk; fast recovery; may lose last minutes.
AOF log — append each write; configurable fsync; heavier disk.
For pure cache tier, persistence optional—rebuild from DB on cold start.

6.7 Single-threaded vs multi-threaded

Redis uses one thread for commands (simple, no locks) + I/O threads in newer versions. Alternative: sharded multi-process on one machine (Memcached style) to use all cores.

7. Failure points

Failure points are where faults cause stale data, lost keys, outages, or backend overload. Cache failures should degrade gracefully—the database must survive cache loss.

#	Failure point	What breaks	Detection	Mitigation design
FP1	Primary crash before replicate	Recent writes lost	Clients see old value after failover	Sync repl for critical keys; app treats cache as optional
FP2	Split brain (two primaries)	Divergent writes same slot	Inconsistent reads per key	Fencing tokens; quorum failover; epoch in cluster map
FP3	Replica lag	Read your writes fails	User sees stale session after update	Read from primary after write; monotonic token
FP4	Stale client slot map	Requests to wrong node	MOVED storm; elevated misses	MOVED/ASK handling; periodic map refresh
FP5	Migration in progress	Key not found during move	Temporary 404 on GET	ASK redirects; dual-read source+dest until done
FP6	Hot key on one shard	CPU/NIC saturated	p99 latency one shard only	Key fan-out; local cache; dedicated replica scale
FP7	TTL expiry + cache miss	DB overload	DB RPS spike at :00	Single-flight; jitter TTL; proactive refresh
FP8	Eviction of hot key	Key gone despite demand	Miss rate spike one key	LFU policy; protect pinned keys; raise memory
FP9	Memory limit / OOM	Process killed	Node flapping	maxmemory + eviction; monitor; no unbounded keys
FP10	Invalidation race (cache-aside)	Del then old SET wins	Stale data served hours	TTL always; version in value; delete-after-write ordering

flowchart LR
  APP[App] -->|FP7 FP10| CL[Client]
  CL -->|FP4 FP5| P[Primary]
  P -->|FP1 FP3| R[Replica]
  P -->|FP6| P
  P -->|FP9| P
  CM[Cluster manager] -->|FP2 FP5| P

8. Failure modes

8.1 Cache stampede

Symptom: DB latency spikes when hot key expires.

Cause: FP7 — thundering herd on miss.

Safe response: Single-flight loader; stagger TTL with jitter.

8.2 Stale read after write

Symptom: User updates profile; still sees old name.

Cause: FP3 — read replica lag; FP10 — invalidation race.

Safe response: Read-from-primary after write; version field in cached blob.

8.3 Lost writes on failover

Symptom: Session exists on one server, gone after failover.

Cause: FP1 — async replication.

Safe response: Session in DB; or sync repl; client retry.

8.4 Split brain divergent values

Symptom: Same key returns different values from different clients.

Cause: FP2 — network partition; dual primary.

Safe response: Quorum-based failover; fence old primary; short TTL on all keys.

8.5 Cluster-wide cold start

Symptom: Cache empty after restart; DB melts.

Cause: No persistence; mass restart.

Safe response: Warm cache from snapshot; gradual traffic ramp; circuit breaker on DB.

8.6 Hot key shard meltdown

Symptom: One shard at 100% CPU; rest idle.

Cause: FP6 — skewed access.

Safe response: Replicate key to multiple shards; app-side local cache.

8.7 Migration data loss

Symptom: Keys missing after adding node.

Cause: FP5 — incomplete migration; wrong slot state.

Safe response: Verify key count per slot; rollback migration; ASK until complete.

8.8 MOVED loop / wrong routing

Symptom: Clients spin on redirects.

Cause: FP4 — buggy client map.

Safe response: Max redirect count; force topology refresh from cluster manager.

8.9 OOM kill cascade

Symptom: Nodes die repeatedly under memory pressure.

Cause: FP9 — no eviction; large values.

Safe response: maxmemory policy; alert at 80%; reject large SETs.

8.10 Cache penetration (bogus keys)

Symptom: Attack queries random IDs; DB always hit.

Cause: Not a cache bug—missing negative cache.

Safe response: Cache empty results briefly; bloom filter in app.

Failure mode	Primary failure points	Impact	Core mitigation
Cache stampede	FP7	DB overload	Single-flight + jitter TTL
Stale read	FP3, FP10	Bad UX	Read primary / versioning
Lost write on failover	FP1	Data loss feel	Sync repl or don’t trust cache
Split brain	FP2	Inconsistency	Quorum failover
Cold start	Restart	DB spike	Warm RDB; ramp traffic
Hot key	FP6	Latency	Key fan-out
Migration loss	FP5	Missing keys	Safe migration protocol
Routing loop	FP4	Outage	Client map refresh
OOM cascade	FP9	Shard down	Eviction policy
Cache penetration	App pattern	DB load	Negative cache

9. Scalability, availability, and security

9.1 Scalability

Add shards to scale memory and write QPS.
Add read replicas per hot shard.
Pipelining and connection pooling on clients.
Avoid cross-slot multi-key transactions in MVP.

9.2 Availability

Multi-AZ placement; replica in different AZ than primary.
Automatic failover with health checks (heartbeat < 1 s detect, < 30 s promote).
Application: treat cache miss as normal—never require cache for liveness.

9.3 Security

TLS in transit; ACL per application user.
Disable dangerous commands in prod (FLUSHALL).
Do not store secrets unencrypted; cache is not a vault.

10. Tradeoffs recap

Decision	Common choice	Why
Sharding	Consistent hashing + vnodes	Minimal key movement on scale
Replication	Async primary-replica	Latency vs durability
App pattern	Cache-aside	Simple; DB is truth
Eviction	allkeys-lru / allkeys-lfu	Protect hot items (LFU)
Persistence	Optional RDB for warm start	Cache not primary store

11. How to present this in 45 minutes

5 min — requirements; clarify cache vs primary database.
7 min — capacity: memory, ops/sec per shard, replication factor.
10 min — consistent hashing; primary-replica; client routing diagram.
8 min — cache-aside, hot keys, stampede, eviction/TTL.
10 min — failure points + failure modes (split brain, stampede, migration).
5 min — slot migration, tradeoffs vs Memcached/Redis managed.

The one line to remember

A distributed cache is sharded in-memory state with consistent hashing, async replication for read scale and failover, and application patterns that treat it as optional acceleration—while you plan for hot keys, TTL stampedes, and topology changes as first-class failure domains.