sharpbyte.dev

Design a distributed cache (Redis-like)

A distributed in-memory cache sits between your application and databases—serving GET/SET in sub-millisecond time at massive QPS. Products like Redis, Memcached, and managed ElastiCache solve the same problem: partition data across nodes, replicate for availability, evict when memory is full, and expose simple data structures (strings, hashes, sorted sets) with optional TTL.

In interviews, this question tests sharding (consistent hashing), replication lag, cache-aside vs write-through, hot keys, and what breaks during node failure or cluster rebalancing. This guide matches the depth of the other classic articles here, including failure points and failure modes.

Design prompt

Design a distributed in-memory key-value cache that applications use to reduce database load.

Support horizontal scale, replication, TTL expiration, and high availability when nodes fail or the cluster grows.

What you should be able to do after reading:

1. Requirements gathering

1.1 Functional requirements

Usually out of scope unless asked: full Redis persistence SLAs (AOF fsync every write), pub/sub at global scale, multi-key ACID transactions across shards, building a SQL query engine on top.

1.2 Non-functional requirements

Assumptions for capacity math: 100-node cluster; 64 GB RAM per node (50 GB usable); 500 bytes avg value + key overhead → ~100M keys per node; 10M aggregate ops/sec; 3 replicas per shard for HA (1 primary + 2 replicas); 0.1% of keys are hot (10× traffic).

2. Capacity estimation

2.1 Memory

Nodes = 100
Usable RAM per node = 50 GB
Total cluster capacity ≈ 5 TB

Avg entry size = 500 B (key + value + metadata)
Keys per node ≈ 50 GB / 500 B ≈ 100 million
Total keys ≈ 10 billion (upper bound if perfectly balanced)

2.2 Throughput per node

Cluster target = 10,000,000 ops/sec
Per node (100 shards) ≈ 100,000 ops/sec

Single-threaded Redis-like core ≈ 100k–200k simple ops/sec per core
→ May need multi-threaded I/O or multiple shards per physical node

2.3 Network

10M ops/sec × 1 KB avg payload ≈ 10 GB/sec cluster egress+ingress
Hot keys amplify single-host NIC limits

2.4 Replication overhead

Write amplification: each SET replicated to 2 replicas → 3× write bytes
Read scaling: serve 80% reads from replicas → primary handles writes + 20% reads

3. High-level design

flowchart TB
  APP[Application servers]
  CL[Smart client / proxy]
  subgraph cluster [Cache cluster]
    S1[Shard 1 primary]
    R1[Shard 1 replica]
    S2[Shard 2 primary]
    R2[Shard 2 replica]
    SN[Shard N primary]
  end
  CM[Cluster manager]
  APP --> CL
  CL --> S1
  CL --> S2
  CL --> SN
  S1 --> R1
  S2 --> R2
  CM --> S1
  CM --> S2
  CM --> SN
    

Sharding: consistent hashing

hash_ring = 0 .. 2^32-1
Each physical node = 100 virtual nodes on ring
key_slot = hash(key) mod 2^32
owner = first vnode clockwise from key_slot

Adding a node moves only K/n keys (approx)—not full reshuffle like hash(key) % N.

sequenceDiagram
  participant C as Client
  participant P as Primary shard
  participant R as Replica
  C->>P: SET user:42 profile_blob TTL 3600
  P->>P: apply in memory
  P->>R: replicate SET
  P-->>C: OK
  C->>R: GET user:42 (read from replica)
  R-->>C: profile_blob
    

4. Data design on a shard

4.1 In-memory structure

dict[key] -> {
  value: bytes,
  expires_at: uint64 | null,
  type: string | hash | zset,
  lru_clock: uint32
}

4.2 TTL handling

4.3 Eviction policies (when memory max reached)

PolicyBehaviorUse when
allkeys-lruEvict least recently used any keyGeneral cache
volatile-lruEvict LRU among keys with TTLMix of permanent + TTL
allkeys-lfuEvict least frequently usedHot key retention better
noevictionReturn errors on SET when fullRate limiting counters

5. API design (client protocol)

5.1 Core commands

GET key          → value | NULL
SET key value [EX seconds] [NX]
DEL key          → 1 | 0
INCR key         → integer (atomic on single shard)
MGET k1 k2 k3    → parallel to shards, merge
EXPIRE key ttl

5.2 Cluster-aware redirect

MOVED 3999 10.0.1.15:6379   # permanent slot owner change
ASK  3999 10.0.1.20:6379   # temporary during migration

Client updates slot map on MOVED; retries once on ASK.

5.3 Admin

CLUSTER NODES
CLUSTER ADDSLOTS / MIGRATE key dest-node
FAILOVER shard-3

6. Diving deep into key components

6.1 Cache usage patterns (application side)

PatternReadWritePros / cons
Cache-asideApp reads cache; on miss, read DB and populateApp writes DB then invalidates cacheSimple; stale window possible
Read-throughCache loads from DB on missApp writes through cache layerCentralized logic
Write-throughWrite cache + DB synchronouslyConsistent; slower writes
Write-backWrite cache; async flush to DBFast; risk loss on crash

6.2 Replication

6.3 Hot keys

One viral key (celebrity profile, global config) overloads a single shard.

6.4 Cache stampede (thundering herd)

Popular key expires → 10k threads miss → all hit DB.

6.5 Slot migration (scale out)

  1. Assign empty slots to new node.
  2. For each migrating slot: set keys to IMPORTING on dest, MIGRATING on source.
  3. Move keys in batches; concurrent requests get ASK redirect.
  4. Finish slot; update cluster map; broadcast MOVED.

6.6 Optional persistence (warm restart)

6.7 Single-threaded vs multi-threaded

Redis uses one thread for commands (simple, no locks) + I/O threads in newer versions. Alternative: sharded multi-process on one machine (Memcached style) to use all cores.

7. Failure points

Failure points are where faults cause stale data, lost keys, outages, or backend overload. Cache failures should degrade gracefully—the database must survive cache loss.

#Failure pointWhat breaksDetectionMitigation design
FP1 Primary crash before replicate Recent writes lost Clients see old value after failover Sync repl for critical keys; app treats cache as optional
FP2 Split brain (two primaries) Divergent writes same slot Inconsistent reads per key Fencing tokens; quorum failover; epoch in cluster map
FP3 Replica lag Read your writes fails User sees stale session after update Read from primary after write; monotonic token
FP4 Stale client slot map Requests to wrong node MOVED storm; elevated misses MOVED/ASK handling; periodic map refresh
FP5 Migration in progress Key not found during move Temporary 404 on GET ASK redirects; dual-read source+dest until done
FP6 Hot key on one shard CPU/NIC saturated p99 latency one shard only Key fan-out; local cache; dedicated replica scale
FP7 TTL expiry + cache miss DB overload DB RPS spike at :00 Single-flight; jitter TTL; proactive refresh
FP8 Eviction of hot key Key gone despite demand Miss rate spike one key LFU policy; protect pinned keys; raise memory
FP9 Memory limit / OOM Process killed Node flapping maxmemory + eviction; monitor; no unbounded keys
FP10 Invalidation race (cache-aside) Del then old SET wins Stale data served hours TTL always; version in value; delete-after-write ordering
flowchart LR
  APP[App] -->|FP7 FP10| CL[Client]
  CL -->|FP4 FP5| P[Primary]
  P -->|FP1 FP3| R[Replica]
  P -->|FP6| P
  P -->|FP9| P
  CM[Cluster manager] -->|FP2 FP5| P
    

8. Failure modes

8.1 Cache stampede

Symptom: DB latency spikes when hot key expires.

Cause: FP7 — thundering herd on miss.

Safe response: Single-flight loader; stagger TTL with jitter.

8.2 Stale read after write

Symptom: User updates profile; still sees old name.

Cause: FP3 — read replica lag; FP10 — invalidation race.

Safe response: Read-from-primary after write; version field in cached blob.

8.3 Lost writes on failover

Symptom: Session exists on one server, gone after failover.

Cause: FP1 — async replication.

Safe response: Session in DB; or sync repl; client retry.

8.4 Split brain divergent values

Symptom: Same key returns different values from different clients.

Cause: FP2 — network partition; dual primary.

Safe response: Quorum-based failover; fence old primary; short TTL on all keys.

8.5 Cluster-wide cold start

Symptom: Cache empty after restart; DB melts.

Cause: No persistence; mass restart.

Safe response: Warm cache from snapshot; gradual traffic ramp; circuit breaker on DB.

8.6 Hot key shard meltdown

Symptom: One shard at 100% CPU; rest idle.

Cause: FP6 — skewed access.

Safe response: Replicate key to multiple shards; app-side local cache.

8.7 Migration data loss

Symptom: Keys missing after adding node.

Cause: FP5 — incomplete migration; wrong slot state.

Safe response: Verify key count per slot; rollback migration; ASK until complete.

8.8 MOVED loop / wrong routing

Symptom: Clients spin on redirects.

Cause: FP4 — buggy client map.

Safe response: Max redirect count; force topology refresh from cluster manager.

8.9 OOM kill cascade

Symptom: Nodes die repeatedly under memory pressure.

Cause: FP9 — no eviction; large values.

Safe response: maxmemory policy; alert at 80%; reject large SETs.

8.10 Cache penetration (bogus keys)

Symptom: Attack queries random IDs; DB always hit.

Cause: Not a cache bug—missing negative cache.

Safe response: Cache empty results briefly; bloom filter in app.

Failure modePrimary failure pointsImpactCore mitigation
Cache stampedeFP7DB overloadSingle-flight + jitter TTL
Stale readFP3, FP10Bad UXRead primary / versioning
Lost write on failoverFP1Data loss feelSync repl or don’t trust cache
Split brainFP2InconsistencyQuorum failover
Cold startRestartDB spikeWarm RDB; ramp traffic
Hot keyFP6LatencyKey fan-out
Migration lossFP5Missing keysSafe migration protocol
Routing loopFP4OutageClient map refresh
OOM cascadeFP9Shard downEviction policy
Cache penetrationApp patternDB loadNegative cache

9. Scalability, availability, and security

9.1 Scalability

9.2 Availability

9.3 Security

10. Tradeoffs recap

DecisionCommon choiceWhy
ShardingConsistent hashing + vnodesMinimal key movement on scale
ReplicationAsync primary-replicaLatency vs durability
App patternCache-asideSimple; DB is truth
Evictionallkeys-lru / allkeys-lfuProtect hot items (LFU)
PersistenceOptional RDB for warm startCache not primary store

11. How to present this in 45 minutes

  1. 5 min — requirements; clarify cache vs primary database.
  2. 7 min — capacity: memory, ops/sec per shard, replication factor.
  3. 10 min — consistent hashing; primary-replica; client routing diagram.
  4. 8 min — cache-aside, hot keys, stampede, eviction/TTL.
  5. 10 minfailure points + failure modes (split brain, stampede, migration).
  6. 5 min — slot migration, tradeoffs vs Memcached/Redis managed.

The one line to remember

A distributed cache is sharded in-memory state with consistent hashing, async replication for read scale and failover, and application patterns that treat it as optional acceleration—while you plan for hot keys, TTL stampedes, and topology changes as first-class failure domains.