Design a distributed rate limiter

A rate limiter caps how often a client can call your APIs—protecting databases, controlling cost, and ensuring fair use across tenants. At one machine, a counter in memory is enough. At platform scale, thousands of stateless servers must share one consistent budget per API key, user, or IP, with low latency and no race-induced over-admission.

Interviews focus on algorithm choice (token bucket vs sliding window), Redis (or similar) for centralized state, atomic updates, multi-region behavior, and what happens when the limiter itself fails—fail-open vs fail-closed. This guide matches the depth of the other classic prompts here, including failure points and failure modes.

Design prompt

Design a distributed rate limiting service used by all API gateways and internal microservices.

Support per-client rules, burst traffic, low latency (< 5 ms p99), and consistent enforcement across regions.

What you should be able to do after reading:

Compare fixed window, sliding window, token bucket, and leaky bucket.
Implement atomic checks with Redis + Lua (or cell-based counters).
Place the limiter at gateway vs sidecar vs library—and explain tradeoffs.
Size Redis for peak check RPS and hot keys.
Map failure points and choose fail-open vs fail-closed per API class.

1. Requirements gathering

1.1 Functional requirements

Allow / deny decision — given identity + rule, return allowed or 429 Too Many Requests.
Rule types — limit per API key, user ID, IP, endpoint, or global platform cap.
Configurable limits — e.g. 1000 requests/minute with burst 2000.
Multiple windows — per second + per day on same client (optional).
Dynamic config — ops updates limits without redeploying all services.
Headers — return X-RateLimit-Limit, Remaining, Reset (RFC-style).
Whitelist / blacklist — bypass or hard-block specific keys.

Usually out of scope unless asked: full WAF/DDoS (Layer 3/4), bot detection ML, billing metering (related but separate product).

1.2 Non-functional requirements

Low latency — p99 < 5 ms added to request path (often < 2 ms with local cache).
High throughput — millions of checks per second platform-wide.
Accuracy — do not admit 10× budget because of races (eventual consistency OK for some tiers).
Availability — limiter outage must not take down entire product (policy-dependent).
Consistency — strong per-key consistency in one region; cross-region may be approximate.
Fairness — no single tenant starves others on shared infrastructure.
Observability — top violators, 429 rate per rule, Redis latency.

Assumptions for capacity math: 50,000 peak API RPS platform-wide; each request performs 1 rate limit check; average 2 rules evaluated per request (global + per API key); 10M registered API keys; 1% of keys active in any minute; Redis cluster in each region.

2. Capacity estimation

2.1 Check throughput

Peak API RPS = 50,000
Rules per request = 2
Peak rate-limit checks/sec = 50,000 × 2 = 100,000/sec per region

3 regions (US, EU, APAC) → ~100k/sec each if traffic geo-routed
Global Redis ops (if single global store) = 300,000 ops/sec (upper bound)

Each check is typically 1–2 Redis commands (GET + INCR or single Lua EVAL). Plan for 100k–300k Redis ops/sec depending on architecture.

2.2 Redis sizing (rule of thumb)

Single Redis shard: ~100k–250k simple ops/sec depending on instance and pipelining. At 100k checks/sec you need cluster mode with multiple shards or aggressive local caching.

Sharding key: hash(api_key) mod N_shards
Hot key risk: one viral partner → single shard overload

2.3 Memory per counter

Active keys in 1-minute window ≈ 10M × 1% = 100,000 hot keys
Bytes per key (token bucket state): ~64 bytes (key + tokens + last_refill_ts)
Memory ≈ 100k × 64 B ≈ 6.4 MB hot set (tiny)

Rules/config in DB: 10M keys × 200 B rules ≈ 2 GB (cached in gateway memory)

Memory is rarely the bottleneck—ops/sec and hot keys are.

2.4 Network

100k checks/sec × 200 bytes round-trip ≈ 20 MB/s Redis traffic per region
Co-locate limiter Redis with API gateway AZ to minimize RTT

3. High-level design

3.1 Placement options

Placement	Pros	Cons
API gateway	Central enforcement; one config surface	Internal service-to-service calls may bypass
Sidecar (Envoy)	Per-pod; uniform for mesh traffic	Ops complexity
Shared library + Redis	Flexible in any service	Inconsistent if not all services adopt
Dedicated Rate Limiter service	gRPC/HTTP `/check`; language-agnostic	Extra network hop; must be HA

Common interview answer: gateway for edge traffic + library/service for internal critical paths.

Rate Limiter service — stateless; calls Redis; exposes Allow(key, rule).
Redis cluster — holds counters / token state; Lua for atomicity.
Config service — rules per API key; pushed to gateway cache.
Local cache (optional) — short TTL allowance cache to cut Redis QPS.
Analytics — async stream of allow/deny events.

flowchart LR
  C[Client] --> GW[API Gateway]
  GW --> LC[Local allowance cache]
  LC -->|miss| RL[Rate Limiter service]
  RL --> R[(Redis cluster)]
  GW --> API[Backend APIs]
  CFG[Config service] --> GW
  CFG --> RL
  RL --> AN[Analytics Kafka]

Request path

Request arrives with Authorization: Bearer <api_key>.
Gateway resolves rule: 1000 req/min, burst 1500.
Build limiter key: rl:api_key:abc123:rule:default.
Call Redis (Lua) → allowed + remaining tokens, or denied + retry-after.
If allowed → forward to backend; if denied → 429 without hitting DB.

sequenceDiagram
  participant C as Client
  participant G as Gateway
  participant R as Redis
  participant B as Backend
  C->>G: GET /v1/resource
  G->>R: EVAL token_bucket Lua
  alt allowed
    R-->>G: OK remaining=742
    G->>B: proxy request
    B-->>G: 200
    G-->>C: 200 + rate limit headers
  else denied
    R-->>G: DENY retry_after=12
    G-->>C: 429 Too Many Requests
  end

4. Database and state design

4.1 Rules store (PostgreSQL)

CREATE TABLE rate_limit_rules (
  id            UUID PRIMARY KEY,
  subject_type  TEXT NOT NULL,  -- api_key | user | ip | global
  subject_id    TEXT NOT NULL,
  window_sec    INT NOT NULL,
  limit_count   INT NOT NULL,
  burst_count   INT,            -- optional for token bucket
  algorithm     TEXT NOT NULL DEFAULT 'token_bucket',
  enabled       BOOLEAN NOT NULL DEFAULT TRUE,
  updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (subject_type, subject_id)
);

Gateways cache rules in memory; invalidate on config version bump via pub/sub.

4.2 Redis key design

# Token bucket (hash)
KEY:   rl:tb:{subject_type}:{subject_id}
FIELDS: tokens (float), last_refill_ms (int)

# Sliding window counter (string counters)
KEY:   rl:sw:{subject}:{window_start_epoch}
VALUE: count (INCR)

TTL:   window_sec + small buffer (auto expire)

4.3 Atomic token bucket (Lua sketch)

-- KEYS[1] = bucket key
-- ARGV[1] = refill_rate per sec
-- ARGV[2] = capacity (burst)
-- ARGV[3] = now_ms
-- ARGV[4] = requested tokens (usually 1)

local data = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(data[1]) or ARGV[2]
local last_ts = tonumber(data[2]) or ARGV[3]
local delta = math.max(0, ARGV[3] - last_ts) / 1000
tokens = math.min(ARGV[2], tokens + delta * ARGV[1])
if tokens >= ARGV[4] then
  tokens = tokens - ARGV[4]
  redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', ARGV[3])
  redis.call('EXPIRE', KEYS[1], 3600)
  return {1, tokens}  -- allowed, remaining
else
  return {0, tokens}  -- denied, remaining
end

Lua runs atomically on the Redis shard that owns the key—critical to prevent over-admission.

5. API design

5.1 Check allowance (internal)

POST /v1/ratelimit/check

{
  "subject": { "type": "api_key", "id": "key_abc" },
  "rule_id": "default",
  "cost": 1
}

200 OK — allowed

{
  "allowed": true,
  "limit": 1000,
  "remaining": 742,
  "reset_at": "2026-05-27T20:16:00Z"
}

200 OK — denied (or 429 at gateway)

{
  "allowed": false,
  "retry_after_sec": 12
}

5.2 Admin: set rule

PUT /v1/ratelimit/rules/{subject_type}/{subject_id}

{
  "window_sec": 60,
  "limit_count": 1000,
  "burst_count": 1500,
  "algorithm": "token_bucket"
}

5.3 Response headers (edge)

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1716840960
Retry-After: 12          # only on 429

6. Diving deep into key components

6.1 Algorithm comparison

Algorithm	Behavior	Burst	Boundary issue	Interview use
Fixed window	Counter per clock minute	Hard cap per window	2× spike at window edges	Simplest; mention flaw
Sliding window log	Store each request timestamp	Accurate	Memory heavy	Precision when asked
Sliding window counter	Weighted previous + current window	Smooth	Approximate	Common Redis pattern
Token bucket	Tokens refill at rate r, cap B	Allows burst up to B	None at boundary	Default recommendation
Leaky bucket	Constant outflow rate	Smooth output	Queues bursts	Traffic shaping to downstream

6.2 Fixed window spike (worked example)

Limit 100 req/min. Client sends 100 requests at 00:00:59 and 100 at 00:01:00 → 200 in 2 seconds while “complying” with two windows. Sliding window or token bucket fixes this.

6.3 Sliding window counter (Redis)

window = 60 sec, limit = 1000
now = current minute bucket t1, previous bucket t0

count = count_t0 * (1 - elapsed/60) + count_t1
if count >= 1000: deny
else: INCR count_t1, allow

6.4 Token bucket parameters

refill_rate — sustained RPS (e.g. 1000/60 ≈ 16.67 tokens/sec for 1000/min).
capacity — max burst (e.g. 1500 allows short spikes).
cost — expensive endpoints deduct 10 tokens (weighted limits).

6.5 Local cache (hybrid)

Gateway keeps in-memory “spent” counter for 50–100 ms, batching Redis updates. Risk: slight over-admission across pods—acceptable for abuse prevention if global cap still enforced on Redis every N requests or on deny path always hit Redis.

Safer pattern: local cache only for known-good remaining > threshold; always consult Redis when local estimate low.

6.6 Multi-region strategies

Strategy	Consistency	Latency	Notes
Regional Redis only	Per-region budget (1/N of global)	Best	User may get N× limit via VPN regions—often OK
Global Redis (single primary)	Strong global	Cross-region RTT	Hard at scale
Async cross-region sync	Eventual	Low local	Complex; rare in interviews

Interview default: partition budget by region or route user to home region sticky.

6.7 Fail-open vs fail-closed

Policy	When Redis down	Risk
Fail-open	Allow traffic	Outage → abuse, overload backend
Fail-closed	Return 503/429	Outage → product down
Degraded local limit	Conservative per-pod cap	Under-utilization; uneven across pods

Public APIs often fail-closed for paid tiers only after local degrade; internal admin APIs may fail-open with alert.

6.8 Hot key mitigation

Shard per key — Redis cluster hashes key to slot.
Local token pool — allocate sub-budget from Redis every 100 ms for hot partners.
Separate limit tier — dedicated Redis shard for enterprise keys.
Request coalescing — not for limiter (each request must decide)—but batch INCR in side channel for analytics only.

7. Failure points

Failure points are where faults cause over-admission (abuse), false rejects (revenue loss), or cascading outages. Rate limiting is on the critical path—design explicitly for Redis and config failures.

#	Failure point	What breaks	Detection	Mitigation design
FP1	Gateway bypass	Traffic hits backend directly	Spike bypasses 429	mTLS internal only; network policy; library in all entrypoints
FP2	Non-atomic read-modify-write	Two pods INCR without Lua	Limit 1000 but 1500 admitted	Lua / Redis INCR single command / Redlock not needed if single key op atomic
FP3	Redis primary failure	Checks timeout	429/503 spike; latency SLO breach	Replica promotion; circuit breaker; degraded local limit
FP4	Hot key on one shard	Single shard CPU 100%	p99 latency for one API key	Local sub-pool; dedicated shard; raise partner tier hardware
FP5	Stale rule cache	Config lowered limit not applied	Partner exceeds new cap	Versioned config; TTL + pub/sub invalidation
FP6	Clock skew	Window boundaries wrong	Uneven deny rate	Redis TIME; token bucket uses monotonic ms from Redis
FP7	Cross-region double spend	User hits US + EU	2× global budget	Regional quota split; global id in key; sticky routing
FP8	Local cache over-admit	Pods each spend local budget	Aggregate RPS > limit	Sync to Redis before local budget exhausted; conservative factor
FP9	Fail-open misconfiguration	Redis blip → unlimited traffic	Backend DB meltdown	Default fail-closed for public; cap fail-open duration
FP10	429 retry storm	Clients retry immediately	Traffic amplifies 3×	Retry-After header; jittered backoff; don’t count 429 against limit optionally

flowchart LR
  C[Client] -->|FP1| GW[Gateway]
  GW -->|FP8| LC[Local cache]
  LC -->|FP2 FP3| R[(Redis)]
  GW -->|FP5| CFG[Config]
  GW -->|FP7| R2[(Region Redis)]
  C -->|FP10| C
  R -->|FP4| R
  R -->|FP6| R
  GW -->|FP9| API[Backend]

8. Failure modes

8.1 Window boundary spike (fixed window)

Symptom: Traffic doubles for 1 second at minute rollovers.

Cause: Fixed window algorithm without sliding/token bucket.

Safe response: Migrate to token bucket or sliding window counter.

8.2 Race over-admission

Symptom: Partner exceeds contracted RPS during load test.

Cause: FP2 — GET then SET from two gateways.

Safe response: Lua script; single atomic INCR with EXPIRE for fixed window fallback.

8.3 Redis outage → backend overload

Symptom: DB CPU 100%; limiter “fixed” when Redis returns.

Cause: FP9 fail-open.

Safe response: Fail-closed or strict per-pod emergency cap; queue at gateway.

8.4 Redis outage → mass 503

Symptom: All API calls rejected though backend healthy.

Cause: FP3 fail-closed without degrade path.

Safe response: Degraded local limiter; multi-AZ Redis Sentinel/cluster.

8.5 Hot key latency collapse

Symptom: One API key sees 5 s latency; others fine.

Cause: FP4 — viral integration on single Redis slot.

Safe response: Pre-allocated local token pool; move key to dedicated shard.

8.6 Stale limits after downgrade

Symptom: Abusive key not throttled after ops cut limit.

Cause: FP5 — gateway cache TTL 5 min.

Safe response: Push invalidation; include config_version in check path.

8.7 Multi-region budget multiplication

Symptom: Global partner loads 3 regions; 3× traffic accepted.

Cause: FP7 — independent regional counters.

Safe response: Document as design choice or add lightweight global sync counter for enterprise keys.

8.8 False rejects (flapping)

Symptom: Paying customer sees 429 at 80% of true usage.

Cause: FP8 local cache too aggressive; clock skew FP6.

Safe response: Tune burst; use Redis TIME; expose remaining in headers for client backoff.

8.9 Retry amplification

Symptom: Effective RPS 3× after rate limit events.

Cause: FP10 — clients ignore Retry-After.

Safe response: Exponential backoff docs; penalize rapid retries (optional separate counter).

8.10 Thundering herd on Redis recovery

Symptom: Redis comes back; 100k reconnects stall cluster.

Cause: All gateways hammer Redis simultaneously.

Safe response: Jittered reconnect; ramp local caches; circuit breaker half-open probe.

Failure mode	Primary failure points	Impact	Core mitigation
Boundary spike	Algorithm	Backend spike	Token bucket / sliding window
Race over-admission	FP2	Abuse	Lua atomicity
Fail-open overload	FP9, FP3	Outage cascade	Degraded cap / fail-closed
Fail-closed outage	FP3	Product down	HA Redis + local degrade
Hot key	FP4	Latency	Shard / local pool
Stale config	FP5	Policy miss	Invalidation
Region multiplication	FP7	Over quota	Split budget / global key
False rejects	FP6, FP8	Customer pain	Tune cache; Redis TIME
Retry storm	FP10	Amplified load	Retry-After + jitter
Recovery herd	FP3	Redis flap	Circuit breaker ramp

9. Scalability, availability, and security

9.1 Scalability

Redis Cluster with hash tags for related keys if needed: {api_key}:tokens.
Horizontal scale Rate Limiter service pods; all state in Redis.
Pipelining / connection pooling from gateway to Redis.
Separate read-heavy config from write-heavy counters.

9.2 Availability

Redis: primary + replica, automatic failover (Sentinel or managed ElastiCache/Memorystore).
Multi-AZ deployment; limit cross-AZ Redis RTT if co-located with gateway.
Health check: if Redis unhealthy > N sec → switch degrade policy.

9.3 Security

Rate limit by API key + IP for stolen key abuse.
Admin rule APIs behind IAM; audit changes.
Prevent key enumeration: same 429 body for unknown vs known keys on public edge.
DDoS at edge still needs CDN/WAF—rate limiter is Layer 7 per identity.

10. Tradeoffs recap

Decision	Common choice	Why
Algorithm	Token bucket	Burst + smooth refill
Store	Redis + Lua	Fast atomic in-memory ops
Placement	API gateway	Single enforcement point
Multi-region	Regional budget split	Latency vs global accuracy
Redis down	Fail-closed + degrade	Protect backend first

11. How to present this in 45 minutes

5 min — requirements; clarify distributed vs single host; rules per key.
7 min — capacity: 100k checks/sec, Redis cluster, hot keys.
8 min — algorithms table; pick token bucket; show fixed-window flaw.
8 min — architecture: gateway → Redis Lua; config service.
10 min — failure points + fail-open/closed + failure modes.
7 min — multi-region, local cache tradeoff, headers, extensions (weighted cost).

The one line to remember

A distributed rate limiter is a centralized counter with atomic decrements—usually token bucket in Redis + Lua—sitting as close to the edge as possible, with an explicit story for hot keys, multi-region budgets, and whether you fail-open or fail-closed when Redis is unavailable.