Clients retry on 429 and make the outage worse

Scenario

You rate-limit to protect the API. Clients receive 429 Too Many Requests, immediately retry—often in sync—and effective load becomes 2–10× intended RPS. The service never recovers; everyone sees errors. This retry storm is a classic metastable failure, related to timeout cascades and brief unresponsive windows.

After reading, you should be able to:

Why — retries turn protection into overload

Rate limiting protects your service by rejecting excess requests quickly and cheaply. If every rejected client retries immediately, and many clients share the same backoff schedule, traffic stays above capacity. 429 means “slow down”; ignoring it defeats the limiter. Mobile apps, SDKs, gateways, and batch jobs may all retry unless explicitly disciplined.
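To make the multiplier concrete, here is a rough steady-state model rather than a measurement. It assumes the limiter admits exactly `capacity` requests per second, rejections fall evenly on all attempts, and every rejected attempt is retried immediately up to `max_retries` times. The function name and parameters are illustrative, not from any library.

```python
def offered_load(demand, capacity, max_retries, iterations=200):
    """Estimate steady-state attempts/sec when rejected requests retry at once.

    demand      -- fresh requests per second (before retries)
    capacity    -- requests per second the limiter admits
    max_retries -- immediate retries each client makes after a rejection
    """
    load = float(demand)
    for _ in range(iterations):
        # Fraction of attempts rejected at the current offered load.
        reject = max(0.0, 1.0 - capacity / load)
        # Expected attempts per request: 1 + p + p^2 + ... + p^max_retries.
        attempts = sum(reject ** i for i in range(max_retries + 1))
        load = demand * attempts
    return load
```

Under these assumptions, a demand of 1,200 RPS against a 1,000 RPS limit with three immediate retries settles at roughly 2,770 attempts per second, about 2.8× the limit, even though fresh demand is only 20% over.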

429 vs 503 (client behavior)

Code | Meaning | Typical client action
429 | Rate limit / quota | Backoff per Retry-After
503 | Unavailable / overload | Retry with backoff (risk storm)
500 | Server error | Retry only if idempotent

What — detect a retry storm

  1. 429 rate is high but the number of unique clients is moderate: the same clients are hammering.
  2. Gateway access logs: the same client_id or IP sends many requests per second after receiving 429s.
  3. RPS stays above sustainable capacity even after limiting is enabled: the excess is the retry multiplier.
  4. Retry headers: Retry-After is ignored; fixed-interval retries are visible in client metrics.
  5. Downstream sync retries: a job scheduler retries all failed rows at once after a batch 429.
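The access-log signal can be checked with a small script. The sketch below assumes log records have already been parsed into `(timestamp_seconds, client_id, status)` tuples sorted by time; that record shape, the function name, and the default thresholds are hypothetical and would need adapting to your gateway's log format.

```python
from collections import defaultdict

def find_hammering_clients(records, window_s=1.0, threshold=5):
    """Return {client_id: count} for clients that keep sending after a 429.

    A request counts as a follow-up if it arrives within window_s seconds
    of that client's most recent 429.
    """
    last_429 = {}                   # client_id -> timestamp of latest 429
    follow_ups = defaultdict(int)   # client_id -> requests sent right after a 429
    for ts, client_id, status in records:
        seen = last_429.get(client_id)
        if seen is not None and ts - seen <= window_s:
            follow_ups[client_id] += 1
        if status == 429:
            last_429[client_id] = ts
    return {c: n for c, n in follow_ups.items() if n >= threshold}
```

A client that waits out its Retry-After never shows up here; one that retries every 100 ms crosses the threshold within a second.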

How — server and client design

Server: rate limit correctly

HTTP/1.1 429 Too Many Requests
Retry-After: 2
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1716123456

Implement the limit at a gateway (Kong, Envoy), with a Redis token bucket, or in the app with Bucket4j. Whichever you choose, the rejection path should fail fast and stay cheap.
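For readers unfamiliar with the token bucket, the following is a minimal single-process sketch, not the Kong, Envoy, Redis, or Bucket4j implementation. The class and method names are made up for illustration. It returns a whole-second wait that could be sent as the Retry-After value.

```python
import math
import time

class TokenBucket:
    """In-memory token bucket; one instance per client or tenant (assumption)."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s      # tokens added per second
        self.burst = burst          # maximum tokens the bucket holds
        self.clock = clock          # injectable for tests
        self.tokens = float(burst)
        self.updated = clock()

    def try_acquire(self):
        """Return (allowed, retry_after_seconds)."""
        now = self.clock()
        # Refill for the elapsed time, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0
        # Whole seconds until one token is available, rounded up.
        return False, math.ceil((1 - self.tokens) / self.rate)
```

When `try_acquire()` returns `(False, n)`, the handler would respond with 429 and `Retry-After: n` without touching any backend, which keeps the rejection path cheap.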

Client: retry policy (required)

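# Python sketch. send() is a hypothetical callable returning a response with .status and .headers.
import random
import time

def call_with_retry(send, idempotent, max_attempts=3):
    for attempt in range(max_attempts):
        response = send()
        out_of_attempts = attempt == max_attempts - 1
        if response.status == 429 and not out_of_attempts:
            # Retry-After in delta-seconds form only; capped at 60s.
            wait = min(float(response.headers.get("Retry-After", 1)), 60)
            wait += random.uniform(0, wait * 0.2)  # jitter desynchronizes clients
        elif response.status in (500, 502, 503, 504) and idempotent and not out_of_attempts:
            # Exponential backoff with full jitter: base 1s, max 30s.
            wait = random.uniform(0, min(30, 2 ** attempt))
        else:
            return response  # success, non-retryable error, or attempts exhausted
        time.sleep(wait)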
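Retry-After may carry either a number of seconds or an HTTP-date, so the parsing step deserves some care. The helper below is one possible sketch; unlike the call in the policy above, it takes the raw header value rather than the response object. The `default` and `cap` values are assumptions, not part of any standard.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, now=None, default=1.0, cap=60.0):
    """Return seconds to wait, clamped to [0, cap]; default if unparseable."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        seconds = float(value)                    # delta-seconds form
    else:
        try:
            when = parsedate_to_datetime(value)   # HTTP-date form
        except (TypeError, ValueError):
            return default
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        seconds = (when - now).total_seconds()
    return min(max(seconds, 0.0), cap)
```

Clamping to zero handles dates already in the past, and the cap keeps a misconfigured server from parking a client for hours.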

Rules of thumb

Rule | Why
Exponential backoff + jitter | Desynchronize clients
Cap max retries | Stop infinite loops
Respect Retry-After | Server tells you minimum wait
Idempotency-Key on POST retries | idempotency guide
Circuit breaker after repeated 429 | circuit breaker
No parallel retry fan-out | One retry per logical operation
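The circuit-breaker rule can be sketched in a few lines. This is a simplified illustration under assumed defaults (five consecutive 429s, 30-second cooldown), with hypothetical names; it omits the half-open probing that production breakers usually add.

```python
import time

class RateLimitBreaker:
    """Stop sending for a cooldown period after repeated 429 responses."""

    def __init__(self, threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.threshold = threshold          # consecutive 429s that open the circuit
        self.cooldown_s = cooldown_s        # how long the circuit stays open
        self.clock = clock                  # injectable for tests
        self.consecutive_429 = 0
        self.open_until = float("-inf")

    def allow_request(self):
        """True when the circuit is closed and a request may be sent."""
        return self.clock() >= self.open_until

    def record(self, status):
        """Feed every response status back into the breaker."""
        if status == 429:
            self.consecutive_429 += 1
            if self.consecutive_429 >= self.threshold:
                self.open_until = self.clock() + self.cooldown_s
                self.consecutive_429 = 0
        else:
            self.consecutive_429 = 0
```

The caller checks `allow_request()` before each send and fails locally while the circuit is open, so a throttled client stops contributing load instead of queuing more retries.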

SDK / platform guidance

Incident response

  1. Identify top retrying clients; contact owners to disable or fix backoff.
  2. Temporary stricter limits at edge; shed non-critical routes.
  3. Scale only if capacity is the issue—not if retries are the multiplier.

Verify

  1. Load test clients with a bad retry policy against clients with a good one; the well-behaved group should stay under the limit.
  2. The 429 rate drops once the clients involved in the incident back off.
  3. Metrics: retry_count per client ID stays bounded.
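Before running a real load test, the bad-versus-good comparison can be approximated with a toy simulation. The sketch below is an assumption-heavy model: time advances in whole ticks, the server admits `limit` requests per tick, every client needs exactly one success, and all clients start at tick 0. The function and the two policies are illustrative only.

```python
import random

def simulate(clients, limit, next_wait, max_ticks=10_000):
    """Return the total number of 429s issued until every client succeeds.

    next_wait(rejections) -> ticks (>= 1) a client waits before its next attempt.
    """
    due = {c: 0 for c in range(clients)}          # client -> tick of next attempt
    rejections = {c: 0 for c in range(clients)}   # client -> 429s received so far
    total_429 = 0
    for tick in range(max_ticks):
        if not due:
            break
        attempting = [c for c, t in due.items() if t == tick]
        for i, c in enumerate(attempting):
            if i < limit:
                del due[c]                        # admitted: client is done
            else:
                rejections[c] += 1
                total_429 += 1
                due[c] = tick + next_wait(rejections[c])
    return total_429

rng = random.Random(42)
# Bad policy: retry on the very next tick, every time.
bad = simulate(100, 10, lambda n: 1)
# Better policy: exponential backoff with jitter, capped at 32 ticks.
good = simulate(100, 10, lambda n: rng.randint(1, min(32, 2 ** n)))
```

With 100 clients and a limit of 10 per tick, the immediate-retry policy produces 450 rejections (90 + 80 + ... + 10). The backoff policy takes longer to drain but should generate far fewer 429s, which is the pattern to look for in the real load test.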

Interview one-liner

“On the server I return 429 with Retry-After and per-tenant limits; on the client I use exponential backoff with jitter, cap retries, only retry idempotent calls, and open a circuit after sustained 429—so rate limiting protects the service instead of triggering a retry storm.”

Related scenarios