Redis is unavailable — fail open or fail closed?

Scenario

The trigger can be a slow Redis cluster failover, a network partition, or an ElastiCache maintenance window. Apps log RedisConnectionFailureException. Some teams want the site to stay up without the cache (fail open to the DB); others must reject traffic when sessions or rate limits cannot be enforced (fail closed). The right answer depends on what Redis is doing in each code path, not on one global flag.

After reading, you should be able to:

Why — Redis is not always “just cache”

Fail open (degraded mode): on a Redis error, skip the cache and continue, usually by hitting the database or returning a default. The service stays available, but load and latency may spike.

Fail closed: on a Redis error, reject or fail the operation. Use this when correctness or security depends on Redis (session store, rate limit, distributed lock, idempotency token).

Use cases and typical stance

| Redis role | Fail open? | Risk if wrong |
| --- | --- | --- |
| Read-through cache | Often yes (fallback DB) | DB overload: stampede, pool exhausted |
| Session store | Usually no | Users logged out or cross-session bleed if misdesigned |
| Rate limiting | Policy: no (strict) or yes (availability) | Abuse vs outage |
| Distributed lock | No | Double payment, duplicate jobs |
| Idempotency keys | No | Duplicate charges on retry |
| Pub/sub invalidation | Yes with stale TTL backstop | Temporary staleness |
| Feature flags / config cache | Often yes (defaults or DB) | Wrong flag rare if versioned in DB |

Failing open to the DB without limits is a common outage amplifier: Redis dies, every request hits PostgreSQL, and a cache outage becomes a total outage. Pair fail-open with a circuit breaker, a bulkhead, and request shedding.
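A bulkhead with request shedding can be sketched in plain Java as below. This is a minimal illustration, not a production implementation: the class name DbFallbackBulkhead and the permit count are assumptions, and a real service might use a library such as Resilience4j instead of a hand-rolled semaphore.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper: caps how many requests may fall back to the database at once. */
final class DbFallbackBulkhead {
    private final Semaphore permits;

    DbFallbackBulkhead(int maxConcurrentFallbacks) {
        this.permits = new Semaphore(maxConcurrentFallbacks);
    }

    /**
     * Runs the database loader if a permit is free. If not, the request is shed
     * immediately with an unchecked exception instead of queueing on the database.
     */
    <T> T call(Supplier<T> dbLoader) {
        if (!permits.tryAcquire()) {
            throw new RejectedExecutionException("DB fallback bulkhead is full");
        }
        try {
            return dbLoader.get();
        } finally {
            permits.release(); // always return the permit, even if the loader throws
        }
    }
}
```

A caller would typically translate RejectedExecutionException into a 503 or a default response, so the database sees at most the configured number of concurrent fallback queries.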

What — assess impact and symptoms

  1. Confirm Redis is the failure point: app errors (RedisConnectionFailureException from Spring Data Redis, RedisConnectionException from Lettuce, timeouts); Redis PING fails from the pod; the cloud console shows a failover or maintenance event.
  2. Inventory call sites: grep the codebase for RedisTemplate, Redisson, and Spring Data Redis; tag each as cache / session / lock / limiter.
  3. Metrics: command error rate, p99 latency, connection count; correlate with API errors and a DB QPS spike.
  4. User-visible impact: 500 on all routes vs. only slow reads vs. auth failures.
  5. Readiness probe behavior: are pods being removed from the load balancer because the readiness check includes Redis? Is that intentional?

How — implement resilient behavior

1. Short timeouts and circuit breaker

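# Lettuce / Spring Boot 3.x: fail fast, don't block threads
# (Spring Boot 2.x uses the spring.redis.* prefix instead)
spring.data.redis.timeout=200ms
spring.data.redis.connect-timeout=200ms
# Resilience4j: open the circuit once the Redis error rate crosses a threshold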

Threads blocked on Redis tie up HTTP workers, just as a slow DB does.
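To make the circuit-breaker idea concrete, here is a hand-rolled sketch in plain Java. It is not the Resilience4j API: it counts consecutive failures rather than an error rate, and the class name, constructor parameters, and injectable clock are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Sketch: opens after N consecutive failures, lets calls through again after a cool-down. */
final class RedisCircuitBreaker {
    private final int failureThreshold;
    private final long coolDownMillis;
    private final LongSupplier clock; // time source, e.g. System::currentTimeMillis
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0;

    RedisCircuitBreaker(int failureThreshold, long coolDownMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
        this.clock = clock;
    }

    /** Runs redisCall unless the circuit is open; on an open circuit or a failure, runs fallback. */
    <T> T call(Supplier<T> redisCall, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // skip Redis entirely, so no timeout is paid
        }
        try {
            T value = redisCall.get();
            recordSuccess();
            return value;
        } catch (RuntimeException redisFailure) {
            // Simplification: any runtime exception from redisCall counts as a Redis failure.
            recordFailure();
            return fallback.get();
        }
    }

    private synchronized boolean isOpen() {
        if (open && clock.getAsLong() - openedAt >= coolDownMillis) {
            open = false;
            consecutiveFailures = failureThreshold - 1; // a single new failure re-opens
        }
        return open;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void recordFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong();
        }
    }
}
```

While the circuit is open, requests go straight to the fallback without waiting on a Redis timeout, which is what frees the HTTP worker threads. Unlike a full half-open state, this sketch lets all callers through after the cool-down rather than a single trial call.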

2. Cache path: fail open with guardrails

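// Every Redis call sits inside the try, so any Redis failure reaches the fallback.
// redisGetOrLoad returns the cached value, or runs the loader and caches its result on a miss.
try {
  return redisGetOrLoad(key, () -> userRepo.findById(id));
} catch (RedisException e) {
  meter.increment("redis.fallback.db");  // count fallbacks so the extra DB load is visible
  return userRepo.findById(id);          // direct DB read, once per request, no retry loop
}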
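One possible guardrail for the fallback path is to coalesce concurrent loads of the same key, so a hot key costs one database query instead of one per waiting request. The sketch below is an assumption about how that could look; SingleFlight is a hypothetical helper, not part of Spring or Lettuce.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical helper: concurrent loads of one key share a single database call. */
final class SingleFlight<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    V load(K key, Supplier<V> dbLoader) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            // Another thread is already loading this key: wait for its result.
            // join() rethrows a loader failure wrapped in CompletionException.
            return existing.join();
        }
        try {
            V value = dbLoader.get();
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e); // waiting threads see the failure too
            throw e;
        } finally {
            inFlight.remove(key, mine); // later calls start a fresh load
        }
    }
}
```

This does not cache anything: once a load finishes, the next request for that key queries the database again. It only bounds the number of simultaneous queries per key, which is the stampede case that hurts when Redis is down.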

3. Session / auth: usually fail closed

Return 503 with a “try again” message, or keep users on sticky sessions to another region. Do not invent anonymous sessions. Optionally, offer a read-only mode for the public catalog only.
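The key distinction is between "the store answered and there is no such session" and "the store did not answer". A minimal sketch, assuming a hypothetical SessionGate that wraps whatever session lookup the application uses:

```java
import java.util.Optional;
import java.util.function.Function;

/** Outcome of an auth check: an HTTP status, plus the user id when authenticated. */
final class AuthResult {
    final int status;
    final String userId; // null unless status is 200

    AuthResult(int status, String userId) {
        this.status = status;
        this.userId = userId;
    }
}

/** Hypothetical gate in front of a Redis-backed session store. */
final class SessionGate {
    // Maps a session id to a user id; assumed to throw a RuntimeException on Redis errors.
    private final Function<String, Optional<String>> sessionLookup;

    SessionGate(Function<String, Optional<String>> sessionLookup) {
        this.sessionLookup = sessionLookup;
    }

    AuthResult authenticate(String sessionId) {
        Optional<String> userId;
        try {
            userId = sessionLookup.apply(sessionId);
        } catch (RuntimeException redisFailure) {
            // Store did not answer: fail closed with a retryable status.
            // Never treat this as "anonymous" or "logged out".
            return new AuthResult(503, null);
        }
        if (userId.isPresent()) {
            return new AuthResult(200, userId.get());
        }
        return new AuthResult(401, null); // store answered: session does not exist
    }
}
```

Returning 401 on a Redis error would log users out during an outage; returning 200 with a guessed identity would be worse. The 503 keeps the client's session cookie valid for a retry.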

4. Rate limit: explicit product decision

| Policy | When |
| --- | --- |
| Fail closed | Fraud-sensitive APIs, login, payments |
| Fail open | Internal tools, read-mostly during known Redis maintenance (with approval) |
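One way to keep this decision explicit in code is to make the policy a constructor argument of the limiter wrapper, so each route declares its stance. The names OnLimiterError and PolicyAwareLimiter below are illustrative assumptions, and the underlying limiter is represented by a plain predicate.

```java
import java.util.function.Predicate;

/** What to do when the Redis-backed limiter cannot be consulted. */
enum OnLimiterError { FAIL_OPEN, FAIL_CLOSED }

/** Hypothetical wrapper that applies a per-route policy when the limiter itself fails. */
final class PolicyAwareLimiter {
    // Returns true if the client is under its limit; assumed to throw on Redis errors.
    private final Predicate<String> redisLimiter;
    private final OnLimiterError policy;

    PolicyAwareLimiter(Predicate<String> redisLimiter, OnLimiterError policy) {
        this.redisLimiter = redisLimiter;
        this.policy = policy;
    }

    boolean allow(String clientKey) {
        try {
            return redisLimiter.test(clientKey);
        } catch (RuntimeException redisFailure) {
            // Limiter state is unknown: apply the product decision for this route.
            return policy == OnLimiterError.FAIL_OPEN;
        }
    }
}
```

A login route would be wired with FAIL_CLOSED and an internal dashboard with FAIL_OPEN, which makes the choice reviewable instead of an accident of exception handling.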

5. Locks and idempotency: fail closed

If the lock cannot be acquired reliably, do not process the payment; return a retryable error. Never run two writers because a Redis lock disappeared.
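A minimal sketch of the fail-closed lock path follows. LockClient is a hypothetical interface standing in for whatever lock implementation is in use (for example one built on Redisson or SET NX PX), and the sketch assumes the lock carries a TTL. It does not cover a lock expiring while the charge is still running.

```java
/** Hypothetical lock client; a real one might wrap Redisson or a SET NX PX command. */
interface LockClient {
    boolean tryLock(String key); // assumed to throw a RuntimeException if Redis is unreachable
    void unlock(String key);
}

enum PaymentOutcome { PROCESSED, RETRY_LATER }

final class PaymentProcessor {
    private final LockClient locks;

    PaymentProcessor(LockClient locks) {
        this.locks = locks;
    }

    PaymentOutcome pay(String paymentId, Runnable charge) {
        String lockKey = "lock:payment:" + paymentId;
        boolean acquired;
        try {
            acquired = locks.tryLock(lockKey);
        } catch (RuntimeException redisFailure) {
            return PaymentOutcome.RETRY_LATER; // lock state unknown: do not charge
        }
        if (!acquired) {
            return PaymentOutcome.RETRY_LATER; // another worker holds the lock
        }
        try {
            charge.run();
            return PaymentOutcome.PROCESSED;
        } finally {
            try {
                locks.unlock(lockKey);
            } catch (RuntimeException ignored) {
                // Unlock failed; this sketch relies on the lock's TTL to expire it.
            }
        }
    }
}
```

The important property is that a Redis error and a failed acquisition take the same path: the charge never runs unless the lock was positively acquired.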

6. Health checks

7. Infrastructure

Runbook (incident)

  1. Confirm scope: all apps or one namespace.
  2. Enable the degraded-mode flag if you have one (force DB-only reads behind a rate limit).
  3. Scale up the DB or its read replicas if fail-open is active.
  4. Restore Redis; watch for a stampede on recovery while the cache is cold, and prewarm hot keys.

Interview one-liner

“I classify each Redis use: optional cache can fail open to DB with a circuit breaker and stampede protection; sessions, locks, and idempotency fail closed. Short timeouts, readiness only when Redis is required, and chaos tests prove DB can survive cache loss.”
