Redis is unavailable — fail open or fail closed?

Scenario

A Redis cluster fails over slowly, the network partitions, or an ElastiCache maintenance window begins. Apps log RedisConnectionFailureException. Some teams want the site to stay up without the cache (fail open to the DB); others must reject traffic when sessions or rate limits cannot be enforced (fail closed). The right answer depends on what Redis is doing in each code path, not on one global flag.

After reading, you should be able to:

Classify Redis usage: optional cache vs required coordination.
Choose fail-open vs fail-closed per use case with a decision table.
Implement timeouts, circuit breakers, and DB fallback without stampeding the database.
Align health checks and runbooks with business risk.

Why — Redis is not always “just cache”

Fail open (degraded mode): on a Redis error, skip the cache and continue, usually by hitting the database or returning a default. The service stays available; load and latency may spike.

Fail closed: on a Redis error, reject or fail the operation. Use this when correctness or security requires Redis (session store, rate limit, distributed lock, idempotency token).

Use cases and typical stance

Redis role	Fail open?	Risk if wrong
Read-through cache	Often yes (fallback DB)	DB overload — stampede, pool exhausted
Session store	Usually no	Users logged out or cross-session bleed if misdesigned
Rate limiting	Policy: no (strict) or yes (availability)	Abuse vs outage
Distributed lock	No	Double payment, duplicate jobs
Idempotency keys	No	Duplicate charges on retry
Pub/sub invalidation	Yes with stale TTL backstop	Temporary staleness
Feature flags / config cache	Often yes (defaults or DB)	Wrong flag rare if versioned in DB
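
The table can also live in code, so that every Redis call site has to state its stance. The following is a minimal Java sketch under assumptions: the names FailurePolicy, RedisRole, and RedisPolicies are hypothetical, and the two rate-limit roles simply model the strict and availability-first policies from the table.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical names for illustration; not part of any Redis client library.
enum FailurePolicy { FAIL_OPEN, FAIL_CLOSED }

enum RedisRole {
    READ_THROUGH_CACHE, SESSION_STORE, RATE_LIMIT_STRICT, RATE_LIMIT_LENIENT,
    DISTRIBUTED_LOCK, IDEMPOTENCY_KEY, PUBSUB_INVALIDATION, CONFIG_CACHE
}

final class RedisPolicies {
    private static final Map<RedisRole, FailurePolicy> POLICY = new EnumMap<>(RedisRole.class);

    static {
        // Only roles that tolerate a missing Redis are listed as FAIL_OPEN.
        POLICY.put(RedisRole.READ_THROUGH_CACHE, FailurePolicy.FAIL_OPEN);
        POLICY.put(RedisRole.RATE_LIMIT_LENIENT, FailurePolicy.FAIL_OPEN);
        POLICY.put(RedisRole.PUBSUB_INVALIDATION, FailurePolicy.FAIL_OPEN);
        POLICY.put(RedisRole.CONFIG_CACHE, FailurePolicy.FAIL_OPEN);
    }

    // Any role not listed above falls back to FAIL_CLOSED, the safer stance.
    static FailurePolicy policyFor(RedisRole role) {
        return POLICY.getOrDefault(role, FailurePolicy.FAIL_CLOSED);
    }
}
```

Defaulting unlisted roles to fail closed means a newly added Redis use has to be consciously marked as optional before it is allowed to degrade.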

Failing open to the DB without limits is a common outage amplifier: Redis dies, every request hits PostgreSQL, and a cache outage becomes a total outage. Pair fail-open with a circuit breaker, a bulkhead, and request shedding.
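
A bulkhead for the fallback path can be as small as a semaphore. The sketch below assumes a hypothetical DbFallbackBulkhead class built only on java.util.concurrent; in a real service a library bulkhead would normally take its place, and the permit count would come from the database's measured capacity.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper: caps concurrent fallback queries so that a Redis outage
// cannot push unbounded load onto the database.
final class DbFallbackBulkhead {
    private final Semaphore permits;

    DbFallbackBulkhead(int maxConcurrentDbCalls) {
        this.permits = new Semaphore(maxConcurrentDbCalls);
    }

    // Runs the loader if a permit is free. Otherwise the request is shed
    // immediately instead of queueing on the database.
    <T> T load(Supplier<T> dbLoader) {
        if (!permits.tryAcquire()) {
            throw new RejectedExecutionException("DB fallback capacity exhausted");
        }
        try {
            return dbLoader.get();
        } finally {
            permits.release();
        }
    }
}
```

A caller would typically translate RejectedExecutionException into a 503 or a default response, which is the request shedding mentioned above.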

What — assess impact and symptoms

Confirm Redis is the failure point: app errors (Lettuce's RedisConnectionException, Spring's RedisConnectionFailureException, command timeouts); Redis PING fails from the pod; the cloud console shows a failover or maintenance event.
Inventory call sites: grep the codebase for RedisTemplate, Redisson, and Spring Data Redis usages; tag each one as cache / session / lock / limiter.
Metrics: command error rate, p99 latency, connection count; correlate with API errors and any DB QPS spike.
User-visible impact: 500s on all routes vs. only slow reads vs. auth failures.
Readiness probe behavior: are pods being removed from the load balancer because Redis is part of the readiness check, and is that intentional?

How — implement resilient behavior

1. Short timeouts and circuit breaker

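# Lettuce / Spring Boot 3.x: fail fast, don't block threads
# (Spring Boot 2.x uses the spring.redis.* prefix instead)
spring.data.redis.timeout=200ms
spring.data.redis.connect-timeout=200ms
# Resilience4j: open the circuit past a Redis error-rate threshold (instance name "redis" and value are illustrative)
resilience4j.circuitbreaker.instances.redis.failureRateThreshold=50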

Threads blocked on Redis tie up HTTP workers—same as slow DB.
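
To make the breaker's behavior concrete, here is a deliberately simplified sketch in plain Java. It is not Resilience4j: the class SimpleCircuitBreaker is hypothetical, it counts consecutive failures rather than an error rate over a window, and after the cooldown it lets all calls through instead of a limited number of probes. The clock is injected so the behavior can be checked without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified stand-in for a library circuit breaker (hypothetical class).
// Opens after N consecutive failures, then skips Redis until the cooldown ends.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long cooldownMillis;
    private final LongSupplier clock; // injected so tests can control time
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long cooldownMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
        this.clock = clock;
    }

    synchronized boolean allowsCall() {
        if (openedAt < 0) return true;
        // Once the cooldown has elapsed, calls are allowed again to probe Redis.
        return clock.getAsLong() - openedAt >= cooldownMillis;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) openedAt = clock.getAsLong();
    }

    // Runs the Redis call if the circuit allows it. When the circuit is open,
    // or the call throws, the fallback is used instead.
    <T> T call(Supplier<T> redisCall, Supplier<T> fallback) {
        if (!allowsCall()) return fallback.get();
        try {
            T result = redisCall.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }
}
```

While the circuit is open, requests never touch Redis at all, so they do not pay even the 200 ms timeout.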

2. Cache path: fail open with guardrails

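try {
  // Keep every Redis call inside the try so a connection failure is caught here.
  Optional<User> cached = redisGet(key);
  if (cached.isPresent()) return cached.get();
  // Cache miss: load from the DB and populate Redis.
  return redisGetOrLoad(key, () -> userRepo.findById(id));
} catch (RedisException e) {
  // Redis is unavailable: count the fallback and serve this request from the DB.
  meter.increment("redis.fallback.db");
  return userRepo.findById(id);  // direct DB, one request, no Redis retry
}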

Do not retry Redis indefinitely per request.
When the circuit is open, skip Redis entirely for the cooldown window.
Cap DB QPS (backpressure): rate limit, cache locally (a short-TTL Caffeine L1), or shed load.
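
One more guardrail against stampedes is to collapse concurrent loads of the same key into a single database query. The sketch below is one possible shape for that, assuming a hypothetical SingleFlight helper; it does not cache results, it only deduplicates loads that are in flight at the same moment.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical helper: concurrent requests for the same key share one DB query.
final class SingleFlight<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    V load(K key, Supplier<V> dbLoader) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            // Another thread is already loading this key: wait for its result.
            // A failed load surfaces here as a CompletionException.
            return existing.join();
        }
        try {
            V value = dbLoader.get();
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e); // waiters see the same failure
            throw e;
        } finally {
            // Remove only our own entry so the next request loads fresh data.
            inFlight.remove(key, mine);
        }
    }
}
```

With this in front of the fallback path, a hot key that misses on a thousand concurrent requests costs one query rather than a thousand.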

3. Session / auth: usually fail closed

Return a 503 with a “try again” message, or fail over to another region with sticky routing; do not invent anonymous sessions. Optionally, keep a read-only mode for the public catalog only.
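
The difference between "no session" and "session store unreachable" is the part that tends to get lost in code. A minimal sketch, assuming a hypothetical SessionGate class and a sessionLookup function that stands in for a Redis-backed session read and throws a RuntimeException when Redis is down:

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical gate in front of authenticated routes; returns an HTTP status code.
final class SessionGate {
    static int statusFor(String sessionId, Function<String, Optional<String>> sessionLookup) {
        Optional<String> user;
        try {
            user = sessionLookup.apply(sessionId);
        } catch (RuntimeException redisDown) {
            // Fail closed: the caller is neither logged in nor anonymous.
            // 503 tells clients the condition is temporary and retryable.
            return 503;
        }
        // Redis answered: a missing session is a normal "not logged in" case.
        return user.isPresent() ? 200 : 401;
    }
}
```

Returning 401 during an outage would log users out on the client side; returning 200 would invent a session. The 503 keeps both mistakes out of reach.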

4. Rate limit: explicit product decision

Policy	When
Fail closed	Fraud-sensitive APIs, login, payments
Fail open	Internal tools, read-mostly during known Redis maintenance (with approval)
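
Whichever policy is chosen, it helps if the choice is an explicit constructor argument rather than whatever the exception handler happens to do. A small sketch under assumptions: PolicyRateLimiter is a hypothetical wrapper, and the Predicate stands in for a Redis-backed limiter that throws a RuntimeException when Redis is unreachable.

```java
import java.util.function.Predicate;

// Hypothetical wrapper: makes the outage behavior of a Redis-backed limiter explicit.
final class PolicyRateLimiter {
    private final Predicate<String> redisLimiter; // true = request is within the limit
    private final boolean failOpen;               // the product decision, per API

    PolicyRateLimiter(Predicate<String> redisLimiter, boolean failOpen) {
        this.redisLimiter = redisLimiter;
        this.failOpen = failOpen;
    }

    boolean isAllowed(String clientId) {
        try {
            return redisLimiter.test(clientId);
        } catch (RuntimeException redisDown) {
            // Outage: the configured policy decides, not an accidental default.
            return failOpen;
        }
    }
}
```

A login or payment endpoint would be constructed with failOpen set to false, an internal tool with true.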

5. Locks and idempotency: fail closed

If the lock cannot be acquired reliably, do not process the payment; return a retryable error. Never run two writers because a Redis lock disappeared.
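
In code, this means treating "Redis unreachable" as a third lock outcome, separate from "acquired" and "held by someone else". The sketch below assumes hypothetical types (LockResult, PaymentOutcome, PaymentHandler) and a tryLock function standing in for a Redis lock attempt; releasing the lock after the charge is left out for brevity.

```java
import java.util.function.Function;

enum LockResult { ACQUIRED, HELD_BY_OTHER, UNAVAILABLE }
enum PaymentOutcome { PROCESSED, ALREADY_IN_PROGRESS, RETRY_LATER }

// Hypothetical handler: charges only when the lock was positively acquired.
final class PaymentHandler {
    static PaymentOutcome pay(String paymentId,
                              Function<String, LockResult> tryLock,
                              Runnable charge) {
        LockResult lock;
        try {
            lock = tryLock.apply(paymentId);
        } catch (RuntimeException redisDown) {
            // A thrown error is treated the same as an explicit UNAVAILABLE.
            lock = LockResult.UNAVAILABLE;
        }
        switch (lock) {
            case ACQUIRED:
                charge.run();
                return PaymentOutcome.PROCESSED;
            case HELD_BY_OTHER:
                return PaymentOutcome.ALREADY_IN_PROGRESS;
            default:
                // Fail closed: never charge without holding the lock.
                return PaymentOutcome.RETRY_LATER;
        }
    }
}
```

RETRY_LATER would map to a retryable error for the client, so the payment is attempted again once Redis is back rather than risked twice now.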

6. Health checks

Liveness — JVM up; do not include Redis.
Readiness — include Redis only if pod cannot serve traffic without it (sessions). For cache-only Redis, stay ready and degrade.
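
The two probe rules reduce to very little logic. The following is a framework-free sketch with a hypothetical Probes class; redisRequired would be a per-service setting and redisPing a short-timeout PING against Redis. In a Spring Boot service the same decision is usually expressed through Actuator health groups rather than hand-written code.

```java
import java.util.function.BooleanSupplier;

// Hypothetical probe logic, independent of any web framework.
final class Probes {
    // Liveness: only "is the process running". Redis is deliberately not consulted,
    // so a Redis outage never triggers pod restarts.
    static boolean live() {
        return true;
    }

    // Readiness: consult Redis only when this pod cannot serve traffic without it.
    static boolean ready(boolean redisRequired, BooleanSupplier redisPing) {
        if (!redisRequired) {
            return true; // cache-only Redis: stay in rotation and degrade
        }
        try {
            return redisPing.getAsBoolean();
        } catch (RuntimeException redisDown) {
            return false; // session-backed pod: leave the load balancer until Redis returns
        }
    }
}
```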

7. Infrastructure

Redis Sentinel or cluster with automatic failover; multi-AZ.
Connection pooling (Lettuce), reasonable max connections per pod.
Chaos test: block Redis port in staging, observe DB and error budget.

Runbook (incident)

Confirm scope: all apps or one namespace.
Enable degraded mode flag if you have one (force DB-only with rate limit).
Scale DB/read replicas if fail-open is active.
Restore Redis; watch for a stampede on recovery while the cache is cold, and prewarm hot keys.

Interview one-liner

“I classify each Redis use: optional cache can fail open to DB with a circuit breaker and stampede protection; sessions, locks, and idempotency fail closed. Short timeouts, readiness only when Redis is required, and chaos tests prove DB can survive cache loss.”
