Redis is unavailable — fail open or fail closed?
Scenario
A Redis cluster fails over slowly, the network partitions, or an ElastiCache maintenance window is in progress. Apps log RedisConnectionFailureException. Some teams want the site up without cache (fail open to DB); others must reject traffic when sessions or rate limits cannot be enforced (fail closed). The right answer depends on what Redis is doing in each code path, not on one global flag.
After reading, you should be able to:
- Classify Redis usage: optional cache vs required coordination.
- Choose fail-open vs fail-closed per use case with a decision table.
- Implement timeouts, circuit breakers, and DB fallback without stampeding the database.
- Align health checks and runbooks with business risk.
Why — Redis is not always “just cache”
Fail open (degraded mode): on Redis error, skip the cache and continue, usually by hitting the database or returning a default. The service stays available; load and latency may spike.
Fail closed: on Redis error, reject or error the operation. Used when correctness or security requires Redis (session store, rate limit, distributed lock, idempotency token).
Use cases and typical stance
| Redis role | Fail open? | Risk if wrong |
|---|---|---|
| Read-through cache | Often yes (fallback DB) | DB overload — stampede, pool exhausted |
| Session store | Usually no | Users logged out or cross-session bleed if misdesigned |
| Rate limiting | Policy: no (strict) or yes (availability) | Abuse vs outage |
| Distributed lock | No | Double payment, duplicate jobs |
| Idempotency keys | No | Duplicate charges on retry |
| Pub/sub invalidation | Yes with stale TTL backstop | Temporary staleness |
| Feature flags / config cache | Often yes (defaults or DB) | Wrong flag rare if versioned in DB |
Failing open to the DB without limits is a common outage amplifier: Redis goes down, every request hits PostgreSQL, and a cache outage becomes a total outage. Pair fail-open with a circuit breaker, a bulkhead, and request shedding.
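One way to picture the bulkhead and request-shedding part is a semaphore that caps concurrent fallback queries. This is a minimal sketch, not a production implementation: the class name `DbFallbackBulkhead`, the method `tryLoad`, and the permit count are all hypothetical, and a real service would likely use a library bulkhead instead.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper: caps how many fallback queries may hit the database at once. */
final class DbFallbackBulkhead {
    private final Semaphore permits;

    DbFallbackBulkhead(int maxConcurrentFallbacks) {
        this.permits = new Semaphore(maxConcurrentFallbacks);
    }

    /**
     * Runs the loader if a permit is free; otherwise sheds the request.
     * An empty result means "shed" (or that the loader returned null);
     * the caller would answer 503 or serve a default.
     */
    <T> Optional<T> tryLoad(Supplier<T> dbLoader) {
        if (!permits.tryAcquire()) {
            return Optional.empty(); // no waiting: shed immediately instead of queueing
        }
        try {
            return Optional.ofNullable(dbLoader.get());
        } finally {
            permits.release(); // always hand the permit back, even if the loader throws
        }
    }
}
```

Shedding immediately rather than queueing is a deliberate choice in this sketch: queued requests would still hold HTTP workers while the database is already saturated.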
What — assess impact and symptoms
- Confirm Redis is the failure point: app errors (RedisConnectionException, timeouts); Redis PING fails from the pod; cloud console shows failover/maintenance.
- Inventory call sites: grep the codebase for RedisTemplate, Redisson, Spring Data Redis; tag each: cache / session / lock / limiter.
- Metrics: command error rate, latency p99, connection count; correlate with API errors and a DB QPS spike.
- User-visible impact: 500 on all routes vs only slow reads vs auth failures.
- Readiness probe behavior: are pods being removed from the load balancer because the readiness check includes Redis, and is that intentional?
How — implement resilient behavior
1. Short timeouts and circuit breaker
# Lettuce / Spring Boot: fail fast, don't block threads
spring.data.redis.timeout=200ms
# Resilience4j: open circuit after Redis error rate threshold
Threads blocked on Redis tie up HTTP workers, just as a slow database does.
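To make the circuit-breaker idea concrete, here is a hand-rolled sketch that opens after a run of consecutive failures and lets calls through again after a cooldown. It is not the Resilience4j API; `SimpleCircuitBreaker`, its thresholds, and the injected clock are assumptions chosen to keep the example self-contained and testable.

```java
import java.util.function.LongSupplier;

/** Illustrative consecutive-failure circuit breaker (a stand-in, not Resilience4j). */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long cooldownMillis;
    private final LongSupplier clock;      // injected so tests can control time
    private int consecutiveFailures = 0;
    private long openedAt = -1;            // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long cooldownMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
        this.clock = clock;
    }

    /** True if a Redis call may be attempted right now. */
    synchronized boolean allowRequest() {
        if (openedAt < 0) {
            return true;
        }
        // After the cooldown, calls are allowed again as probes. This simple
        // version admits every caller, not a single probe request.
        return clock.getAsLong() - openedAt >= cooldownMillis;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1; // close the circuit
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong(); // open, or re-open after a failed probe
        }
    }
}
```

A caller would check `allowRequest()` before touching Redis, go straight to the fallback path when it returns false, and report the outcome of each attempted call with `recordSuccess()` or `recordFailure()`.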
2. Cache path: fail open with guardrails
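try {
    // Keep every Redis call inside the try: redisGet can also throw when Redis is down.
    Optional<User> cached = redisGet(key);
    if (cached.isPresent()) return cached.get();
    return redisGetOrLoad(key, () -> userRepo.findById(id));
} catch (RedisException e) {
    // Count fallbacks so a DB QPS spike can be correlated with Redis errors.
    meter.increment("redis.fallback.db");
    return userRepo.findById(id); // direct DB, one request, no Redis retry
}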
- Do not retry Redis indefinitely per request.
- When the circuit is open, skip Redis entirely for the cooldown window.
- Cap DB QPS with backpressure: rate limit, cache locally (a short-TTL Caffeine L1), or shed load.
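One common guardrail against a stampede is to collapse concurrent loads of the same key into a single database call. The sketch below shows that idea under stated assumptions: `SingleFlight` and its `load` method are hypothetical names, keys are plain strings, and the loader must not call `load` again for the same key on the same thread (that would wait on itself).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical helper: concurrent loads of one key share a single database call. */
final class SingleFlight<V> {
    private final ConcurrentMap<String, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    V load(String key, Supplier<V> dbLoader) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            // Another thread is already loading this key: wait for its result.
            // A failed load surfaces here as a CompletionException.
            return existing.join();
        }
        try {
            V value = dbLoader.get();
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e); // waiters observe the same failure
            throw e;
        } finally {
            // Remove only our own entry so the next miss triggers a fresh load.
            inFlight.remove(key, mine);
        }
    }
}
```

This only deduplicates loads that overlap in time within one JVM. It does not cache results, so it complements a short-lived local cache rather than replacing it.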
3. Session / auth: usually fail closed
Return 503 with a “try again” message, or route users with sticky sessions to another region whose session store is healthy. Do not invent anonymous sessions. Optional: keep a read-only mode for the public catalog only.
4. Rate limit: explicit product decision
| Policy | When |
|---|---|
| Fail closed | Fraud-sensitive APIs, login, payments |
| Fail open | Internal tools, read-mostly during known Redis maintenance (with approval) |
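The policy table can be encoded so that each limiter states its stance explicitly instead of inheriting a global default. This is a sketch with hypothetical names: `FailurePolicy`, `PolicyAwareRateLimiter`, and the `BooleanSupplier` standing in for a Redis-backed check that throws when Redis is unreachable.

```java
import java.util.function.BooleanSupplier;

/** What to do when the limiter's backing store cannot be reached. */
enum FailurePolicy { FAIL_OPEN, FAIL_CLOSED }

/** Hypothetical wrapper that applies an explicit policy when the Redis check fails. */
final class PolicyAwareRateLimiter {
    // Assumed contract: returns true if the caller is under the limit,
    // and throws a RuntimeException if Redis is unavailable.
    private final BooleanSupplier redisCheck;
    private final FailurePolicy policy;

    PolicyAwareRateLimiter(BooleanSupplier redisCheck, FailurePolicy policy) {
        this.redisCheck = redisCheck;
        this.policy = policy;
    }

    boolean allow() {
        try {
            return redisCheck.getAsBoolean();
        } catch (RuntimeException redisUnavailable) {
            // FAIL_OPEN lets the request through; FAIL_CLOSED rejects it.
            return policy == FailurePolicy.FAIL_OPEN;
        }
    }
}
```

A login or payment endpoint would construct its limiter with `FAIL_CLOSED`, while an internal tool might use `FAIL_OPEN`, which keeps the product decision visible at the call site.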
5. Locks and idempotency: fail closed
If the lock cannot be acquired reliably, do not process the payment; return a retryable error. Never run two writers because a Redis lock disappeared.
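A fail-closed guard around a critical action might look like the sketch below. The `DistributedLock` interface, `RetryableException`, and `PaymentGuard` are all hypothetical names invented for illustration; a real lock client (for example Redisson) has its own API, and this sketch does not address locks that expire mid-operation.

```java
import java.util.function.Supplier;

/** Hypothetical error type the API layer would map to a retryable response. */
final class RetryableException extends RuntimeException {
    RetryableException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical lock abstraction; tryAcquire is assumed to throw if Redis is down. */
interface DistributedLock {
    boolean tryAcquire(String key);
    void release(String key);
}

final class PaymentGuard {
    /** Runs the action only while holding the lock; otherwise fails closed. */
    static <T> T runExclusively(DistributedLock lock, String key, Supplier<T> action) {
        boolean acquired;
        try {
            acquired = lock.tryAcquire(key);
        } catch (RuntimeException lockStoreDown) {
            // Cannot prove exclusivity, so refuse to run the action at all.
            throw new RetryableException("lock store unavailable", lockStoreDown);
        }
        if (!acquired) {
            throw new RetryableException("lock held by another worker", null);
        }
        try {
            return action.get();
        } finally {
            try {
                lock.release(key);
            } catch (RuntimeException ignored) {
                // Assumption: the lock carries a TTL, so it eventually expires on its own.
            }
        }
    }
}
```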
6. Health checks
- Liveness — JVM up; do not include Redis.
- Readiness — include Redis only if pod cannot serve traffic without it (sessions). For cache-only Redis, stay ready and degrade.
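The readiness rule above reduces to a small decision: a pod stays ready during a Redis outage unless it uses Redis in a role it cannot serve traffic without. The sketch below assumes a hypothetical `RedisRole` enum and `ReadinessPolicy` class; which roles count as required is a per-service decision passed in by the caller, not something this example decides.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tags for how a service uses Redis. */
enum RedisRole { CACHE, SESSION_STORE, RATE_LIMITER, LOCK, IDEMPOTENCY }

/** Decides readiness from the roles a pod uses and whether Redis is reachable. */
final class ReadinessPolicy {
    private final Set<RedisRole> requiredForTraffic;

    ReadinessPolicy(Set<RedisRole> requiredForTraffic) {
        this.requiredForTraffic = new HashSet<>(requiredForTraffic);
    }

    boolean isReady(Set<RedisRole> rolesUsedByPod, boolean redisReachable) {
        // Ready if Redis is up, or if none of the pod's roles are required ones.
        return redisReachable || Collections.disjoint(rolesUsedByPod, requiredForTraffic);
    }
}
```

With `SESSION_STORE` as the only required role, a cache-only pod reports ready while Redis is down and degrades, whereas a pod that also stores sessions in Redis is taken out of rotation.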
7. Infrastructure
- Redis Sentinel or cluster with automatic failover; multi-AZ.
- Connection pooling (Lettuce), reasonable max connections per pod.
- Chaos test: block Redis port in staging, observe DB and error budget.
Runbook (incident)
- Confirm scope: all apps or one namespace.
- Enable degraded mode flag if you have one (force DB-only with rate limit).
- Scale DB/read replicas if fail-open is active.
- Restore Redis; the cache comes back cold, so watch for a stampede on recovery and prewarm hot keys.
Interview one-liner
“I classify each Redis use: optional cache can fail open to DB with a circuit breaker and stampede protection; sessions, locks, and idempotency fail closed. Short timeouts, readiness only when Redis is required, and chaos tests prove DB can survive cache loss.”