Resilience Patterns

Why resilience is not optional

A microservice architecture multiplies failure modes. Without deliberate defenses, a 500 ms slowdown in Recommendations becomes a thread-pool exhaustion incident in Checkout.

Cascading failure follows a predictable path: a dependency slows → callers block waiting → their thread pools fill → health checks fail → load balancers eject instances that are starved rather than broken → remaining instances take more load → the system enters a death spiral. Netflix popularized much of the modern vocabulary (timeouts, circuit breakers, bulkheads) through Hystrix after learning this the hard way on AWS.

Resilience patterns do not prevent failures—they bound them. You decide how long to wait, how many times to retry, when to stop calling a sick dependency entirely, and what degraded experience to show users instead of a blank 500 page. These policies belong in client libraries, service mesh, and API gateway layers; domain code should declare intent, not rediscover TCP semantics.

flowchart LR
  U[Users] --> A[Checkout]
  A --> B[Payment]
  A --> C[Inventory]
  A --> D[Recommendations]
  D -. slow .-> A
  A -. threads blocked .-> U

🎯 Interview Tip

When designing a system, name failure scenarios explicitly: “If Recommendations is down, the product page loads without the recommendations block.” Interviewers want layered defenses (timeout at client, breaker at gateway, cache fallback at BFF), not a single magic library.

Timeout — stop waiting forever

Every outbound call must have a deadline. Without one, threads block until the OS socket times out—often minutes—while users stare at spinners and your pool starves.

Timeouts cap how long a caller waits for a response. They convert “hang indefinitely” into a controlled failure you can retry, count toward a circuit breaker, or answer with a fallback. The hard part is choosing values: too aggressive and you fail healthy services under normal tail latency; too lenient and you absorb cascading delay.

Setting timeouts in practice

Start from SLO p99 latency of the dependency plus network margin—not from “30 seconds feels safe.”
Keep the end-to-end user timeout greater than the sum of the service timeouts on the sequential critical path, or users see errors while backends are still working.
Use separate connect vs read timeouts: DNS/TLS stalls differ from slow business logic.
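Propagate deadlines with gRPC deadlines (carried in the grpc-timeout header) or, for plain HTTP, a custom deadline header agreed between services; HTTP defines no standard request-timeout header.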
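
The budgeting rules above can be sketched with plain JDK types. This is a minimal illustration, assuming a sequential critical path; DeadlineBudget and its method names are hypothetical and not part of any library. Each hop uses its configured timeout unless less time than that remains on the caller's deadline.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: caps each hop's timeout by the time left on the caller's deadline. */
final class DeadlineBudget {
    private final Instant deadline;

    DeadlineBudget(Instant start, Duration endToEnd) {
        this.deadline = start.plus(endToEnd);
    }

    /** Time left at 'now'; never negative. */
    Duration remaining(Instant now) {
        Duration left = Duration.between(now, deadline);
        return left.isNegative() ? Duration.ZERO : left;
    }

    /** Timeout for the next call: the configured value, or less if the budget is nearly spent. */
    Duration timeoutFor(Duration configured, Instant now) {
        Duration left = remaining(now);
        return configured.compareTo(left) <= 0 ? configured : left;
    }
}
```

With an 800 ms budget and 400 ms already spent, a hop configured for 2 s would be given only the remaining 400 ms.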

Client	Configuration
WebClient	HttpClient.responseTimeout(Duration), Reactor .timeout(Duration)
RestTemplate	SimpleClientHttpRequestFactory.setConnectTimeout / setReadTimeout
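OpenFeign	spring.cloud.openfeign.client.config.default.connectTimeout / readTimeout (feign.client.config.* before Spring Cloud 2022.0)
gRPC	stub.withDeadlineAfter(2, TimeUnit.SECONDS), or a client interceptor that applies a default deadline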

resilience4j.timelimiter:
  instances:
    inventoryClient:
      timeoutDuration: 2s
      cancelRunningFuture: true

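# Spring Cloud OpenFeign 4.x (Spring Boot 3); older releases use the feign.client.config prefix
spring:
  cloud:
    openfeign:
      client:
        config:
          inventoryClient:
            connectTimeout: 500   # milliseconds
            readTimeout: 2000     # milliseconds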
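
For comparison, here is a sketch of the same connect/read split using the JDK's built-in java.net.http client rather than any of the Spring clients in the table. InventoryHttp and the inventory.example URL are placeholders invented for illustration; the 500 ms and 2 s values mirror the configuration above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class InventoryHttp {
    /** The connect timeout bounds connection establishment only. */
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(500))
                .build();
    }

    /** The request timeout bounds the wait for the response; the URL is a placeholder. */
    static HttpRequest stockRequest(String sku) {
        return HttpRequest.newBuilder(URI.create("http://inventory.example/stock/" + sku))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
    }
}
```

If the request timeout elapses, client.send(...) fails with HttpTimeoutException, which the caller can treat as the controlled failure described earlier.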

⚠️ Pitfall

A client-side timeout does not stop the server: it may still finish the request after the caller has given up, so a retry can duplicate side effects if the operation is not idempotent. Pair timeouts with idempotency keys on POST or use outbox/saga for multi-step writes.

Retry — transient failures deserve another chance

Networks glitch; pods restart mid-request. A bounded retry with backoff turns occasional blips into invisible success—if and only if retries are safe and coordinated.

Retries re-issue failed calls when failure looks temporary: connection reset, HTTP 503, gRPC UNAVAILABLE. Blind retries on every error amplify load during outages—the classic retry storm that keeps a recovering service down.

Exponential backoff with jitter

Wait time grows exponentially between attempts: 100 ms → 200 ms → 400 ms, capped at a max. Jitter randomizes delay within a range so thousands of clients do not retry in sync (thundering herd). Formula (full jitter): sleep = random(0, min(cap, base * 2^attempt)).
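
The full-jitter formula can be written directly in Java. This is a minimal sketch; the Backoff class name is an assumption, the attempt counter is taken to be zero-based, and the shift is clamped so that realistic base values cannot overflow.

```java
import java.util.concurrent.ThreadLocalRandom;

final class Backoff {
    /**
     * Full jitter: a uniform random delay in [0, min(cap, base * 2^attempt)].
     * 'attempt' is zero-based: 0 for the wait before the first retry.
     */
    static long fullJitterMillis(int attempt, long baseMillis, long capMillis) {
        int shift = Math.min(Math.max(attempt, 0), 30);             // clamp to avoid overflow
        long ceiling = Math.min(capMillis, baseMillis * (1L << shift));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);   // upper bound is exclusive
    }
}
```

With base 100 ms and cap 5 s, the ceilings for successive attempts are 100, 200, 400, 800 ms and so on until they reach the cap; the actual sleep is drawn uniformly below each ceiling.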

Idempotency requirement

A retry-safe operation produces the same effect whether executed once or five times. GET and PUT are naturally idempotent; POST creating resources needs an Idempotency-Key header stored server-side. Payment captures and inventory reservations must deduplicate on business key, not HTTP method alone.
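
Server-side deduplication on an idempotency key might look like the sketch below. IdempotencyStore is a hypothetical in-memory stand-in; a production store would need to be durable, shared across instances, and expire old keys.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical in-memory dedup store; a real one would be durable and expire keys. */
final class IdempotencyStore<R> {
    private final ConcurrentMap<String, R> results = new ConcurrentHashMap<>();

    /**
     * Runs 'operation' at most once per key and replays the stored result afterwards.
     * If the operation throws, nothing is stored and a later retry runs it again.
     */
    R execute(String idempotencyKey, Supplier<R> operation) {
        return results.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

A retried POST carrying the same key then receives the original result instead of triggering a second charge.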

What NOT to retry

4xx client errors (400, 404, 422)—repeating will not fix bad input.
429 Too Many Requests: retry only if you honor Retry-After and back off aggressively.
Non-idempotent POST without deduplication—risk double charge.
Timeouts where server may have succeeded—prefer query status endpoint before blind retry.
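
The rules above can be condensed into a single decision function. This is a sketch, not a library API: RetryPolicy and Decision are hypothetical names, and treating 502 and 504 like 503 is an assumption that goes beyond the list.

```java
import java.util.OptionalLong;

final class RetryPolicy {
    enum Decision { RETRY, RETRY_AFTER_HINT, DO_NOT_RETRY }

    /**
     * @param status            HTTP status of the failed attempt
     * @param safeToRepeat      true if the call is idempotent or carries an idempotency key
     * @param retryAfterSeconds value of the Retry-After header, if the server sent one
     */
    static Decision decide(int status, boolean safeToRepeat, OptionalLong retryAfterSeconds) {
        if (!safeToRepeat) return Decision.DO_NOT_RETRY;                 // risk of duplicate side effects
        if (status == 429) {                                             // only with a server-provided hint
            return retryAfterSeconds.isPresent() ? Decision.RETRY_AFTER_HINT : Decision.DO_NOT_RETRY;
        }
        if (status >= 400 && status < 500) return Decision.DO_NOT_RETRY; // bad input stays bad
        if (status == 502 || status == 503 || status == 504) return Decision.RETRY; // assumed transient
        return Decision.DO_NOT_RETRY;
    }
}
```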

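// Assumes PaymentRequest carries an idempotency key, so a repeated charge is deduplicated server-side.
// payFallback(PaymentRequest, Throwable) must exist in the same class; it runs once retries are exhausted.
@Retry(name = "paymentClient", fallbackMethod = "payFallback")
public PaymentResult charge(PaymentRequest req) {
    return paymentClient.charge(req);
}

// application.yml
// resilience4j.retry.instances.paymentClient:
//   maxAttempts: 3                     # total attempts, including the first call
//   waitDuration: 200ms
//   enableExponentialBackoff: true     # without this the multiplier is ignored
//   exponentialBackoffMultiplier: 2
//   enableRandomizedWait: true
//   retryExceptions: [java.io.IOException, org.springframework.web.client.HttpServerErrorException]
//   ignoreExceptions: [org.springframework.web.client.HttpClientErrorException]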

Spring Retry (@Retryable, @Recover) is still common in existing codebases; many newer Spring Cloud projects choose Resilience4j for unified metrics and composition with circuit breakers.

🔬 Under the Hood

Retries multiply traffic: 3 attempts × 1000 RPS = up to 3000 RPS hitting a dependency. Combine with circuit breaker and rate limiter at the edge during incidents.

Circuit breaker — fail fast when dependency is sick

Like an electrical breaker, stop sending current to a faulted line. Callers immediately get failure or fallback while the dependency recovers—and you probe occasionally to see if it healed.

States

CLOSED — normal operation; failures are counted.
OPEN — calls fail immediately (or invoke fallback); no load on dependency.
HALF-OPEN — limited trial calls; success closes breaker, failure reopens.

stateDiagram-v2
  direction LR
  [*] --> Closed
  Closed --> Open: failure rate exceeds threshold
  Open --> HalfOpen: wait interval elapsed
  HalfOpen --> Closed: probe calls succeed
  HalfOpen --> Open: probe calls fail
  Closed --> Closed: calls succeed
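
The transitions in the diagram can be expressed as a small state machine. The following is a single-threaded sketch for illustration only, not how Resilience4j is implemented: SimpleBreaker is a hypothetical class, it uses a count-based window, and one probe call decides the HALF_OPEN outcome.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Single-threaded sketch of the three states; not a substitute for Resilience4j. */
final class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;            // last N calls considered
    private final double failureThreshold;   // e.g. 0.5 = 50 %
    private final long openMillis;           // wait before probing
    private final LongSupplier clock;        // millisecond clock, injectable for tests
    private final Deque<Boolean> window = new ArrayDeque<>();  // true = failure
    private State state = State.CLOSED;
    private long openedAt;

    SimpleBreaker(int windowSize, double failureThreshold, long openMillis, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True if the caller may proceed; moves OPEN to HALF_OPEN once the wait has elapsed. */
    boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state != State.OPEN;
    }

    void recordSuccess() {
        if (state == State.HALF_OPEN) { close(); return; }   // probe succeeded
        record(false);
    }

    void recordFailure() {
        if (state == State.HALF_OPEN) { open(); return; }    // probe failed
        record(true);
    }

    State state() { return state; }

    private void record(boolean failure) {
        if (state != State.CLOSED) return;                   // ignore late results while OPEN
        window.addLast(failure);
        if (window.size() > windowSize) window.removeFirst();
        if (window.size() < windowSize) return;              // wait for a full sample
        long failures = window.stream().filter(f -> f).count();
        if ((double) failures / windowSize >= failureThreshold) open();
    }

    private void open() { state = State.OPEN; openedAt = clock.getAsLong(); window.clear(); }
    private void close() { state = State.CLOSED; window.clear(); }
}
```

Callers check allowRequest() before the remote call and report the outcome afterwards; when it returns false they go straight to the fallback.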

Tuning metrics

Parameter	Purpose
failureRateThreshold	Percent failures in window that trips OPEN (e.g. 50%)
slowCallRateThreshold	Percent of slow calls in the window that also trips OPEN, for latency-sensitive dependencies
slowCallDurationThreshold	Definition of “slow” (e.g. > 2s)
minimumNumberOfCalls	Minimum calls in the window before the rate is evaluated, so one early failure cannot open the breaker
slidingWindowType / slidingWindowSize	COUNT_BASED (last N calls) vs TIME_BASED (last N seconds)
waitDurationInOpenState	How long the breaker stays OPEN before HALF-OPEN probes
permittedNumberOfCallsInHalfOpenState	Number of trial calls allowed while HALF-OPEN

@CircuitBreaker(name = "inventory", fallbackMethod = "defaultStock")
public StockLevel getStock(String sku) {
    return inventoryClient.fetch(sku);
}

private StockLevel defaultStock(String sku, Throwable t) {
    return StockLevel.unknown(sku);
}

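Register an onStateTransition consumer on each breaker's event publisher (breakers are looked up through the CircuitBreakerRegistry) to log state transitions, and export breaker metrics to Micrometer. Dashboards should show OPEN duration: a breaker that stays OPEN for a long time points to a dependency or config problem, not a “breaker doing its job” forever.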

📦 Real World

Netflix Hystrix dashboards were the ops center of gravity during incidents. Modern stacks use Resilience4j metrics in Grafana plus Istio outlier detection for L7 passive health ejection.

Bulkhead — isolate resource pools

Ship compartments limit flooding to one section. Bulkheads cap threads or concurrent calls per dependency so one slow API cannot consume the entire servlet container.

Without bulkheads, all outbound calls share one thread pool (or, in reactive stacks, one set of event-loop threads that a blocking call can stall). When Recommendations hangs, every thread waits on Recommendations and Catalog queries starve, even though Catalog is healthy.

Thread pool bulkhead

Dedicated executor per dependency or per subsystem. Calls submit work to the pool; when queue is full, fail fast. Resilience4j ThreadPoolBulkhead integrates with CompletableFuture-style APIs. Cost: thread overhead—size pools deliberately, not “max threads = 200” everywhere.
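
The same idea can be sketched with a plain JDK ThreadPoolExecutor instead of Resilience4j's ThreadPoolBulkhead. ThreadPoolBulkheads and forDependency are hypothetical names; the point is the bounded queue combined with a rejecting policy.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class ThreadPoolBulkheads {
    /**
     * A dedicated, bounded executor for one dependency. When every thread is busy and the
     * queue is full, submit() throws RejectedExecutionException instead of queueing forever.
     */
    static ExecutorService forDependency(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,                          // fixed-size pool
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // bounded queue
                new ThreadPoolExecutor.AbortPolicy());     // fail fast when saturated
    }
}
```

Callers catch RejectedExecutionException and go to the fallback, so a hung dependency can hold at most threads + queueCapacity pieces of work.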

Semaphore bulkhead

Limits concurrent in-flight calls without extra threads—good for reactive stacks. Resilience4j Bulkhead with maxConcurrentCalls and maxWaitDuration (zero wait = immediate reject when full).

resilience4j.bulkhead:
  instances:
    recommendations:
      maxConcurrentCalls: 25
      maxWaitDuration: 0

resilience4j.thread-pool-bulkhead:
  instances:
    legacySoap:
      maxThreadPoolSize: 10
      coreThreadPoolSize: 4
      queueCapacity: 20

💡 Pro Tip

Layer bulkhead inside circuit breaker: the breaker stops calls when the failure rate is high; the bulkhead caps concurrency when the dependency is slow but not yet returning error status codes.

Rate limiting — protect yourself and neighbors

Rate limiters cap how many requests proceed in a time window—protecting downstream capacity, enforcing fair use, and absorbing abusive traffic patterns.

Algorithms compared

Algorithm	Behavior	Trade-off
Token bucket	Tokens refill at steady rate; burst allowed up to bucket size.	Smooth average rate with controlled bursts—common default.
Leaky bucket	Requests queue, exit at fixed rate.	Smooth output; spikes wait or drop—predictable downstream rate.
Fixed window	Count requests per clock window (e.g. per minute).	Simple; boundary spikes at window rollover.
Sliding window	Count over rolling interval.	Smoother than fixed; more state to track.
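
To make the first row concrete, here is a single-threaded token bucket sketch with an injectable clock. TokenBucket is a hypothetical class and is not the algorithm used internally by Resilience4j or Spring Cloud Gateway.

```java
import java.util.function.LongSupplier;

/** Single-threaded token bucket sketch with an injectable nanosecond clock. */
final class TokenBucket {
    private final double capacity;          // maximum burst size
    private final double refillPerSecond;   // steady refill rate
    private final LongSupplier nanoClock;
    private double tokens;
    private long lastRefill;

    TokenBucket(double capacity, double refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.nanoClock = nanoClock;
        this.tokens = capacity;             // start full so an initial burst is allowed
        this.lastRefill = nanoClock.getAsLong();
    }

    /** Takes one token if available; false means the request should be rejected (e.g. with 429). */
    boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        double elapsedSeconds = (now - lastRefill) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production the clock would be System::nanoTime and access would need synchronization; both are left out to keep the refill arithmetic visible.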

Resilience4j RateLimiter: per-instance in JVM; good for protecting one service’s calls to a fragile dependency.
Spring Cloud Gateway RequestRateLimiter + Redis: distributed limit by user ID or API key at the edge. Redis Lua scripts atomically decrement counters for cluster-wide consistency.

spring:
  cloud:
    gateway:
      routes:
        - id: orders
          uri: lb://order-service
          filters:
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 50
                redis-rate-limiter.burstCapacity: 100
                key-resolver: "#{@userKeyResolver}"

⚖️ Trade-off

Rate limits improve stability but frustrate legitimate spikes (product launches). Pair with queueing/async for absorbable work and document 429 behavior in client SDKs.

Fallback strategies — something beats nothing

When a dependency fails or the circuit is OPEN, fallbacks define the degraded experience—cached data, empty lists, static defaults, or explicit “try again” errors.

Common patterns

Return cached data — stale recommendations beat blank homepage; stamp with cachedAt for transparency internally.
Return default / empty — empty related products list; zero balance placeholder with banner in UI.
Fail fast — propagate 503 with clear message when operation cannot proceed (payment must not silently skip).
Graceful degradation — core checkout works; non-critical features disabled via feature flags.
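
The first two patterns (cached data, then a default) can be combined in one small wrapper. This is a sketch for read paths only; CachedFallback is a hypothetical class, and a real cache would also track how old the stored value is.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical wrapper: serves the last good value when the live call fails. */
final class CachedFallback<T> {
    private final AtomicReference<T> lastGood = new AtomicReference<>();
    private final T defaultValue;

    CachedFallback(T defaultValue) {
        this.defaultValue = defaultValue;
    }

    T get(Supplier<T> liveCall) {
        try {
            T fresh = liveCall.get();
            lastGood.set(fresh);                           // remember the latest successful response
            return fresh;
        } catch (RuntimeException e) {
            T stale = lastGood.get();
            return stale != null ? stale : defaultValue;   // stale beats empty, empty beats an error
        }
    }
}
```

For operations that must not proceed on stale data, such as payments, the caller would let the exception propagate instead of using a wrapper like this.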

When fallback is not enough — compensation

Fallbacks cover read paths well. Write paths need sagas and compensating transactions: if Payment succeeded but Order commit failed, issue refund or mark order PENDING_REVIEW for manual reconciliation. Never hide failed writes behind silent fallbacks—financial and inventory domains require explicit state machines.
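
The Payment/Order example can be sketched as a two-step saga with one compensating action. CheckoutSaga, its Outcome values, and the functional parameters are hypothetical; a real saga would persist each step so that recovery survives a crash.

```java
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical two-step saga: charge, then commit the order; refund if the commit fails. */
final class CheckoutSaga {
    enum Outcome { COMPLETED, REFUNDED, PENDING_REVIEW }

    static Outcome run(Function<String, String> charge,   // orderId -> paymentId
                       Consumer<String> commitOrder,      // takes orderId
                       Consumer<String> refund,           // takes paymentId
                       String orderId) {
        String paymentId = charge.apply(orderId);   // a failure here propagates: nothing to undo yet
        try {
            commitOrder.accept(orderId);
            return Outcome.COMPLETED;
        } catch (RuntimeException commitFailure) {
            try {
                refund.accept(paymentId);           // compensating transaction
                return Outcome.REFUNDED;
            } catch (RuntimeException refundFailure) {
                return Outcome.PENDING_REVIEW;      // needs manual reconciliation
            }
        }
    }
}
```

Every path ends in an explicit state, which is the opposite of a silent fallback.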

🚫 Anti-Pattern

Fallback that returns HTTP 200 with empty body while the operation failed—mobile apps show success; support tickets explode. Match HTTP semantics to business outcome.

Resilience4j full stack — composing policies

Real endpoints stack multiple decorators. Order matters: outer layers see failures from inner layers; wrong ordering creates surprising behavior in production.

Add resilience4j-spring-boot3 and spring-boot-starter-aop. Annotations (@CircuitBreaker, @Retry, @Bulkhead, @RateLimiter, @TimeLimiter) apply via AOP around service methods or Feign clients.

Recommended execution order (outer → inner)

RateLimiter — reject overload before spending threads.
CircuitBreaker — fail fast if dependency unhealthy.
Bulkhead — cap concurrent calls.
TimeLimiter — bound wait time.
Retry — retry only the innermost business call (controversial: some teams place retry outside breaker; document team standard).

flowchart TB
  IN[Incoming call] --> RL[RateLimiter]
  RL --> CB[CircuitBreaker]
  CB --> BH[Bulkhead]
  BH --> TL[TimeLimiter]
  TL --> RT[Retry]
  RT --> SVC[Remote service]

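// Annotation order in source does not set execution order. Resilience4j's default aspect order is
// Retry(CircuitBreaker(RateLimiter(TimeLimiter(Bulkhead(call))))); to get the order recommended above,
// set the *AspectOrder properties (e.g. resilience4j.retry.retryAspectOrder) or compose decorators in code.
// @TimeLimiter is omitted because it only applies to async return types such as CompletableFuture.
@RateLimiter(name = "catalog")
@CircuitBreaker(name = "catalog", fallbackMethod = "catalogFallback")
@Bulkhead(name = "catalog")
@Retry(name = "catalog")
public ProductPage loadProduct(String sku) {
    return catalogClient.getProduct(sku);
}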
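
To see why ordering matters, the sketch below composes layers functionally with plain JDK types. Decorators, compose, and tracing are hypothetical names; the tracing layers only record the order in which they are entered and stand in for real policies.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

final class Decorators {
    /** Wraps 'call' so that layers.get(0) is outermost and the last layer sits next to the call. */
    static <T> Supplier<T> compose(Supplier<T> call, List<UnaryOperator<Supplier<T>>> layers) {
        Supplier<T> decorated = call;
        for (int i = layers.size() - 1; i >= 0; i--) {
            decorated = layers.get(i).apply(decorated);   // wrap the innermost layer first
        }
        return decorated;
    }

    /** A stand-in layer that only records when it is entered; a real layer would enforce a policy. */
    static <T> UnaryOperator<Supplier<T>> tracing(String name, List<String> log) {
        return inner -> () -> {
            log.add(name);
            return inner.get();
        };
    }
}
```

Composing tracing layers named rateLimiter, circuitBreaker, and retry in that order and invoking the result enters them outer to inner before the remote call runs, which is the property the execution-order list relies on.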

Actuator and metrics

Expose /actuator/circuitbreakers, /actuator/ratelimiters, and /actuator/metrics/resilience4j.circuitbreaker.calls (with Prometheus registry). Alert on sustained OPEN state, on rising resilience4j.circuitbreaker.not.permitted.calls, and on bulkhead saturation (resilience4j.bulkhead.available.concurrent.calls pinned at zero); these are signs that config is too tight or a dependency is degraded.

💡 Pro Tip

Use named instances per dependency (payment, inventory), not one global breaker—failure in Email must not block Payment.

Hystrix — legacy but still in the wild

Netflix Hystrix pioneered circuit breakers in JVM microservices. It entered maintenance mode; Resilience4j and service mesh sidecars replaced most greenfield usage—but you will still read Hystrix in older code and blog posts.

Hystrix provided thread-pool isolation, fallbacks, and the famous dashboard. Spring Cloud Netflix integrated it via @HystrixCommand. Limitations that drove migration: the overhead of its blocking thread-pool model, no first-class integration with Reactor-based stacks, and the end of active development at Netflix.

Migration path: map Hystrix command groups to Resilience4j named instances; replace dashboard with Grafana + Micrometer; move edge resilience to Envoy/Istio where appropriate. Do not start new Hystrix projects in 2026.

Tuning and observability — prove it works before the incident

Resilience config in YAML is guesswork until load tests and game days validate it. Metrics tell you if breakers trip too eagerly or never open while users suffer.

Load testing with slow/fault-injecting dependencies (Toxiproxy, Litmus, Istio fault injection) verifies threads release under timeout and breakers open under error rate. Chaos engineering (controlled pod kills, network partition) validates fallbacks and sagas—not just happy-path unit tests with mocks.

Metrics to watch

Breaker state transitions per dependency
Retry count and exhausted retries (signals flaky network or bad thresholds)
Bulkhead rejected calls
Rate limiter wait time and timeouts
End-to-end latency p99 on user journeys vs sum of internal calls

Tie thresholds to SLOs from Observability: if Checkout p99 budget is 800 ms and Payment p99 is 400 ms, a sequential Inventory call has at most 400 ms left, so a 2 s Inventory timeout exceeds the entire budget on its own. Revisit after every major traffic pattern change (Black Friday, product launch).

🎯 Interview Tip

Explain a concrete stack: “2 s WebClient timeout, up to 3 attempts with jittered backoff on 503 only, breaker opens at 50% failures over 20 calls, fallback to cached catalog, alert if OPEN > 5 min.”