Downstream slow — circuit breaker and backpressure

Scenario

A partner API or internal microservice degrades from 50ms to 30s. Your service keeps calling it: threads block, queues grow, memory rises, and everything times out, including endpoints that do not use that dependency. You need to fail fast, bound concurrency per downstream, and add a breaker that stops hammering the sick service.

After reading, you should be able to:

Explain circuit breaker states and how they differ from a timeout alone.
Combine timeouts, bulkheads, and bounded queues to stop unbounded work.
Configure Resilience4j-style breakers and measure open/half-open behavior.
Design fallbacks without hiding data corruption — pair with idempotency.

Why — slow dependencies infect the caller

Without limits, each incoming request may block a thread waiting on a sick downstream. Under load, every thread ends up in a socket read, the thread pool is exhausted, new requests queue at the edge, and the failure propagates upstream. A circuit breaker stops calling the failing dependency for a cooldown period once its error rate or slow-call rate crosses a threshold, so callers fail fast instead of queuing forever. Backpressure means rejecting or shedding work you cannot process, rather than buffering it without bound.

Timeout alone is not enough

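With a timeout alone, every request still attempts the call and holds a thread until the timeout fires, which wastes threads and keeps loading the already sick service.
Once the breaker opens, most calls are rejected immediately (or served by a fallback); only occasional probe calls go through in half-open.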
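A back-of-envelope calculation makes the point concrete. The sketch below applies Little's law (in-flight calls = arrival rate × time each call holds a thread). The 100 requests/s arrival rate is an assumed figure for illustration; the 50ms, 30s, and 5s durations come from this guide.

```java
final class ThreadDemand {
    // Little's law: average in-flight calls = arrival rate x time each call holds a thread.
    static long threadsInFlight(long requestsPerSecond, long holdMillis) {
        return requestsPerSecond * holdMillis / 1000;
    }
}
```

At 100 requests/s, `threadsInFlight(100, 50)` is 5 when the downstream is healthy and `threadsInFlight(100, 30_000)` is 3000 when it takes 30s. A 5s read timeout only brings that down to `threadsInFlight(100, 5_000)`, which is 500 and still more than a typical worker pool holds. An open breaker removes the hold time for rejected calls entirely.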

Circuit states

State	Behavior
Closed	Normal calls; failures counted
Open	Calls fail fast (no downstream hit)
Half-open	Limited trial calls; success → closed, fail → open
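The transitions in the table can be sketched as a small state machine using only the JDK. This is an illustrative simplification, not Resilience4j's implementation: `TinyBreaker` is a hypothetical class that trips on consecutive failures rather than a sliding-window failure rate, and it does not cap the number of concurrent trial calls in half-open.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal breaker sketch: closed -> open after N consecutive failures -> half-open after a cooldown. */
final class TinyBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures that trip the breaker
    private final long openMillis;       // cooldown before a trial call is allowed
    private final LongSupplier clock;    // injectable time source, in milliseconds
    private State state = State.CLOSED;
    private int failures;
    private long openedAt;

    TinyBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() {
        // An open breaker moves to half-open once the cooldown has elapsed.
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Supplier<T> downstream) {
        if (state() == State.OPEN) {
            // Fail fast: the downstream is not contacted at all.
            throw new IllegalStateException("breaker open: call not permitted");
        }
        try {
            T result = downstream.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        failures = 0;
        state = State.CLOSED;  // a successful trial call closes the breaker
    }

    private synchronized void onFailure() {
        failures++;
        // A failed trial call, or too many failures while closed, opens the breaker.
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

Passing the clock in as a `LongSupplier` (for example `System::currentTimeMillis` in production) keeps the cooldown testable without sleeping.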

What — recognize unbounded queuing

Symptoms: one downstream span dominates traces; a thread dump shows many threads in an HTTP read to the same host; memory and queue depth grow; error rate rises service-wide.
Metrics: Resilience4j resilience4j.circuitbreaker.state and resilience4j.circuitbreaker.not.permitted.calls; Tomcat queue length; rejected executions.
Which dependency: aggregate traces by downstream service name.
Unbounded structures: LinkedBlockingQueue default capacity = Integer.MAX_VALUE (this is what Executors.newFixedThreadPool uses); unbounded CompletableFuture chains; no limit on async retries.
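The difference between a default and a bounded queue is easy to check with the JDK alone. In this small sketch, `QueueBounds` and `acceptedOffers` are hypothetical names chosen for illustration.

```java
import java.util.concurrent.BlockingQueue;

final class QueueBounds {
    // offer() never blocks: it returns false once a bounded queue is full.
    static int acceptedOffers(BlockingQueue<Integer> queue, int attempts) {
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            if (queue.offer(i)) {
                accepted++;
            }
        }
        return accepted;
    }
}
```

With 1000 attempts, an `ArrayBlockingQueue` of capacity 100 accepts 100 items and refuses the rest, while a default `LinkedBlockingQueue` accepts all 1000 and would keep accepting until memory runs out; its `remainingCapacity()` starts at `Integer.MAX_VALUE`.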

How — layer defenses

1. Timeout (every outbound call)

# HttpClient / RestTemplate / WebClient
connectTimeout: 2s
readTimeout: 5s

Keep these shorter than the upstream gateway timeout, so your service answers before the gateway gives up (see the 502 guide).
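As one possible rendering of those two settings, here is a sketch using the JDK's built-in `java.net.http.HttpClient` (Java 11+). The class name `OutboundTimeouts` is hypothetical. Note that this client has no separate read timeout; the per-request `timeout` bounds the wait for the response and is the closest equivalent. RestTemplate and WebClient are configured differently.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class OutboundTimeouts {
    // The connect timeout is set once on the client.
    static HttpClient client() {
        return HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();
    }

    // The request timeout limits how long send() waits for the response;
    // when it expires, the client throws HttpTimeoutException.
    static HttpRequest request(String url) {
        return HttpRequest.newBuilder(URI.create(url))
            .timeout(Duration.ofSeconds(5))
            .GET()
            .build();
    }
}
```

Setting the timeout on every request, rather than relying on a default, matters because the JDK client waits indefinitely when no timeout is given.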

2. Circuit breaker (Resilience4j example)

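import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
  .failureRateThreshold(50)                          // open at >= 50% failed calls
  .waitDurationInOpenState(Duration.ofSeconds(30))   // stay open 30s before half-open
  .slidingWindowSize(20)                             // evaluate over the last 20 calls
  .permittedNumberOfCallsInHalfOpenState(5)          // trial calls while half-open
  .slowCallDurationThreshold(Duration.ofSeconds(3))  // calls slower than 3s count as slow
  .slowCallRateThreshold(50)                         // open at >= 50% slow calls
  .build();

CircuitBreaker breaker = CircuitBreaker.of("paymentApi", config);
Supplier<PaymentResult> decorated = CircuitBreaker
  .decorateSupplier(breaker, () -> paymentClient.charge(req));
// decorated.get() throws CallNotPermittedException while the breaker is open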

3. Bulkhead (limit concurrent calls per dependency)

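import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;

Bulkhead bulkhead = Bulkhead.of("paymentApi",
  BulkheadConfig.custom().maxConcurrentCalls(10).build());

// At most 10 concurrent calls to payment; with the default maxWaitDuration of 0,
// further callers get BulkheadFullException immediately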

Prevents one slow API from consuming all HTTP workers.
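The mechanism behind a semaphore bulkhead fits in a few lines of plain Java. The class below is a hypothetical sketch of the idea, not the Resilience4j implementation: it holds one permit per allowed concurrent call and rejects without waiting when none is free.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> downstream) {
        // tryAcquire() does not wait: a full bulkhead rejects immediately.
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return downstream.get();
        } finally {
            permits.release();  // always hand the permit back, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Using one instance per downstream means a slow payment API can occupy at most its own permits, while calls to other dependencies keep their own budget.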

4. Bounded queue + reject policy

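import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ThreadPoolExecutor executor = new ThreadPoolExecutor(
  8, 8, 0L, TimeUnit.MILLISECONDS,        // fixed pool of 8 threads
  new ArrayBlockingQueue<>(100),          // bounded: at most 100 waiting tasks
  new ThreadPoolExecutor.AbortPolicy());  // throws RejectedExecutionException when full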
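To see the reject policy in action, the following sketch saturates such a pool with tasks that block until released and counts the rejections. `BoundedPoolDemo` and its parameters are hypothetical names for illustration; a real service would typically translate each rejection into a 503.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPoolDemo {
    /** Submits blocked tasks to a fixed pool with a bounded queue; returns how many were rejected. */
    static int rejectedCount(int threads, int queueSize, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueSize),
            new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    // Each task simulates a call stuck on a slow downstream.
                    pool.execute(() -> awaitQuietly(release));
                } catch (RejectedExecutionException e) {
                    rejected++;  // pool and queue are both full: fail fast
                }
            }
        } finally {
            release.countDown();  // let the stuck tasks finish
            pool.shutdown();
        }
        return rejected;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With 8 threads and a queue of 100, the pool holds at most 108 tasks, so `rejectedCount(8, 100, 120)` returns 12: the overflow is refused immediately instead of piling up in memory.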

5. Fallback (careful)

Return a cached last-good response, switch the feature off in a degraded mode, or return 503 with a Retry-After header.
Do not fake success for payments; fail closed.
Fallbacks must be safe under retries (see the idempotency guide).
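A cached last-good fallback for read-only data might look like the sketch below. `LastGoodFallback` is a hypothetical helper; it assumes the downstream never legitimately returns null and treats a null result as a failure. An empty result means there is nothing safe to serve, so the caller should answer 503.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

final class LastGoodFallback<T> {
    private final AtomicReference<T> lastGood = new AtomicReference<>();

    /** Returns the live value, or the last good one if the call fails; empty if neither exists. */
    Optional<T> read(Supplier<T> downstream) {
        try {
            T fresh = Objects.requireNonNull(downstream.get(), "downstream returned null");
            lastGood.set(fresh);  // remember the latest successful response
            return Optional.of(fresh);
        } catch (RuntimeException e) {
            // Degraded path: serve stale data if there is any, never an invented value.
            return Optional.ofNullable(lastGood.get());
        }
    }
}
```

This shape suits reads such as exchange rates or catalog data. A payment charge should not go through it: there, let the failure surface so the caller sees an error rather than a stale or fabricated success.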

6. Stack order (typical)

Request → Bulkhead → CircuitBreaker → TimeLimiter → HTTP call
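Decorators are applied from the inside out, so the last wrapper applied is the first one a request passes through. The sketch below shows only that ordering: the layers are hypothetical placeholders that record their names, not real bulkhead, breaker, or time-limiter logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class StackOrder {
    // Wraps `inner` so that entering this layer is recorded before the inner layer runs.
    static <T> Supplier<T> layer(String name, List<String> trace, Supplier<T> inner) {
        return () -> {
            trace.add(name);
            return inner.get();
        };
    }

    static List<String> traceOneCall() {
        List<String> trace = new ArrayList<>();
        Supplier<String> httpCall = () -> {
            trace.add("HTTP call");
            return "200 OK";
        };
        // Innermost wrapper first; the bulkhead is applied last, so it runs first.
        Supplier<String> timed = layer("TimeLimiter", trace, httpCall);
        Supplier<String> guarded = layer("CircuitBreaker", trace, timed);
        Supplier<String> bounded = layer("Bulkhead", trace, guarded);
        bounded.get();
        return trace;
    }
}
```

`traceOneCall()` returns the layers in the order listed above: Bulkhead, CircuitBreaker, TimeLimiter, HTTP call. Putting the bulkhead outermost means a rejected call costs nothing further down the stack.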

7. Observability and ops

Alert when a breaker stays open for more than 1 minute.
Dashboard: failure rate, slow call rate, not-permitted calls.
Runbook: dependency status page; disable feature flag for optional path.
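The alert condition can be expressed as a small predicate. This sketch assumes you record the instant the breaker last opened (for example from a state-transition event); `OpenBreakerAlert` and its parameters are hypothetical names, and in practice the rule usually lives in the monitoring system rather than in application code.

```java
import java.time.Duration;
import java.time.Instant;

final class OpenBreakerAlert {
    // openedAt == null means the breaker is not currently open.
    static boolean shouldAlert(Instant openedAt, Instant now, Duration limit) {
        return openedAt != null
            && Duration.between(openedAt, now).compareTo(limit) > 0;
    }
}
```

A short blip that opens the breaker for one 30s cooldown does not page anyone; a breaker that keeps reopening past the limit does.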

Verify

Chaos test: inject a fixed 10s delay into the downstream; the breaker should open and your API's p99 should stay bounded.
Thread count stays stable; threads are not all blocked on one host.
The breaker goes half-open and then closes again once the dependency heals.
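A first check of the bounded-latency claim can run locally before any real chaos tooling is involved. The sketch below simulates a slow downstream with `Thread.sleep` and wraps it in a time limit using only the JDK; `BoundedLatencyCheck` is a hypothetical name, and a real test would target your service with an injected network delay instead.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedLatencyCheck {
    /** Runs a call that sleeps for delayMillis under a time limit; returns elapsed wall-clock millis. */
    static long elapsedWithLimit(long delayMillis, long limitMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "slow-downstream");
            t.setDaemon(true);  // never keeps the JVM alive
            return t;
        });
        long start = System.nanoTime();
        Future<String> call = pool.submit(() -> {
            Thread.sleep(delayMillis);  // simulated slow downstream
            return "late";
        });
        try {
            call.get(limitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            call.cancel(true);  // interrupt the stuck call and free its thread
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

With a 10s simulated delay and a 100ms limit, the caller gets control back after roughly the limit rather than the full delay, which is the property the chaos test should confirm end to end.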

Interview one-liner

“I set aggressive timeouts, a bulkhead cap per downstream, and a circuit breaker on failure and slow-call rate so we fail fast instead of queuing unbounded threads, with a safe fallback or 503, and metrics on breaker state.”