Downstream slow — circuit breaker and backpressure
Scenario
A partner API or internal microservice degrades from 50 ms to 30 s. Your service keeps calling it: threads block, queues grow, memory rises, and everything times out, including endpoints that do not use that dependency. You need to fail fast, bound concurrency per downstream, and use a breaker that stops hammering the sick service.
After reading, you should be able to:
- Explain circuit breaker states and how they differ from a timeout alone.
- Combine timeouts, bulkheads, and bounded queues to stop unbounded work.
- Configure Resilience4j-style breakers and measure open/half-open behavior.
- Design fallbacks without hiding data corruption — pair with idempotency.
Why — slow dependencies infect the caller
Without limits, each incoming request may block a thread waiting on a sick downstream. Under load, every thread ends up parked in a socket read, the thread pool is exhausted, new requests queue at the edge, and the failure propagates upstream. A circuit breaker stops calling the failing dependency for a cooldown period after the error rate or slow-call rate crosses a threshold, so callers fail fast instead of queuing forever. Backpressure means rejecting or shedding work when you cannot process it, rather than buffering without bound.
Timeout alone is not enough
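- With a timeout only, every request still attempts the call and holds a thread until the timeout fires. That wastes threads and keeps loading the already sick service.
- With a breaker, once it opens most calls are rejected immediately (or served by a fallback); only occasional probe calls go through in half-open.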
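A rough Little's law estimate shows why the timeout alone does not protect the thread pool: busy threads are approximately the arrival rate multiplied by the time each call holds a thread. The sketch below assumes a hypothetical 100 requests per second. The 50 ms and 30 s latencies come from the scenario, and 5 s is an illustrative read timeout.

```java
// Rough Little's law estimate: busy threads = arrival rate * seconds each call holds a thread.
// The 100 req/s arrival rate is a hypothetical figure chosen for illustration.
class ThreadDemand {
    static long busyThreads(double requestsPerSecond, double holdSeconds) {
        return Math.round(requestsPerSecond * holdSeconds);
    }

    public static void main(String[] args) {
        double rps = 100.0;
        System.out.println("healthy (50 ms):  " + busyThreads(rps, 0.05)); // 5
        System.out.println("degraded (30 s):  " + busyThreads(rps, 30.0)); // 3000
        System.out.println("5 s timeout only: " + busyThreads(rps, 5.0));  // 500
    }
}
```

Under these assumptions a 5 s timeout still ties up about 500 threads, which is more than Tomcat's default of 200 worker threads. An open breaker removes most of that demand because a rejected call holds its thread for almost no time.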
Circuit states
| State | Behavior |
|---|---|
| Closed | Normal calls; failures counted |
| Open | Calls fail fast (no downstream hit) |
| Half-open | Limited trial calls; success → closed, fail → open |
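The transitions in the table can be sketched as a small state machine. This is a minimal illustration and not Resilience4j's implementation: it assumes a consecutive-failure threshold instead of a sliding-window rate, allows a single trial call in half-open, and takes the current time as a parameter so the behavior is deterministic. The class name `SimpleBreaker` is hypothetical.

```java
import java.util.function.Supplier;

// Minimal illustration of the three states; not Resilience4j's implementation.
// Assumptions: consecutive-failure threshold, one trial call in half-open,
// and a caller-supplied clock (nowMillis) so the example is deterministic.
class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized State state() { return state; }

    // synchronized keeps the sketch short but serializes calls; a production
    // breaker must not hold a lock across the downstream call.
    synchronized <T> T call(Supplier<T> downstream, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < openMillis) {
                // Fail fast: the downstream is not contacted at all.
                throw new IllegalStateException("breaker open: call not permitted");
            }
            state = State.HALF_OPEN; // cooldown elapsed, allow a trial call
        }
        try {
            T result = downstream.get();
            state = State.CLOSED;    // success closes (or keeps closed) the breaker
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;  // trip: stop calling the downstream
                openedAt = nowMillis;
            }
            throw e;
        }
    }
}
```

The key difference from a timeout is visible in the open branch: the rejection happens before the downstream supplier is invoked, so no thread waits on the sick service.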
What — recognize unbounded queuing
- Symptoms: one downstream span dominates traces; thread dump shows many threads in HTTP read to the same host; memory/queue depth grows; error rate rises service-wide.
- Metrics: Resilience4j resilience4j.circuitbreaker.state and the call-not-permitted count; Tomcat queue length; rejected executions.
- Which dependency: trace aggregation by downstream service name.
- Unbounded structures: LinkedBlockingQueue default capacity = Integer.MAX_VALUE; unbounded CompletableFuture chains; no limit on async retries.
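The difference between the default LinkedBlockingQueue and a bounded queue is easy to check with the JDK alone. The sketch below counts how many non-blocking offers each queue accepts; the class and method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueBounds {
    // Offers `attempts` items without blocking and returns how many were accepted.
    static int accepted(BlockingQueue<Integer> queue, int attempts) {
        int ok = 0;
        for (int i = 0; i < attempts; i++) {
            if (queue.offer(i)) {   // offer() returns false instead of waiting when full
                ok++;
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        // No-arg constructor: capacity is Integer.MAX_VALUE, so it keeps accepting work.
        System.out.println(new LinkedBlockingQueue<Integer>().remainingCapacity() == Integer.MAX_VALUE);
        System.out.println(accepted(new LinkedBlockingQueue<>(), 5));   // 5
        // Capacity 2: the third and later offers are refused.
        System.out.println(accepted(new ArrayBlockingQueue<>(2), 5));   // 2
    }
}
```

Executors.newFixedThreadPool uses an unbounded LinkedBlockingQueue internally, so it is a common place for this problem to hide.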
How — layer defenses
1. Timeout (every outbound call)
# HttpClient / RestTemplate / WebClient
connectTimeout: 2s
readTimeout: 5s
Keep both shorter than the upstream gateway timeout (see the 502 guide).
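As one way to apply these values in code, here is a minimal sketch using the JDK's built-in java.net.http client (Java 11+). It is an assumption that this client is in use; RestTemplate and WebClient are configured differently. The JDK request timeout covers waiting for the response, which is close to but not identical to a socket read timeout. The class name is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

class TimeoutConfig {
    // Connect timeout is set once on the client.
    static HttpClient client() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
    }

    // Response timeout is set per request; exceeding it raises HttpTimeoutException.
    static HttpRequest request(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
    }
}
```

Setting the timeout on every request, rather than relying on a client default, makes it harder for a new call site to ship without one.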
2. Circuit breaker (Resilience4j example)
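import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // open at >= 50% failures
    .waitDurationInOpenState(Duration.ofSeconds(30))   // cooldown before half-open
    .slidingWindowSize(20)                             // evaluate the last 20 calls
    .permittedNumberOfCallsInHalfOpenState(5)          // trial calls in half-open
    .slowCallDurationThreshold(Duration.ofSeconds(3))  // calls slower than 3s count as slow
    .slowCallRateThreshold(50)                         // open at >= 50% slow calls
    .build();

CircuitBreaker breaker = CircuitBreaker.of("paymentApi", config);

// Throws CallNotPermittedException without calling paymentClient while the breaker is open
Supplier<PaymentResult> decorated = CircuitBreaker
    .decorateSupplier(breaker, () -> paymentClient.charge(req));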
3. Bulkhead (limit concurrent calls per dependency)
Bulkhead bulkhead = Bulkhead.of("paymentApi",
    BulkheadConfig.custom().maxConcurrentCalls(10).build());
// Only 10 threads can call payment at once; rest get BulkheadFullException fast
Prevents one slow API from consuming all HTTP workers.
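The mechanism behind a semaphore-style bulkhead fits in a few lines of plain JDK code. This is a sketch of the idea, not Resilience4j's implementation; the class name `SemaphoreBulkhead` is hypothetical, and RejectedExecutionException stands in for BulkheadFullException.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> downstream) {
        if (!permits.tryAcquire()) {   // no waiting: reject immediately when full
            throw new RejectedExecutionException("bulkhead full");
        }
        try {
            return downstream.get();
        } finally {
            permits.release();         // always return the permit, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Using tryAcquire() rather than acquire() is the important choice here: a blocking acquire would simply move the queue from the socket to the semaphore.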
4. Bounded queue + reject policy
ThreadPoolExecutor executor = new ThreadPoolExecutor(
    8, 8,
    0L, TimeUnit.MILLISECONDS,
    new ArrayBlockingQueue<>(100),          // bounded
    new ThreadPoolExecutor.AbortPolicy());  // fail fast when full
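To see the reject policy in action, the sketch below fills a pool with tasks that block until released and counts how many submissions are refused. The class and method names are hypothetical; the pool parameters mirror the configuration above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolDemo {
    // Submits `tasks` jobs that all block until released; returns how many were rejected.
    static int countRejected(int threads, int queueCapacity, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();   // simulate a call stuck on a slow downstream
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++;   // the caller learns immediately, e.g. to answer 503
                }
            }
        } finally {
            release.countDown();  // unblock the workers so the pool can drain
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 8 running + 100 queued = 108 accepted, so 12 of 120 are rejected.
        System.out.println(countRejected(8, 100, 120));
    }
}
```

The total work in flight is capped at threads plus queue capacity, which is what keeps memory and latency bounded when the downstream stalls.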
5. Fallback (careful)
- Return cached last-good response, degraded feature off, or 503 with retry-after.
- Do not fake success for payments—fail closed.
- Fallback must be idempotent-safe — idempotency guide.
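A cached last-good fallback for read-only data could look like the sketch below. It assumes the downstream returns non-null values and signals failure with a runtime exception; the class name `CachedFallback` is hypothetical. It is not suitable for writes such as payments, where the safe choice is to fail.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class CachedFallback<T> {
    private final AtomicReference<T> lastGood = new AtomicReference<>();

    // Returns the live value when the call succeeds, otherwise the last good value.
    // An empty result means there is no safe answer: return 503 rather than invent one.
    Optional<T> get(Supplier<T> downstream) {
        try {
            T fresh = downstream.get();   // assumed non-null
            lastGood.set(fresh);
            return Optional.of(fresh);
        } catch (RuntimeException e) {
            return Optional.ofNullable(lastGood.get());
        }
    }
}
```

Returning Optional forces the caller to handle the case where nothing has been cached yet, instead of silently serving a default.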
6. Stack order (typical)
Request → Bulkhead → CircuitBreaker → TimeLimiter → HTTP call
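The order matters because the outermost layer decides first. The sketch below uses stand-in wrappers that only record when they are entered and left; they are placeholders for the real bulkhead, breaker, and time limiter, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class LayerOrder {
    // Wraps `inner` so that entering and leaving the layer is recorded in `trace`.
    static <T> Supplier<T> layer(String name, List<String> trace, Supplier<T> inner) {
        return () -> {
            trace.add("enter " + name);
            try {
                return inner.get();
            } finally {
                trace.add("exit " + name);
            }
        };
    }

    static List<String> run() {
        List<String> trace = new ArrayList<>();
        Supplier<String> http = () -> { trace.add("HTTP call"); return "ok"; };
        // The outermost wrapper runs first: Bulkhead wraps CircuitBreaker wraps TimeLimiter.
        Supplier<String> stacked =
                layer("Bulkhead", trace,
                    layer("CircuitBreaker", trace,
                        layer("TimeLimiter", trace, http)));
        stacked.get();
        return trace;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

With the bulkhead outermost, a call rejected because the bulkhead is full never reaches the breaker, so it is not counted as a downstream failure. Some teams prefer the breaker outermost so that rejected calls do not take a bulkhead permit at all; either order can work as long as it is chosen deliberately.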
7. Observability and ops
- Alert when breaker open > 1 min.
- Dashboard: failure rate, slow call rate, not-permitted calls.
- Runbook: dependency status page; disable feature flag for optional path.
Verify
- Chaos: downstream fixed 10s delay → breaker opens; your API p99 stays bounded.
- Thread count stable; not all blocked on one host.
- Half-open recovery when dependency heals.
Interview one-liner
“I set aggressive timeouts, a bulkhead cap per downstream, and a circuit breaker on failure and slow-call rate so we fail fast instead of queuing unbounded threads, with a safe fallback or 503, and metrics on breaker state.”