Handle very high concurrency safely

Scenario

You are designing (or rescuing) a service that must serve tens of thousands of RPS with predictable latency. Throwing more threads at the problem caused pool exhaustion, data races, and DB meltdown. You need a deliberate architecture: where state lives, how work is bounded, and which primitives actually scale.

After reading, you should be able to:

Why — concurrency is a system design problem

High concurrency is not “use more threads.” It is maximizing useful work per resource while bounding failure blast radius. Every shared resource—CPU, heap, DB connections, locks, downstream APIs—has a ceiling. Good design makes contention rare, queues explicit, and overload visible early.

Goals (define before picking patterns)

What breaks naive scale-up

Mistake | Result
Mutable shared caches in the JVM | Races, GC pressure, pod skew
Unbounded in-memory queues | OOM, GC storms
maxThreads ≫ DB pool | Threads blocked on Hikari
Global locks on hot paths | BLOCKED threads, low CPU utilization
Sync call chain to many services | Latency multiplies; cascading timeouts
No backpressure | Retry storms amplify load
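To make the "unbounded queue" and "no backpressure" rows concrete, here is a minimal sketch of a worker pool whose queue has a hard cap and which rejects excess work instead of buffering it. The class and method names are hypothetical, and the blocking task only stands in for a slow downstream call; a real service would translate the rejection into a 429 or 503.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolDemo {

    /**
     * Submits `tasks` blocking tasks to a pool of `threads` workers backed by a
     * queue of `queueSize` slots, and returns how many submissions were rejected.
     */
    static int submitAndCountRejections(int threads, int queueSize, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),      // bounded: memory use has a ceiling
                new ThreadPoolExecutor.AbortPolicy());    // full queue -> RejectedExecutionException
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();              // stand-in for a slow downstream call
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++;                           // shed load here instead of queueing forever
                }
            }
        } finally {
            release.countDown();                          // let the accepted tasks finish
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 workers + 3 queue slots can hold 5 of the 10 tasks; the rest are rejected.
        System.out.println("rejected=" + submitAndCountRejections(2, 3, 10));
    }
}
```

Rejecting early keeps overload visible to the caller; whether to abort, run on the caller's thread, or drop the oldest task is a policy choice that depends on the workload.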

What — design checklist and pattern map

Layered decisions (top to bottom)

  1. Stateless application tier: session and authority live in the DB, cache, or token, so any pod can serve any request. This enables horizontal scale.
  2. Partition work: shard by userId, tenantId, or Kafka partition key so hot keys do not serialize the world.
  3. Separate read and write paths: read replicas, materialized views, or CQRS when read QPS ≫ write QPS.
  4. Async for slow or spiky work: HTTP 202 + queue (SQS, Kafka, RabbitMQ) for email, reports, fraud scoring; the API stays fast.
  5. Cache with a strategy: TTL, stampede protection (computeIfAbsent, single-flight), invalidation rules; not “cache everything forever.”
  6. Protect dependencies: timeouts, circuit breakers (Resilience4j), bulkheads per downstream.
  7. Idempotency keys: safe client retries; dedupe table or unique constraint on Idempotency-Key.
  8. Admission control: rate limits at the gateway; max queue depth; reject or shed load before the JVM dies.
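As one illustration of the stampede protection mentioned in item 5, the sketch below shows a single-flight cache in plain JDK code: concurrent callers for the same key share one in-progress load instead of each hitting the backing store. The class names are hypothetical, TTL and eviction are deliberately left out, and the loader is an assumption standing in for a DB or remote call. It registers a CompletableFuture with putIfAbsent rather than running the loader inside computeIfAbsent, so a slow load does not hold a map lock.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class SingleFlightCache<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    SingleFlightCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = entries.putIfAbsent(key, mine);
        if (existing != null) {
            return existing.join();          // another caller owns the load; wait for its result
        }
        try {
            V value = loader.apply(key);     // only the caller that won putIfAbsent gets here
            mine.complete(value);
            return value;
        } catch (RuntimeException | Error e) {
            entries.remove(key, mine);       // do not cache failures
            mine.completeExceptionally(e);   // wake up waiters with the same failure
            throw e;
        }
    }

    void invalidate(K key) {
        entries.remove(key);
    }
}

class SingleFlightDemo {
    /** Runs `callers` threads against one key and returns how many times the loader ran. */
    static int loadsForConcurrentCallers(int callers) {
        AtomicInteger loads = new AtomicInteger();
        SingleFlightCache<String, String> cache = new SingleFlightCache<>(key -> {
            loads.incrementAndGet();
            try {
                Thread.sleep(50);            // simulated slow backing store
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return key.toUpperCase();
        });
        Thread[] threads = new Thread[callers];
        for (int i = 0; i < callers; i++) {
            threads[i] = new Thread(() -> cache.get("user:1"));
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return loads.get();
    }
}
```

In practice a caching library with built-in expiry would likely replace this class; the point here is only that one load per key is in flight at a time.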

Pattern → when to use

Pattern | Use when | Watch out
Horizontal scale (K8s HPA) | CPU-bound or stateless I/O; traffic grows | DB connection storm; cache coherence
Virtual threads (Java 21+) | Many blocking I/O calls per request | Still bounded by DB/CPU; carrier threads are pinned during native/JNI calls
Thread pool + bounded queue | CPU-bound batch, legacy servlet stack | Pool size vs exhaustion; always set timeouts
Reactive (WebFlux) | High fan-out I/O, team fluency in reactive | Never block the event loop; steep debug cost
Message queue | Decouple producers/consumers; absorb spikes | Ordering, poison messages, lag alerts
Bulkhead | One slow client or feature must not starve others | Separate executors / connection pools
Circuit breaker | Flaky downstream | Half-open storms; need fallbacks
Optimistic locking | Contended rows, retry OK | Client retry with backoff
Actor / single-writer per key | Complex in-memory state per entity | Operational complexity
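The bulkhead row can be sketched with a plain java.util.concurrent.Semaphore: each downstream gets its own small permit pool, and callers that cannot get a permit quickly receive a fallback instead of piling up. This is a hand-rolled illustration with hypothetical names, not the Resilience4j API, and the limits shown are assumptions.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class Bulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    Bulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Runs `action` if a permit is free within the wait budget, otherwise returns the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        boolean acquired = false;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
            if (!acquired) {
                return fallback.get();       // compartment is full: shed instead of waiting
            }
            return action.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        } finally {
            if (acquired) {
                permits.release();           // always return the permit, even if action throws
            }
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

A usage sketch: create one instance per downstream, for example `new Bulkhead(20, 50)` for a hypothetical partner API, so that a slow partner can tie up at most 20 request threads while other features keep their own capacity.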

Java primitives (safe defaults)

Capacity sketch (before build)

Per-instance peak RPS × p99 latency (s) ≈ in-flight requests per instance (Little's law; using p99 gives a conservative bound)
In-flight per instance × replicas ≤ downstream limits (DB connections, partner API QPS)

Example: 5k RPS, p99 = 200 ms → 5000 × 0.2 = ~1000 in-flight per region
If each holds 1 DB connection for the full 200 ms → ~1000 connections needed, OR lower latency / cache / async
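The same arithmetic can be kept as a small helper so the numbers are rechecked whenever the inputs change. This is a sketch with hypothetical method names; it uses integer milliseconds to avoid floating-point rounding and assumes each request holds exactly one connection for the whole latency window.

```java
class CapacitySketch {

    /** Little's law: in-flight requests = arrival rate x time in system (rounded up). */
    static long inFlight(long peakRps, long latencyMillis) {
        return (peakRps * latencyMillis + 999) / 1000;
    }

    /** Highest request rate a fixed connection budget can sustain if each request holds one connection. */
    static long maxRpsForConnections(long connections, long holdMillis) {
        return connections * 1000 / holdMillis;
    }

    /** Largest per-replica pool size that keeps replicas x pool within the database limit. */
    static long poolSizePerReplica(long dbConnectionLimit, long replicas) {
        return dbConnectionLimit / replicas;
    }

    public static void main(String[] args) {
        System.out.println("in-flight at 5k RPS, 200 ms: " + inFlight(5000, 200));
        System.out.println("RPS supported by 1000 conns held 200 ms: " + maxRpsForConnections(1000, 200));
        System.out.println("pool per replica (limit 400, 20 replicas): " + poolSizePerReplica(400, 20));
    }
}
```

The 400-connection limit and 20 replicas in `main` are illustrative values, not figures from a real deployment.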

How — implement and prove the design

Reference request path (synchronous API)

  1. Gateway: auth, rate limit, request size cap.
  2. Handler: validate, idempotency check, no shared mutable state.
  3. Read: cache → read replica; write: primary with short transaction.
  4. Downstream calls: dedicated client with timeout + circuit breaker + bulkhead.
  5. Response: consistent error model; retry only on idempotent operations.
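The idempotency check in step 2 can be sketched as follows. This is an in-memory stand-in, assuming a single JVM; in a real multi-pod service the map would be replaced by a dedupe table or unique constraint in the database, as the checklist suggests. The class name and the String response type are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class IdempotentHandler {
    // Stand-in for a dedupe table keyed by the Idempotency-Key header.
    private final ConcurrentHashMap<String, String> responsesByKey = new ConcurrentHashMap<>();

    /**
     * Runs `operation` at most once per key and replays the stored response on retries.
     * The operation must return a non-null response, otherwise nothing is recorded.
     */
    String handle(String idempotencyKey, Supplier<String> operation) {
        // computeIfAbsent applies the function atomically, at most once per absent key.
        return responsesByKey.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

A retry with the same key, for example `handler.handle("key-1", () -> chargeCard())` called twice with a hypothetical `chargeCard`, would charge once and return the first response both times. Because computeIfAbsent blocks other updates to the same map bin while the function runs, long operations are better guarded by the database constraint than by this map.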

Reference path (async / event-driven)

  1. API persists intent + publishes event (outbox pattern avoids lost messages).
  2. Consumers scale independently; partition count sets max parallelism per key.
  3. Dead-letter queue + replay tooling for poison messages.
  4. Status API or webhook for client completion.
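Step 2's point about partitions and per-key ordering can be illustrated inside one JVM: route every event for a key to the same single-threaded worker, so events for that key run in order while different keys proceed in parallel. This is a sketch with hypothetical names; it uses String.hashCode as the partitioner, which is an assumption and not the hash Kafka's default partitioner uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class KeyPartitionedExecutor {
    private final List<ExecutorService> partitions = new ArrayList<>();

    KeyPartitionedExecutor(int partitionCount, int queueCapacityPerPartition) {
        for (int i = 0; i < partitionCount; i++) {
            // One thread per partition gives a single writer per key; the queue is bounded on purpose.
            partitions.add(new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<>(queueCapacityPerPartition)));
        }
    }

    /** The same key always maps to the same partition. */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Throws RejectedExecutionException if the partition's queue is full. */
    void submit(String key, Runnable task) {
        partitions.get(partitionFor(key)).execute(task);
    }

    /** Stops accepting work and waits for queued tasks; returns false on timeout or interrupt. */
    boolean shutdownAndWait(long timeoutMillis) {
        for (ExecutorService partition : partitions) {
            partition.shutdown();
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            for (ExecutorService partition : partitions) {
                long remaining = Math.max(deadline - System.nanoTime(), 0L);
                if (!partition.awaitTermination(remaining, TimeUnit.NANOSECONDS)) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The trade-off matches the one in the list: a hot key is limited to one worker's throughput, so parallelism comes from having many keys spread over enough partitions.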

Outbox pattern (sketch)

@Transactional
void placeOrder(Order o) {
  orderRepo.save(o);
  outboxRepo.save(new OutboxEvent("OrderPlaced", o.getId()));
}
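// The order row and the outbox row commit in the same DB transaction, so there is no dual-write race.
// A separate poller later reads the outbox table and publishes the events to Kafka.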
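The poller half of the pattern is not shown above, so here is a minimal in-memory sketch of what one polling pass might look like. Everything in it is an assumption for illustration: the list stands in for the outbox table, the Consumer stands in for a Kafka producer, and the names are hypothetical. It delivers at least once, so consumers still need to dedupe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class OutboxPollerSketch {

    static final class OutboxRow {
        final long id;
        final String type;
        final String aggregateId;
        boolean published;

        OutboxRow(long id, String type, String aggregateId) {
            this.id = id;
            this.type = type;
            this.aggregateId = aggregateId;
        }
    }

    private final List<OutboxRow> outboxTable = new ArrayList<>(); // stand-in for the DB table
    private long nextId = 1;

    /** In the real flow this insert happens inside the business transaction. */
    synchronized void insert(String type, String aggregateId) {
        outboxTable.add(new OutboxRow(nextId++, type, aggregateId));
    }

    /** One polling pass: publish unpublished rows in insertion order; returns how many were sent. */
    synchronized int pollOnce(Consumer<OutboxRow> publisher) {
        int sent = 0;
        for (OutboxRow row : outboxTable) {
            if (row.published) {
                continue;
            }
            try {
                publisher.accept(row);
            } catch (RuntimeException e) {
                break;                 // stop to preserve order; the row is retried on the next pass
            }
            row.published = true;      // marked only after a successful publish (at-least-once)
            sent++;
        }
        return sent;
    }
}
```

If the process crashes between a successful publish and marking the row, the event is sent again on the next pass, which is why the consumer side pairs naturally with idempotency keys.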

Operational guardrails

When to stop adding concurrency inside one JVM

If profiling shows lock contention, GC overhead from huge caches, or a DB that is always the bottleneck, scale the data tier (read replicas, sharding, denormalized reads) or move work off the request path before adding another thread pool.

Interview one-liner

“I keep the app tier stateless, partition by key, cache reads with clear invalidation, push slow work to queues, and protect each dependency with timeouts, bulkheads, and circuit breakers—then prove it with load tests against real SLOs and connection limits.”
