Handle very high concurrency safely
Scenario
You are designing (or rescuing) a service that must serve tens of thousands of RPS with predictable latency. Throwing more threads at the problem caused pool exhaustion, data races, and DB meltdown. You need a deliberate architecture: where state lives, how work is bounded, and which primitives actually scale.
After reading, you should be able to:
- Choose between scaling out, async boundaries, and faster synchronous paths.
- Apply bulkheads, backpressure, idempotency, and caching without unsafe shared state.
- Match Java primitives (virtual threads, concurrent collections, queues) to workload shape.
- Validate design with load tests and SLOs before production traffic.
Why — concurrency is a system design problem
High concurrency is not “use more threads.” It is maximizing useful work per resource while bounding failure blast radius. Every shared resource—CPU, heap, DB connections, locks, downstream APIs—has a ceiling. Good design makes contention rare, queues explicit, and overload visible early.
Goals (define before picking patterns)
- Throughput — requests or events completed per second.
- Latency SLO — p99 under target at expected peak (and during spikes).
- Correctness — no races, idempotent retries, exactly-once or at-least-once clarity.
- Availability — degrade gracefully; one slow dependency must not take down all paths.
What breaks naive scale-up
| Mistake | Result |
|---|---|
| Mutable shared caches in the JVM | Races, GC pressure, pod skew |
| Unbounded in-memory queues | OOM, GC storms |
maxThreads ≫ DB pool | Threads blocked on Hikari |
| Global locks on hot paths | BLOCKED threads, low CPU utilization |
| Sync call chain to many services | Latency multiplies; cascading timeouts |
| No backpressure | Retry storms amplify load |
What — design checklist and pattern map
Layered decisions (top to bottom)
- Stateless application tier — session and authority in DB/cache/ token; any pod can serve any request. Enables horizontal scale.
-
Partition work
— shard by
userId,tenantId, or Kafka partition key so hot keys do not serialize the world. - Separate read and write paths — read replicas, materialized views, or CQRS when read QPS ≫ write QPS.
- Async for slow or spiky work — HTTP 202 + queue (SQS, Kafka, Rabbit) for email, reports, fraud scoring; API stays fast.
-
Cache with a strategy
— TTL, stampede protection (
computeIfAbsent, single-flight), invalidation rules; not “cache everything forever.” - Protect dependencies — timeouts, circuit breakers (Resilience4j), bulkheads per downstream.
-
Idempotency keys
— safe client retries; dedupe table or unique constraint on
Idempotency-Key. - Admission control — rate limits at gateway; max queue depth; reject or shed load before JVM dies.
Pattern → when to use
| Pattern | Use when | Watch out |
|---|---|---|
| Horizontal scale (K8s HPA) | CPU-bound or stateless I/O; traffic grows | DB connection storm; cache coherence |
| Virtual threads (Java 21+) | Many blocking I/O calls per request | Still bound DB/CPU; pin carriers for native/JNI |
| Thread pool + bounded queue | CPU-bound batch, legacy servlet stack | Size vs exhaustion; always timeouts |
| Reactive (WebFlux) | High fan-out I/O, team fluency in reactive | Never block event loop; steep debug cost |
| Message queue | Decouple producers/consumers; absorb spikes | Ordering, poison messages, lag alerts |
| Bulkhead | One slow client or feature must not starve others | Separate executors / connection pools |
| Circuit breaker | Flaky downstream | Half-open storms; need fallbacks |
| Optimistic locking | Contended rows, retry OK | Client retry with backoff |
| Actor / single-writer per key | Complex in-memory state per entity | Operational complexity |
Java primitives (safe defaults)
- Prefer
ConcurrentHashMap,LongAdder, immutable records, per-request objects. - Avoid shared
HashMap, static mutable collections,parallelStream()on blocking I/O (starvesForkJoinPool). - Queues —
ArrayBlockingQueuewith fixed capacity; measure reject vs block policy. - HTTP clients — connection pool limits match expected concurrency to each host.
Capacity sketch (before build)
Peak RPS × p99 latency (s) ≈ in-flight requests per instance × replicas ≤ downstream limits (DB connections, partner API QPS) Example: 5k RPS, p99 = 200ms → ~1000 in-flight per region If each needs 1 DB conn for 200ms → need ~1000 conns OR lower latency / cache / async
How — implement and prove the design
Reference request path (synchronous API)
- Gateway: auth, rate limit, request size cap.
- Handler: validate, idempotency check, no shared mutable state.
- Read: cache → read replica; write: primary with short transaction.
- Downstream calls: dedicated client with timeout + circuit breaker + bulkhead.
- Response: consistent error model; retry only on idempotent operations.
Reference path (async / event-driven)
- API persists intent + publishes event (outbox pattern avoids lost messages).
- Consumers scale independently; partition count sets max parallelism per key.
- Dead-letter queue + replay tooling for poison messages.
- Status API or webhook for client completion.
Outbox pattern (sketch)
@Transactional
void placeOrder(Order o) {
orderRepo.save(o);
outboxRepo.save(new OutboxEvent("OrderPlaced", o.getId()));
}
// Separate poller publishes to Kafka — same DB transaction, no dual-write race
Operational guardrails
- Load test at 1.5–2× expected peak; measure p99, error rate, pool busy %, DB connections.
- Chaos: kill dependency, verify breaker and degradation—not total hang.
- Alerts: queue lag, breaker open, thread pool > 85%, idempotency conflict rate.
- Runbooks: scale replicas, disable feature flag, increase cache TTL, shed non-critical routes.
When to stop adding concurrency inside one JVM
If profiling shows lock contention, GC overhead from huge caches, or DB is always the bottleneck— scale the data tier (read replicas, sharding, denormalized reads) or move work off the request path before adding another thread pool.
Interview one-liner
“I keep the app tier stateless, partition by key, cache reads with clear invalidation, push slow work to queues, and protect each dependency with timeouts, bulkheads, and circuit breakers—then prove it with load tests against real SLOs and connection limits.”