Handle very high concurrency safely

Scenario

You are designing (or rescuing) a service that must serve tens of thousands of RPS with predictable latency. Throwing more threads at the problem caused pool exhaustion, data races, and DB meltdown. You need a deliberate architecture: where state lives, how work is bounded, and which primitives actually scale.

After reading, you should be able to:

Choose between scaling out, async boundaries, and faster synchronous paths.
Apply bulkheads, backpressure, idempotency, and caching without unsafe shared state.
Match Java primitives (virtual threads, concurrent collections, queues) to workload shape.
Validate design with load tests and SLOs before production traffic.

Why — concurrency is a system design problem

High concurrency is not “use more threads.” It is maximizing useful work per resource while bounding failure blast radius. Every shared resource—CPU, heap, DB connections, locks, downstream APIs—has a ceiling. Good design makes contention rare, queues explicit, and overload visible early.

Goals (define before picking patterns)

Throughput — requests or events completed per second.
Latency SLO — p99 under target at expected peak (and during spikes).
Correctness — no races, idempotent retries, exactly-once or at-least-once clarity.
Availability — degrade gracefully; one slow dependency must not take down all paths.

What breaks naive scale-up

Mistake	Result
Mutable shared caches in the JVM	Races, GC pressure, pod skew
Unbounded in-memory queues	OOM, GC storms
`maxThreads` ≫ DB pool	Threads blocked on Hikari
Global locks on hot paths	BLOCKED threads, low CPU utilization
Sync call chain to many services	Latency multiplies; cascading timeouts
No backpressure	Retry storms amplify load

What — design checklist and pattern map

Layered decisions (top to bottom)

Stateless application tier — session and authority live in the DB, cache, or token; any pod can serve any request. Enables horizontal scale.
Partition work — shard by userId, tenantId, or Kafka partition key so hot keys do not serialize the world.
Separate read and write paths — read replicas, materialized views, or CQRS when read QPS ≫ write QPS.
Async for slow or spiky work — HTTP 202 + queue (SQS, Kafka, Rabbit) for email, reports, fraud scoring; API stays fast.
Cache with a strategy — TTL, stampede protection (computeIfAbsent, single-flight), invalidation rules; not “cache everything forever.”
Protect dependencies — timeouts, circuit breakers (Resilience4j), bulkheads per downstream.
Idempotency keys — safe client retries; dedupe table or unique constraint on Idempotency-Key.
Admission control — rate limits at gateway; max queue depth; reject or shed load before JVM dies.
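
The single-flight idea from the caching item can be sketched as below. This is a minimal illustration, not a production cache: SingleFlightLoader and SingleFlightDemo are hypothetical names, the 300 ms sleep stands in for a slow database read, and a TTL cache would still need to be layered on top. It uses putIfAbsent with a future rather than computeIfAbsent so the slow load does not run while the map holds an internal lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper: concurrent callers asking for the same key share one load. */
final class SingleFlightLoader<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    SingleFlightLoader(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            return existing.join(); // someone else is already loading this key
        }
        try {
            V value = loader.apply(key); // slow call runs outside any map lock
            mine.complete(value);
            return value;
        } catch (RuntimeException | Error e) {
            mine.completeExceptionally(e); // waiters see the failure too
            throw e;
        } finally {
            inFlight.remove(key, mine); // entry exists only while the load is in flight
        }
    }
}

final class SingleFlightDemo {
    /** Fires 4 concurrent lookups for one key and returns how many loads actually ran. */
    static int loadsForConcurrentCallers() {
        AtomicInteger loads = new AtomicInteger();
        SingleFlightLoader<String, String> loader = new SingleFlightLoader<>(key -> {
            loads.incrementAndGet();
            try {
                Thread.sleep(300); // stand-in for a slow database read
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "value-for-" + key;
        });
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<String>> calls = new ArrayList<>();
            for (int i = 0; i < 4; i++) {
                calls.add(CompletableFuture.supplyAsync(() -> loader.get("user:42"), pool));
            }
            calls.forEach(CompletableFuture::join);
            return loads.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("loads = " + loadsForConcurrentCallers());
    }
}
```

With four callers arriving while the first load is still running, the backing store should see one read instead of four.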

Pattern → when to use

Pattern	Use when	Watch out
Horizontal scale (K8s HPA)	CPU-bound or stateless I/O; traffic grows	DB connection storm; cache coherence
Virtual threads (Java 21+)	Many blocking I/O calls per request	Still bounded by DB and CPU capacity; native/JNI calls (and synchronized blocks on JDK 21 to 23) pin carrier threads
Thread pool + bounded queue	CPU-bound batch, legacy servlet stack	Size vs exhaustion; always timeouts
Reactive (WebFlux)	High fan-out I/O, team fluency in reactive	Never block event loop; steep debug cost
Message queue	Decouple producers/consumers; absorb spikes	Ordering, poison messages, lag alerts
Bulkhead	One slow client or feature must not starve others	Separate executors / connection pools
Circuit breaker	Flaky downstream	Half-open storms; need fallbacks
Optimistic locking	Conflicts on a row are occasional, retry OK	Retries pile up under heavy contention; client retry with backoff
Actor / single-writer per key	Complex in-memory state per entity	Operational complexity
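
As one illustration of the bulkhead row, a semaphore can cap concurrent calls into a single downstream and fail fast when the compartment is full. This is a sketch under stated assumptions: Bulkhead and paymentsBulkhead are hypothetical names, and a library such as Resilience4j would normally provide this in a real service.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical semaphore bulkhead: caps concurrent calls to one downstream. */
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free; otherwise returns the fallback without waiting. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // compartment full: fail fast instead of queuing
        }
        try {
            return action.get();
        } finally {
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        Bulkhead paymentsBulkhead = new Bulkhead(1);
        // The inner call runs while the outer call still holds the only permit.
        String result = paymentsBulkhead.call(
                () -> paymentsBulkhead.call(() -> "inner ran", () -> "inner rejected"),
                () -> "outer rejected");
        System.out.println(result); // prints "inner rejected"
    }
}
```

Giving each downstream its own instance keeps one slow dependency from consuming the capacity that other paths need.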

Java primitives (safe defaults)

Prefer ConcurrentHashMap, LongAdder, immutable records, per-request objects.
Avoid shared HashMap, static mutable collections, parallelStream() on blocking I/O (it starves the shared common ForkJoinPool).
Queues — ArrayBlockingQueue with fixed capacity; choose the reject-vs-block policy deliberately and measure it under load.
HTTP clients — size connection pool limits to the expected concurrency per host.
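
A bounded queue with an explicit reject policy might look like the following sketch. BoundedExecutorDemo and countRejected are hypothetical names; the tasks block on a latch only so the behavior is deterministic for the demonstration. LongAdder counts rejections so the overload is visible as a metric.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

final class BoundedExecutorDemo {
    /** Submits blocked tasks to a fixed pool with a bounded queue; returns how many were rejected. */
    static long countRejected(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // fixed capacity: no unbounded growth
                new ThreadPoolExecutor.AbortPolicy());     // reject instead of blocking the caller
        LongAdder rejected = new LongAdder();
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // hold the worker so the queue fills up
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected.increment(); // in a service: return 429/503 and record a metric
                }
            }
            return rejected.sum();
        } finally {
            release.countDown(); // let the accepted tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 2 workers + 3 queue slots accept 5 of 10 tasks; the other 5 are rejected.
        System.out.println("rejected = " + countRejected(2, 3, 10));
    }
}
```

Swapping AbortPolicy for CallerRunsPolicy is the usual way to trade rejection for caller-side slowdown; which one fits depends on whether the caller can tolerate blocking.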

Capacity sketch (before build)

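Per-instance peak RPS × p99 latency (s) ≈ in-flight requests per instance (Little's Law; p99 gives a conservative bound, mean latency gives the average)
In-flight per instance × replicas ≤ downstream limits (DB connections, partner API QPS)

Example: 5k RPS across a region, p99 = 200 ms → 5000 × 0.2 ≈ 1000 in-flight in that region
If each holds 1 DB connection for the full 200 ms → ~1000 connections, OR lower latency / cache / async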
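
The same arithmetic as a small helper, in case it is useful to run against several scenarios. CapacitySketch and its method names are hypothetical; the 8 replicas, the 20 ms cached latency, and the 500-connection limit are made-up inputs for illustration only.

```java
/** Back-of-envelope capacity math based on Little's Law (in-flight = rate x latency). */
final class CapacitySketch {
    /** Requests in flight at a given rate and latency, rounded up. */
    static long inFlight(long requestsPerSecond, long latencyMillis) {
        return (requestsPerSecond * latencyMillis + 999) / 1000;
    }

    /** Share of the in-flight load each replica carries if traffic spreads evenly, rounded up. */
    static long perReplica(long totalInFlight, int replicas) {
        return (totalInFlight + replicas - 1) / replicas;
    }

    /** True if the summed per-replica pools stay within the downstream connection limit. */
    static boolean fitsDownstream(long poolSizePerReplica, int replicas, long downstreamLimit) {
        return poolSizePerReplica * replicas <= downstreamLimit;
    }

    public static void main(String[] args) {
        long total = inFlight(5_000, 200);  // the example above: 1000 in flight
        long cached = inFlight(5_000, 20);  // assumed: caching cuts DB time to 20 ms, giving 100
        System.out.println(total + " in flight, " + perReplica(total, 8) + " per replica across 8 replicas");
        System.out.println(cached + " connections needed at 20 ms per request");
        System.out.println("fits a 500-connection limit: " + fitsDownstream(perReplica(total, 8), 8, 500));
    }
}
```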

How — implement and prove the design

Reference request path (synchronous API)

Gateway: auth, rate limit, request size cap.
Handler: validate, idempotency check, no shared mutable state.
Read: cache → read replica; write: primary with short transaction.
Downstream calls: dedicated client with timeout + circuit breaker + bulkhead.
Response: consistent error model; retry only on idempotent operations.
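
The idempotency check in the handler step could be sketched as follows. IdempotentHandler is a hypothetical name, and the in-memory map is only a stand-in: in a stateless tier the stored responses would live in a shared dedupe table with a unique constraint on the key, not in one JVM.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class IdempotentHandler {
    // In-memory stand-in (assumption) for a dedupe table with a unique constraint on the key.
    private final ConcurrentHashMap<String, String> responsesByKey = new ConcurrentHashMap<>();

    /** Runs the operation at most once per key; retries get the stored response. */
    String handle(String idempotencyKey, Supplier<String> operation) {
        return responsesByKey.computeIfAbsent(idempotencyKey, key -> operation.get());
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        AtomicInteger charges = new AtomicInteger();
        Supplier<String> chargeCard = () -> "charge-" + charges.incrementAndGet();

        String first = handler.handle("key-123", chargeCard);
        String retry = handler.handle("key-123", chargeCard); // client retry with the same key
        System.out.println(first + " " + retry + " charges=" + charges.get());
        // prints "charge-1 charge-1 charges=1"
    }
}
```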

Reference path (async / event-driven)

API persists intent + publishes event (outbox pattern avoids lost messages).
Consumers scale independently; partition count caps consumer parallelism within a group, and messages for one key stay ordered on a single partition.
Dead-letter queue + replay tooling for poison messages.
Status API or webhook so the client can learn when the work completes.
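
Poison-message handling on the consumer side can be sketched with bounded retries and a dead-letter list. Everything here is illustrative: RetryingConsumer and Message are hypothetical types, the Deque stands in for a broker queue, and the list stands in for a real dead-letter queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical single-threaded consumer loop with bounded retries and a dead-letter list. */
final class RetryingConsumer {
    static final class Message {
        final String payload;
        int attempts;

        Message(String payload) {
            this.payload = payload;
        }
    }

    private final int maxAttempts;
    final List<String> deadLetters = new ArrayList<>(); // stand-in for a real DLQ

    RetryingConsumer(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Drains the queue; a failing message is requeued until maxAttempts, then dead-lettered. */
    void drain(Deque<Message> queue, Consumer<String> handler) {
        Message message;
        while ((message = queue.pollFirst()) != null) {
            message.attempts++;
            try {
                handler.accept(message.payload);
            } catch (RuntimeException e) {
                if (message.attempts >= maxAttempts) {
                    deadLetters.add(message.payload); // park it so it stops blocking the queue
                } else {
                    queue.addLast(message); // retry later, behind other work
                }
            }
        }
    }

    public static void main(String[] args) {
        Deque<Message> queue = new ArrayDeque<>();
        queue.add(new Message("order-1"));
        queue.add(new Message("poison"));
        RetryingConsumer consumer = new RetryingConsumer(3);
        consumer.drain(queue, payload -> {
            if (payload.equals("poison")) {
                throw new IllegalStateException("cannot parse " + payload);
            }
        });
        System.out.println("dead letters: " + consumer.deadLetters);
    }
}
```

Requeuing at the back changes the order in which messages are processed, so a design that depends on per-key ordering would need to retry in place or pause that key instead.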

Outbox pattern (sketch)

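// Sketch: orderRepo, outboxRepo, and OutboxEvent are application-defined.
@Transactional
void placeOrder(Order o) {
  orderRepo.save(o);                                           // business row
  outboxRepo.save(new OutboxEvent("OrderPlaced", o.getId()));  // event row
}
// Both rows commit in the same DB transaction, so there is no dual-write race.
// A separate poller reads the outbox table and publishes to Kafka afterwards.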
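
The poller side is not shown in the sketch above; one possible shape is below. It is an assumption-heavy illustration: OutboxPoller and OutboxRow are hypothetical, the List stands in for the outbox table, and the Predicate stands in for a broker client that returns true once the message is acknowledged. A row is marked published only after the publish succeeds, so delivery is at-least-once and consumers should dedupe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class OutboxPoller {
    /** Hypothetical outbox row; a real one would be a table row read in id order. */
    static final class OutboxRow {
        final long id;
        final String type;
        final String aggregateId;
        boolean published;

        OutboxRow(long id, String type, String aggregateId) {
            this.id = id;
            this.type = type;
            this.aggregateId = aggregateId;
        }
    }

    /** Publishes pending rows in order; stops at the first failure so the next poll retries it. */
    static int publishPending(List<OutboxRow> table, Predicate<OutboxRow> publisher) {
        int sent = 0;
        for (OutboxRow row : table) {
            if (row.published) {
                continue;
            }
            if (!publisher.test(row)) {
                break; // broker unavailable: keep order, retry on the next poll
            }
            row.published = true; // mark only after the broker accepted the message
            sent++;
        }
        return sent;
    }

    public static void main(String[] args) {
        List<OutboxRow> table = new ArrayList<>();
        table.add(new OutboxRow(1, "OrderPlaced", "order-1"));
        table.add(new OutboxRow(2, "OrderPlaced", "order-2"));
        int sent = publishPending(table, row -> true); // pretend every publish succeeds
        System.out.println("published " + sent + " events");
    }
}
```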

Operational guardrails

Load test at 1.5–2× expected peak; measure p99, error rate, pool busy %, DB connections.
Chaos: kill dependency, verify breaker and degradation—not total hang.
Alerts: queue lag, breaker open, thread pool utilization > 85%, idempotency conflict rate.
Runbooks: scale replicas, disable feature flag, increase cache TTL, shed non-critical routes.
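
For the p99 measurement in the load-test step, a nearest-rank percentile over recorded latencies is one simple way to check a run against the SLO. LatencyStats and meetsSlo are hypothetical names; real load-test tools usually report percentiles directly, so this is only a sketch of what the number means.

```java
import java.util.Arrays;

final class LatencyStats {
    /** Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it. */
    static long percentile(long[] samplesMillis, int pct) {
        if (samplesMillis.length == 0 || pct < 1 || pct > 100) {
            throw new IllegalArgumentException("need samples and a percentile in 1..100");
        }
        long[] sorted = samplesMillis.clone(); // do not reorder the caller's data
        Arrays.sort(sorted);
        int rank = (pct * sorted.length + 99) / 100; // ceil(pct/100 * n) in integer math
        return sorted[rank - 1];
    }

    /** True if the run's p99 is at or under the target. */
    static boolean meetsSlo(long[] samplesMillis, long p99TargetMillis) {
        return percentile(samplesMillis, 99) <= p99TargetMillis;
    }

    public static void main(String[] args) {
        long[] samples = new long[100];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = i + 1; // synthetic latencies: 1..100 ms
        }
        System.out.println("p99 = " + percentile(samples, 99) + " ms, meets 200 ms SLO: " + meetsSlo(samples, 200));
    }
}
```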

When to stop adding concurrency inside one JVM

If profiling shows lock contention, GC overhead from huge caches, or a DB that is always the bottleneck, scale the data tier (read replicas, sharding, denormalized reads) or move work off the request path before adding another thread pool.

Interview one-liner

“I keep the app tier stateless, partition by key, cache reads with clear invalidation, push slow work to queues, and protect each dependency with timeouts, bulkheads, and circuit breakers—then prove it with load tests against real SLOs and connection limits.”