More instances but latency did not improve
Scenario
You doubled Kubernetes replicas (or EC2 instances) expecting half the latency or double the throughput. Instead, p99 is unchanged or worse, and database errors have increased. Horizontal scale only helps when the work is parallelizable and the bottleneck moves with you. Often it stays on one database, one partner API, or one hot partition.
After reading, you should be able to:
- Explain when scale-out helps and when it only adds cost (Amdahl's law, serial resources).
- Find the real bottleneck with traces, DB metrics, and throughput-per-pod math.
- Avoid multiplying DB connections and lock contention with more pods.
- Scale the correct tier: data, cache, queue, or fix serial code.
Why — the slowest serial step wins
Adding app servers increases concurrent clients to shared resources. If every request needs the same PostgreSQL primary, the same Redis hot key, or the same payment API quota, more pods only contend harder—latency flatlines or degrades. This is the distributed cousin of lock contention and row lock waits.
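Amdahl's law puts a number on that ceiling. The following is a minimal sketch, assuming a fraction of each request's work spreads across pods while the rest is serialized on a shared resource. The function name `amdahl_speedup` and the 60/40 split are illustrative assumptions, not measurements.

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Upper bound on speedup with n workers when only parallel_fraction of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)


if __name__ == "__main__":
    # Hypothetical workload: 60% of request time is parallelizable app work,
    # 40% is serialized on a shared database primary.
    for pods in (1, 2, 4, 8, 16):
        print(f"{pods:>2} pods: at most {amdahl_speedup(0.6, pods):.2f}x")
```

With 40% of the work serialized, the bound approaches 1 / 0.4 = 2.5x no matter how many pods are added, which is why doubling replicas rarely doubles throughput.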
Why more pods do not help (common)
| Bottleneck | What happens when you scale app |
|---|---|
| Database primary | More connections, lock wait, CPU on DB — slow queries, pool limits |
| Hot row / partition | All pods update same key — row locks |
| Downstream rate limit | Partner 429; queues grow in each pod |
| Stateful sessions | Load not evenly spread; one pod hot |
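| Single-threaded dependency | Legacy mainframe or license-limited API serves a fixed number of calls; extra pods just wait in line |
| Saturated network/disk | Shared NFS or logging pipeline is already at capacity; every new pod competes for it |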
| Coordination overhead | More cache misses, stampede on cold pods |
| Broken pods scaled | HPA adds unhealthy replicas — leaks, CPU waste |
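Throughput test: if total RPS scales linearly with pods but latency per request stays flat, scale-out is adding capacity and the remaining latency is per-request cost (a slow query or downstream call) that more pods cannot shorten. If total RPS stays flat as pods are added, you have hit a serial limit.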
What — prove where time and capacity go
- Measure scaling efficiency: `efficiency = (RPS at N pods) / (RPS at 1 pod) / N`
  - ≈ 1.0 → good horizontal scale
  - ≪ 1.0 → serial bottleneck or contention
- Compare p99 per pod vs aggregate: if every pod's p99 is high, the problem is per-request (slow code or DB). If per-pod p99 is fine but the aggregate is bad, suspect load imbalance or a shared resource.
- Distributed trace: look for the same long span (DB, external HTTP) on every trace regardless of pod count.
- DB connection budget: `total ≈ pods × hikari.maximumPoolSize`, compared against Postgres `max_connections`. New pods → “too many clients” or pool wait (pool guide).
- DB metrics under scale test: CPU, IOPS, lock wait, replication lag. Scale the DB before adding more app pods.
- Load balancer — even distribution? sticky sessions pinning traffic?
- Autoscaling signal: scaling on CPU while pods are waiting on the DB misleads HPA, because time spent blocked on I/O does not show up as CPU load (high CPU low traffic vs I/O wait).
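The scaling-efficiency formula from the first step above can be computed directly from load-test results. This is a minimal sketch; the function name `scaling_efficiency` and the sample RPS figures are hypothetical, chosen only to show the shape of the output.

```python
def scaling_efficiency(rps_by_pods: dict[int, float]) -> dict[int, float]:
    """Return efficiency = RPS(N) / RPS(1) / N for each measured pod count N."""
    if 1 not in rps_by_pods:
        raise ValueError("need a 1-pod baseline measurement")
    baseline = rps_by_pods[1]
    if baseline <= 0:
        raise ValueError("baseline RPS must be positive")
    return {n: rps / baseline / n for n, rps in sorted(rps_by_pods.items())}


if __name__ == "__main__":
    # Hypothetical load-test results: total RPS at each replica count.
    measured = {1: 200.0, 2: 380.0, 4: 520.0, 8: 540.0}
    for pods, eff in scaling_efficiency(measured).items():
        print(f"{pods} pods: efficiency {eff:.2f}")
```

In this made-up data set, efficiency falls from 0.95 at 2 pods to about 0.34 at 8 pods, the pattern you would expect when a shared resource caps total throughput.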
Decision tree
- Total RPS flat, latency up? → DB/downstream limit or hot partition
- Total RPS grows, latency flat? → Per-request work unchanged; need faster queries or cache
- Total RPS grows, latency improves? → Scale was working; check cost/efficiency only
- Latency worse after scale? → Contention, connection storm, or bad pods added
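The branches above can be encoded as a small classifier for before/after load-test numbers. This is a sketch under stated assumptions: the function name `diagnose`, the 5% `tolerance` that separates "flat" from "changed", and the order in which overlapping branches are checked are all choices made for illustration.

```python
def diagnose(rps_before: float, rps_after: float,
             p99_before_ms: float, p99_after_ms: float,
             tolerance: float = 0.05) -> str:
    """Classify a before/after scale-out measurement using the decision tree."""
    if rps_before <= 0 or p99_before_ms <= 0:
        raise ValueError("baseline RPS and p99 must be positive")

    rps_change = (rps_after - rps_before) / rps_before
    p99_change = (p99_after_ms - p99_before_ms) / p99_before_ms

    # Changes within +/- tolerance are treated as "flat" (assumed threshold).
    rps_grew = rps_change > tolerance
    latency_up = p99_change > tolerance
    latency_down = p99_change < -tolerance

    if latency_up and not rps_grew:
        return "DB/downstream limit or hot partition"
    if latency_up:
        return "contention, connection storm, or bad pods added"
    if rps_grew and latency_down:
        return "scale was working; check cost/efficiency only"
    if rps_grew:
        return "per-request work unchanged; need faster queries or cache"
    return "no branch matched; re-measure at fixed client concurrency"


if __name__ == "__main__":
    # Hypothetical numbers: throughput barely moved, p99 rose from 200 ms to 320 ms.
    print(diagnose(1000, 1010, 200, 320))
```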
How — scale the right thing
1. Fix per-request cost first
- Remove N+1 queries, add missing indexes, and cache hot paths (DB slow).
- Timeouts and circuit breakers so one slow dependency does not block all threads — concurrency design.
2. Scale data and async tiers
- Read replicas for read-heavy APIs — replica routing.
- Shard by tenant/user where writes dominate.
- Queue absorbing spikes; workers scale independently.
- Add PgBouncer or move to a larger RDS instance before adding pod #50.
3. Safe app scaling checklist
- Stateless pods; no local authoritative state.
- Pool size × max pods ≤ DB connection budget.
- HPA on RPS or custom latency SLO, not CPU alone.
- Readiness excludes broken instances from LB.
- Load test 2× pods in staging with production-like data volume.
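The pool-size rule in the checklist is simple arithmetic worth running before raising the HPA maximum. A minimal sketch, assuming each pod opens one pool of `pool_size` connections and that `reserved` connections are held back for admin sessions and migrations; the function names and sample values are hypothetical.

```python
def max_safe_pods(max_connections: int, pool_size: int, reserved: int = 0) -> int:
    """Largest pod count whose combined pools fit in the DB connection budget."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    budget = max_connections - reserved
    return max(budget // pool_size, 0)


def fits_budget(pods: int, pool_size: int, max_connections: int, reserved: int = 0) -> bool:
    """True when pods * pool_size stays within max_connections minus reserved slots."""
    return pods * pool_size <= max_connections - reserved


if __name__ == "__main__":
    # Illustrative values: 100 DB connections, 10 per pod, 10 held back for ops.
    print("max safe pods:", max_safe_pods(100, 10, reserved=10))
    print("12 pods fit:", fits_budget(12, 10, 100, reserved=10))
```

If a connection pooler such as PgBouncer sits in front of the database, the budget to compare against is the pooler's limit on client connections rather than the database's own, so the inputs change but the check does not.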
4. When to stop adding pods
If one more pod adds less than 5% throughput or raises the error rate, stop and invest in the bottleneck tier. Document the maximum efficient replica count in the runbook.
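The stop rule can be applied mechanically to a series of load-test runs. The sketch below reads "marginal pod" as the relative throughput gain per added pod between consecutive measurements, which is one reasonable interpretation rather than the only one. The function name `max_efficient_replicas` and the sample data are hypothetical.

```python
def max_efficient_replicas(rps_by_pods: dict[int, float],
                           error_rate_by_pods: dict[int, float],
                           min_gain: float = 0.05) -> int:
    """Return the last replica count reached before a step fails the stop rule."""
    counts = sorted(rps_by_pods)
    best = counts[0]
    for prev, cur in zip(counts, counts[1:]):
        added = cur - prev
        # Relative throughput gain per pod added in this step.
        gain_per_pod = (rps_by_pods[cur] - rps_by_pods[prev]) / rps_by_pods[prev] / added
        errors_rose = error_rate_by_pods[cur] > error_rate_by_pods[prev]
        if gain_per_pod < min_gain or errors_rose:
            break
        best = cur
    return best


if __name__ == "__main__":
    # Hypothetical measurements: total RPS and error rate at each replica count.
    rps = {1: 200.0, 2: 380.0, 3: 520.0, 4: 540.0, 5: 545.0}
    errors = {1: 0.001, 2: 0.001, 3: 0.001, 4: 0.002, 5: 0.010}
    print("max efficient replicas:", max_efficient_replicas(rps, errors))
```

The returned value is the number to record in the runbook; rerun the measurement after any change to the bottleneck tier, since the efficient maximum moves with it.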
Verify
- Load test: 1 → 2 → 4 pods; record RPS and p99 at fixed client concurrency.
- DB connections and lock waits stay stable and within the allowed range.
- Trace waterfall shows shorter or parallelized work, not duplicated contention.
Interview one-liner
“I check whether total throughput scales with pod count and where traces spend time. If the DB or a hot key is serial, more app instances only add connections and contention—I fix query cost, pool budget, and scale the database, cache, or queue instead of blindly adding pods.”