More instances but latency did not improve

Scenario

You doubled Kubernetes replicas (or EC2 instances) expecting half the latency or double the throughput. Instead, p99 is unchanged or worse, and database errors have increased. Horizontal scaling only helps when the work is parallelizable and the bottleneck scales with you. Often it stays put: one database, one partner API, or one hot partition.

Why — the slowest serial step wins

Adding app servers increases the number of concurrent clients hitting shared resources. If every request needs the same PostgreSQL primary, the same Redis hot key, or the same payment API quota, more pods only contend harder, and latency flatlines or degrades. This is the distributed cousin of lock contention and row lock waits.
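
One way to put a number on "the slowest serial step wins" is Amdahl's law, which is not named in the text above and is used here only as a rough model. The sketch assumes a fixed fraction of each request is serialized on a shared resource; `serial_fraction` is a hypothetical input you would estimate from traces. It gives an upper bound only: it does not model contention making things worse.

```python
def amdahl_speedup(n_pods: int, serial_fraction: float) -> float:
    """Upper bound on throughput speedup at n_pods when serial_fraction
    of each request's work cannot run in parallel (Amdahl's law)."""
    if n_pods < 1:
        raise ValueError("n_pods must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_pods)


if __name__ == "__main__":
    # Assumed example: half of each request is spent on one shared database.
    for pods in (1, 2, 4, 8, 16):
        print(pods, round(amdahl_speedup(pods, 0.5), 2))
```

Under this model, a request that spends half its time on a serialized resource can never reach a 2x speedup, no matter how many pods are added.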

Why more pods do not help (common)

Bottleneck                    What happens when you scale the app
Database primary              More connections, lock wait, CPU on DB; slow queries, pool limits
Hot row / partition           All pods update same key; row locks
Downstream rate limit         Partner 429; queues grow in each pod
Stateful sessions             Load not evenly spread; one pod hot
Single-threaded dependency    Legacy mainframe, license-limited API
Saturated network/disk        Shared NFS, logging pipeline
Coordination overhead         More cache misses, stampede on cold pods
Broken pods scaled            HPA adds unhealthy replicas; leaks, CPU waste

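Throughput test: if total RPS scales linearly with pods but latency per request is flat, scaling is working for throughput and the remaining latency is per-request cost (slow queries or downstream calls) that more pods cannot shorten. If total RPS is flat, you have hit a serial limit.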

What — prove where time and capacity go

  1. Measure scaling efficiency
    efficiency = (RPS at N pods) / (RPS at 1 pod) / N
    # ≈ 1.0 → good horizontal scale
    # ≪ 1.0 → serial bottleneck or contention
  2. Compare per-pod p99 with aggregate p99. If every pod’s p99 is high, the problem is per-request (slow code or DB). If most pods look fine but the aggregate is bad, suspect load imbalance or a shared resource.
  3. Distributed trace: look for the same long span (DB, external HTTP) on every trace regardless of pod count.
  4. DB connection budget
    total ≈ pods × hikari.maximumPoolSize
    # vs Postgres max_connections
    If the total exceeds the budget, new pods fail with “too many clients” or stall waiting on the pool.
  5. DB metrics under scale test: watch CPU, IOPS, lock waits, and replication lag. If they saturate, scale the DB before adding more app pods.
  6. Load balancer: is traffic evenly distributed, or are sticky sessions pinning it to a few pods?
  7. Autoscaling signal: CPU is a misleading HPA signal when pods are mostly waiting on the DB. Distinguish high CPU under low traffic from I/O wait.
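
The scaling efficiency formula from step 1 can be sketched as a small helper. The measurements below are made-up load-test numbers for illustration, not results from a real system, and the function name is hypothetical.

```python
def scaling_efficiency(rps_at_n: float, rps_at_1: float, n_pods: int) -> float:
    """efficiency = (RPS at N pods) / (RPS at 1 pod) / N"""
    if n_pods < 1:
        raise ValueError("n_pods must be at least 1")
    if rps_at_1 <= 0:
        raise ValueError("rps_at_1 must be positive")
    return rps_at_n / rps_at_1 / n_pods


if __name__ == "__main__":
    # Hypothetical load-test results: pod count -> total RPS.
    measurements = {1: 500.0, 2: 960.0, 4: 1100.0}
    baseline = measurements[1]
    for pods, rps in sorted(measurements.items()):
        eff = scaling_efficiency(rps, baseline, pods)
        print(f"{pods} pods: {rps:.0f} RPS, efficiency {eff:.2f}")
```

In this made-up data, two pods scale almost linearly while four pods fall to roughly half efficiency, which is the pattern that points at a serial bottleneck or contention.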

Decision tree

Total RPS flat, latency up?
  → DB/downstream limit or hot partition

Total RPS grows, latency flat?
  → Per-request work unchanged; need faster queries or cache

Total RPS grows, latency improves?
  → Scale was working; check cost/efficiency only

Latency worse after scale?
  → Contention, connection storm, or bad pods added
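
The decision tree above could be encoded roughly as follows for use in a load-test report. This is a sketch: the 5% `tolerance` for treating a change as "flat" is an assumption, as are the function and parameter names, and the branches are checked in the order the tree lists them.

```python
def diagnose(rps_before: float, rps_after: float,
             p99_before: float, p99_after: float,
             tolerance: float = 0.05) -> str:
    """Map before/after scaling measurements onto the decision tree.

    Relative changes within +/- tolerance count as "flat" (assumed threshold).
    """
    if rps_before <= 0 or p99_before <= 0:
        raise ValueError("baseline RPS and p99 must be positive")

    rps_change = (rps_after - rps_before) / rps_before
    p99_change = (p99_after - p99_before) / p99_before

    rps_grew = rps_change > tolerance
    latency_up = p99_change > tolerance
    latency_down = p99_change < -tolerance

    if not rps_grew and latency_up:
        return "DB/downstream limit or hot partition"
    if rps_grew and not latency_up and not latency_down:
        return "Per-request work unchanged; need faster queries or cache"
    if rps_grew and latency_down:
        return "Scale was working; check cost/efficiency only"
    if latency_up:
        return "Contention, connection storm, or bad pods added"
    return "Not covered by the decision tree; inspect traces"


if __name__ == "__main__":
    # Hypothetical numbers: RPS barely moved, p99 rose from 200 ms to 320 ms.
    print(diagnose(1000, 1020, 200, 320))
```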

How — scale the right thing

1. Fix per-request cost first

2. Scale data and async tiers

3. Safe app scaling checklist

  1. Stateless pods; no local authoritative state.
  2. Pool size × max pods ≤ DB connection budget.
  3. HPA on RPS or custom latency SLO, not CPU alone.
  4. Readiness excludes broken instances from LB.
  5. Load test 2× pods in staging with production-like data volume.
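
Checklist item 2 is simple arithmetic, sketched below. The `reserved_connections` parameter (connections held back for admin tools, migrations, or other services) is an assumption added for illustration, and all numbers are illustrative rather than recommended settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionBudget:
    needed: int      # worst case: every pod fills its pool
    available: int   # what the database can hand out to this app

    @property
    def fits(self) -> bool:
        return self.needed <= self.available


def connection_budget(max_pods: int, pool_size_per_pod: int,
                      db_max_connections: int,
                      reserved_connections: int = 0) -> ConnectionBudget:
    """Check pool size x max pods against the DB connection budget."""
    needed = max_pods * pool_size_per_pod
    available = db_max_connections - reserved_connections
    return ConnectionBudget(needed=needed, available=available)


def max_safe_pods(pool_size_per_pod: int, db_max_connections: int,
                  reserved_connections: int = 0) -> int:
    """Largest pod count whose combined pools still fit the budget."""
    if pool_size_per_pod < 1:
        raise ValueError("pool_size_per_pod must be at least 1")
    return max(0, (db_max_connections - reserved_connections) // pool_size_per_pod)


if __name__ == "__main__":
    # Illustrative values: HPA max of 20 pods, pool of 10 per pod,
    # max_connections of 100 with 3 connections held in reserve.
    budget = connection_budget(20, 10, 100, reserved_connections=3)
    print(budget, "fits" if budget.fits else "exceeds budget")
    print("max safe pods:", max_safe_pods(10, 100, reserved_connections=3))
```

A check like this is most useful against the autoscaler's maximum replica count rather than the current one, since the connection storm arrives during scale-out.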

4. When to stop adding pods

If an additional pod adds less than 5% throughput or increases the error rate, stop and invest in the bottleneck tier. Document the maximum efficient replica count in the runbook.
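
The stopping rule can be applied to load-test results with a sketch like the one below. The data is hypothetical, and normalizing the gain per added pod (so that steps larger than one pod can be compared) is an assumption about how to read "marginal pod".

```python
from typing import List, Tuple

# Each result is (pods, total_rps, error_rate) from one load-test run.
Result = Tuple[int, float, float]


def max_efficient_replicas(results: List[Result], min_gain: float = 0.05) -> int:
    """Return the last pod count before an added pod yields less than
    min_gain relative throughput or a higher error rate."""
    if not results:
        raise ValueError("need at least one measurement")
    ordered = sorted(results)
    best = ordered[0][0]
    for (prev_pods, prev_rps, prev_err), (pods, rps, err) in zip(ordered, ordered[1:]):
        added = pods - prev_pods
        if added < 1 or prev_rps <= 0:
            raise ValueError("pod counts must be distinct and RPS positive")
        gain_per_pod = (rps - prev_rps) / prev_rps / added
        if gain_per_pod < min_gain or err > prev_err:
            break
        best = pods
    return best


if __name__ == "__main__":
    # Hypothetical staging results.
    runs = [
        (1, 500.0, 0.001),
        (2, 960.0, 0.001),
        (3, 1250.0, 0.001),
        (4, 1290.0, 0.001),
        (5, 1300.0, 0.004),
    ]
    print("max efficient replica count:", max_efficient_replicas(runs))
```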

Verify

  1. Load test: 1 → 2 → 4 pods; record RPS and p99 at fixed client concurrency.
  2. DB connections and lock waits stay stable and within the allowed range.
  3. Trace waterfall shows shorter or parallelized work, not duplicated contention.

Interview one-liner

“I check whether total throughput scales with pod count and where traces spend time. If the DB or a hot key is serial, more app instances only add connections and contention—I fix query cost, pool budget, and scale the database, cache, or queue instead of blindly adding pods.”
