Latency regressed after deploy, errors still zero

Scenario

Right after release v1.43, p99 latency doubles. Error rate and HTTP 5xx stay flat—users only feel slowness. Leadership asks: did the deploy cause it? You must tie the metric shift to build version, find the slower code path, and decide whether to roll back or forward-fix—without guessing from deploy time alone.

After reading, you should be able to:

Why — slow is not broken, but SLOs still burn

Latency regressions without errors are common: new N+1 query, extra RPC, debug logging left on, larger payloads, worse cache hit rate, or heavier serialization. Monitoring that only alerts on 5xx will miss them until customers complain or SLO budget drains. Deploys are the strongest temporal clue—but correlation needs version dimensions, not just clock time.

Deploy-adjacent causes (not always “bad code”)

Rollback is an experiment. If p99 drops immediately after revert to v1.42, causation is strong. If unchanged, look for external factors or incomplete rollback (cached config).

What — prove causation and locate the regression

  1. Overlay deploy markers on latency — Grafana/deployment annotations; note canary vs full promote time.
  2. Split metrics by version
    histogram_quantile(0.99,
      sum by (le, version) (rate(http_server_requests_seconds_bucket[5m])))
    p99 rises only on version="1.43.0" → deploy hypothesis strengthened.
  3. Control for traffic — plot RPS, payload size, cache hit ratio alongside latency. Same RPS, worse p99 → real regression.
  4. During rolling deploy: compare pod labelsdeployment_revision=old vs new on same dashboard.
  5. Trace comparison — filter traces by service.version or custom build attribute; diff critical path spans (new DB call? +400ms HTTP?).
  6. Git / change log — what merged in v1.43: queries, serializers, logging level, default flags.
  7. JVM signals on new version only — GC pause, CPU, allocation rate — links to GC pauses, high CPU.
  8. Logs with correlation — new “debug” lines per request? — correlation id across canary pods.

Causation strength

EvidenceInterpretation
p99 up only on new version labelStrong deploy link
p99 up all versions, same timeDependency, DB, or traffic—not your binary alone
Rollback restores p99 in < 5 minVery strong; document incident
p99 up before deploy startedDeploy innocent; keep searching

How — rollback, fix forward, prevent repeat

Decision: rollback vs fix forward

RollbackFix forward
SLO burning, no quick patchOne-line flag off or config revert
Unknown root causeFix verified in staging with perf test
Data migration already appliedRollback risky—mitigate with feature flag

Safe rollback steps

  1. Announce incident channel; assign commander.
  2. Revert deployment to last known good image/tag (not only scale up).
  3. Confirm old version pod count = 100%; drain new revision.
  4. Watch p99 by version—must separate old vs new until stable.
  5. Invalidate CDN/config caches if app caches build-specific behavior.
  6. Post-incident: root cause in code + perf test gap.

Prevention (platform)

Canary query idea

# Alert if canary p99 > 1.2 × stable for 10m
p99{version="canary"} / p99{version="stable"} > 1.2

Verify fix or rollback

  1. p99 at or below pre-deploy baseline for 30+ minutes.
  2. Trace sample on new build: no new long spans vs baseline trace.
  3. Document regression test added (integration or load).

Interview one-liner

“I align the deploy timestamp with p99 split by service version, compare traces and JVM metrics between old and new pods, and roll back if the new version is clearly slower—then add canary latency gates so error-free regressions cannot reach full traffic.”

Related scenarios