Latency regressed after deploy, errors still zero
Scenario
Right after release v1.43, p99 latency doubles. Error rate and HTTP 5xx stay flat—users only feel slowness. Leadership asks: did the deploy cause it? You must tie the metric shift to build version, find the slower code path, and decide whether to roll back or forward-fix—without guessing from deploy time alone.
After reading, you should be able to:
- Prove or disprove deploy causation using version-labeled metrics and traces.
- Separate rolling-deploy mix-up from a real regression in new code.
- Compare slow traces before/after by
build_versionspan attribute. - Roll back safely and add gates so silent latency regressions do not ship again.
Why — slow is not broken, but SLOs still burn
Latency regressions without errors are common: new N+1 query, extra RPC, debug logging left on, larger payloads, worse cache hit rate, or heavier serialization. Monitoring that only alerts on 5xx will miss them until customers complain or SLO budget drains. Deploys are the strongest temporal clue—but correlation needs version dimensions, not just clock time.
Deploy-adjacent causes (not always “bad code”)
- Cold JVM — JIT still warming; usually fades in minutes (still worth noting).
- Mixed versions during rollout — p99 blends fast old pods and slow new pods.
- Config change shipped with image — feature flag, timeout, pool size.
- Dependency bump — driver, HTTP client, framework patch.
- Traffic shift — deploy coincided with marketing spike (prove with RPS overlay).
- Downstream deploy — your service fine; partner slower — use traces.
Rollback is an experiment. If p99 drops immediately after revert to v1.42, causation is strong. If unchanged, look for external factors or incomplete rollback (cached config).
What — prove causation and locate the regression
- Overlay deploy markers on latency — Grafana/deployment annotations; note canary vs full promote time.
-
Split metrics by version
histogram_quantile(0.99, sum by (le, version) (rate(http_server_requests_seconds_bucket[5m])))
p99 rises only onversion="1.43.0"→ deploy hypothesis strengthened. - Control for traffic — plot RPS, payload size, cache hit ratio alongside latency. Same RPS, worse p99 → real regression.
-
During rolling deploy: compare pod labels
—
deployment_revision=oldvsnewon same dashboard. -
Trace comparison
— filter traces by
service.versionor custombuildattribute; diff critical path spans (new DB call? +400ms HTTP?). -
Git / change log
— what merged in
v1.43: queries, serializers, logging level, default flags. - JVM signals on new version only — GC pause, CPU, allocation rate — links to GC pauses, high CPU.
- Logs with correlation — new “debug” lines per request? — correlation id across canary pods.
Causation strength
| Evidence | Interpretation |
|---|---|
| p99 up only on new version label | Strong deploy link |
| p99 up all versions, same time | Dependency, DB, or traffic—not your binary alone |
| Rollback restores p99 in < 5 min | Very strong; document incident |
| p99 up before deploy started | Deploy innocent; keep searching |
How — rollback, fix forward, prevent repeat
Decision: rollback vs fix forward
| Rollback | Fix forward |
|---|---|
| SLO burning, no quick patch | One-line flag off or config revert |
| Unknown root cause | Fix verified in staging with perf test |
| Data migration already applied | Rollback risky—mitigate with feature flag |
Safe rollback steps
- Announce incident channel; assign commander.
- Revert deployment to last known good image/tag (not only scale up).
- Confirm old version pod count = 100%; drain new revision.
- Watch p99 by version—must separate old vs new until stable.
- Invalidate CDN/config caches if app caches build-specific behavior.
- Post-incident: root cause in code + perf test gap.
Prevention (platform)
- Canary + automated promote — promote only if canary p99 within threshold vs baseline.
- CI performance gate — load test staging on PR; fail if p99 regresses > X%.
- Always emit
versionlabel on metrics and traces frombuild.info/ env. - SLO alerts on latency, not only error rate.
- Progressive delivery — 5% → 25% → 100% with pause and compare.
Canary query idea
# Alert if canary p99 > 1.2 × stable for 10m
p99{version="canary"} / p99{version="stable"} > 1.2
Verify fix or rollback
- p99 at or below pre-deploy baseline for 30+ minutes.
- Trace sample on new build: no new long spans vs baseline trace.
- Document regression test added (integration or load).
Interview one-liner
“I align the deploy timestamp with p99 split by service version, compare traces and JVM metrics between old and new pods, and roll back if the new version is clearly slower—then add canary latency gates so error-free regressions cannot reach full traffic.”