Latency regressed after deploy, errors still zero

Scenario

Right after release v1.43, p99 latency doubles. Error rate and HTTP 5xx stay flat—users only feel slowness. Leadership asks: did the deploy cause it? You must tie the metric shift to build version, find the slower code path, and decide whether to roll back or forward-fix—without guessing from deploy time alone.

After reading, you should be able to:

Prove or disprove deploy causation using version-labeled metrics and traces.
Separate rolling-deploy version mixing from a real regression in new code.
Compare slow traces before/after by build_version span attribute.
Roll back safely and add gates so silent latency regressions do not ship again.

Why — slow is not broken, but SLOs still burn

Latency regressions without errors are common: new N+1 query, extra RPC, debug logging left on, larger payloads, worse cache hit rate, or heavier serialization. Monitoring that only alerts on 5xx will miss them until customers complain or SLO budget drains. Deploys are the strongest temporal clue—but correlation needs version dimensions, not just clock time.

Deploy-adjacent causes (not always “bad code”)

Cold JVM — JIT still warming; usually fades in minutes (still worth noting).
Mixed versions during rollout — p99 blends fast old pods and slow new pods.
Config change shipped with image — feature flag, timeout, pool size.
Dependency bump — driver, HTTP client, framework patch.
Traffic shift — deploy coincided with marketing spike (prove with RPS overlay).
Downstream deploy — your service fine; partner slower — use traces.

Rollback is an experiment. If p99 drops immediately after revert to v1.42, causation is strong. If unchanged, look for external factors or incomplete rollback (cached config).

What — prove causation and locate the regression

Overlay deploy markers on latency — Grafana/deployment annotations; note canary vs full promote time.

Split metrics by version

histogram_quantile(0.99,
  sum by (le, version) (rate(http_server_requests_seconds_bucket[5m])))

p99 rises only on version="1.43.0" → deploy hypothesis strengthened.
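To see why the split matters, here is a minimal Python sketch with made-up request samples. The helper names (`p99`, `p99_by_version`), the nearest-rank percentile method, and the traffic mix are assumptions for illustration; Prometheus itself estimates quantiles from histogram buckets rather than raw samples.

```python
from collections import defaultdict

def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]

def p99_by_version(requests):
    """requests: iterable of (version, latency_seconds) pairs."""
    by_version = defaultdict(list)
    for version, latency in requests:
        by_version[version].append(latency)
    return {version: p99(values) for version, values in by_version.items()}

# Hypothetical mid-rollout traffic: 90% on the old build, 10% on the new one.
requests = [("1.42.0", 0.10)] * 900 + [("1.43.0", 0.40)] * 100

blended = p99([latency for _, latency in requests])  # one number for all pods
split = p99_by_version(requests)                     # one number per build
```

The blended figure only says that something is slow. The per-version figures show that the old build is unchanged and the new build carries the whole regression.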

Control for traffic — plot RPS, payload size, cache hit ratio alongside latency. Same RPS, worse p99 → real regression.
During rolling deploy: compare pod labels — deployment_revision=old vs new on same dashboard.
Trace comparison — filter traces by service.version or custom build attribute; diff critical path spans (new DB call? +400ms HTTP?).
Git / change log — what merged in v1.43: queries, serializers, logging level, default flags.
JVM signals on new version only: GC pause, CPU, allocation rate. See the related GC pauses and high CPU scenarios.
Logs with correlation: are there new debug lines per request? Follow a correlation ID across canary pods to check.
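The trace comparison step above can be sketched as a diff of span durations between one representative slow trace per build. This is a rough illustration only: the span names, the flat name-to-milliseconds mapping, and the 50 ms threshold are hypothetical, and a real tracing backend exposes richer structures.

```python
def diff_critical_path(baseline, candidate, min_delta_ms=50.0):
    """Compare two traces given as {span_name: duration_ms}.

    Returns (span_name, kind, value_ms) tuples, largest value first.
    kind is "new span" (value = duration) or "slower" (value = added time).
    """
    findings = []
    for name, duration in candidate.items():
        if name not in baseline:
            findings.append((name, "new span", duration))
        elif duration - baseline[name] >= min_delta_ms:
            findings.append((name, "slower", duration - baseline[name]))
    return sorted(findings, key=lambda finding: finding[2], reverse=True)

# Hypothetical traces for the same endpoint on the old and new build.
old_trace = {"GET /orders": 120.0, "db.select_orders": 40.0, "http.pricing": 60.0}
new_trace = {
    "GET /orders": 560.0,
    "db.select_orders": 45.0,
    "http.pricing": 460.0,
    "db.select_order_items": 30.0,
}

findings = diff_critical_path(old_trace, new_trace)
```

Here the root span is 440 ms slower, almost all of it explained by the 400 ms added to the pricing HTTP call, plus one DB call that did not exist before.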

Causation strength

Evidence	Interpretation
p99 up only on new version label	Strong deploy link
p99 up all versions, same time	Dependency, DB, or traffic—not your binary alone
Rollback restores p99 in < 5 min	Very strong; document incident
p99 up before deploy started	Deploy innocent; keep searching
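
The table can also be read as a small decision procedure. A minimal sketch, assuming p99 per version has already been pulled for a window before and after the deploy; the function names, the 1.5x threshold, and the sample numbers are hypothetical.

```python
def slow_versions(before, after, ratio=1.5):
    """Versions whose p99 grew by at least `ratio`.

    A version with no pre-deploy data (the new build) is compared
    against the best p99 seen before the deploy.
    """
    reference = min(before.values())
    return {
        version
        for version, p99 in after.items()
        if p99 >= ratio * before.get(version, reference)
    }

def interpret(before, after, new_version, regressed_before_deploy=False):
    if regressed_before_deploy:
        return "Deploy innocent; keep searching"
    slow = slow_versions(before, after)
    if not slow:
        return "No regression detected"
    if len(after) < 2:
        return "Inconclusive: only one version serving, test by rollback"
    if slow == {new_version}:
        return "Strong deploy link"
    if slow == set(after):
        return "Dependency, DB, or traffic; not your binary alone"
    return "Inconclusive"

before = {"1.42.0": 0.20}
after = {"1.42.0": 0.21, "1.43.0": 0.42}
verdict = interpret(before, after, new_version="1.43.0")
```

The rollback row of the table is deliberately left out: it is an experiment you run, not something you can infer from the same two windows.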

How — rollback, fix forward, prevent repeat

Decision: rollback vs fix forward

Prefer rollback when	Prefer fix forward when
SLO burning, no quick patch	One-line flag off or config revert
Unknown root cause	Fix verified in staging with perf test
Release is cleanly reversible	Data migration already applied: rollback is risky, mitigate with a feature flag

Safe rollback steps

Announce incident channel; assign commander.
Revert deployment to last known good image/tag (not only scale up).
Confirm old version pod count = 100%; drain new revision.
Watch p99 by version; keep old and new as separate series until the rollback is stable.
Invalidate CDN/config caches if app caches build-specific behavior.
Post-incident: root cause in code + perf test gap.
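
The "old version pod count = 100%" check in the steps above can be automated. A minimal sketch, assuming you can list ready pods with their version label; the pod names, the input shape, and the `rollback_status` helper are hypothetical.

```python
def rollback_status(pod_versions, good_version):
    """pod_versions: mapping of ready pod name -> version label."""
    stragglers = sorted(
        pod for pod, version in pod_versions.items() if version != good_version
    )
    total = len(pod_versions)
    good_share = (total - len(stragglers)) / total if total else 0.0
    return {
        "good_share": good_share,
        "complete": total > 0 and not stragglers,
        "stragglers": stragglers,
    }

# Hypothetical snapshot taken while the rollback is still in progress.
pods = {"api-7c9f-a": "1.42.0", "api-7c9f-b": "1.42.0", "api-5d2e-x": "1.43.0"}
status = rollback_status(pods, "1.42.0")
```

In this snapshot one pod still runs the new revision, so the rollback is not complete and the per-version p99 should stay on the dashboard.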

Prevention (platform)

Canary + automated promote — promote only if canary p99 within threshold vs baseline.
CI performance gate — load test staging on PR; fail if p99 regresses > X%.
Always emit version label on metrics and traces from build.info / env.
SLO alerts on latency, not only error rate.
Progressive delivery — 5% → 25% → 100% with pause and compare.

Canary query idea

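# Alert if canary p99 > 1.2 × stable for 10m (set "for: 10m" on the alert rule)
histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket{version="canary"}[5m])))
  / histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket{version="stable"}[5m])))
  > 1.2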
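
The same ratio can drive an automated promote decision across the 5% → 25% → 100% stages. This is a sketch under assumptions: `read_p99` stands in for whatever queries your metrics backend, and the stage list, function names, and sample observations are hypothetical.

```python
STAGES = [5, 25, 100]  # percent of traffic on the new version
MAX_RATIO = 1.2        # canary p99 may be at most 1.2x the baseline

def canary_passes(canary_p99, baseline_p99, max_ratio=MAX_RATIO):
    if baseline_p99 <= 0:
        return False  # no usable baseline: refuse to promote
    return canary_p99 / baseline_p99 <= max_ratio

def rollout(read_p99, stages=STAGES):
    """read_p99(percent) -> (canary_p99, baseline_p99) observed at that stage.

    Returns the last stage that passed the gate, or 0 if the first failed.
    """
    promoted = 0
    for percent in stages:
        canary, baseline = read_p99(percent)
        if not canary_passes(canary, baseline):
            return promoted  # halt here; the caller decides how to roll back
        promoted = percent
    return promoted

# Hypothetical observations: fine at 5%, clearly slower at 25%.
observations = {5: (0.21, 0.20), 25: (0.31, 0.20), 100: (0.20, 0.20)}
reached = rollout(lambda percent: observations[percent])
```

With these numbers the rollout stops after the 5% stage, so the regression never reaches full traffic even though no request failed.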

Verify fix or rollback

p99 at or below pre-deploy baseline for 30+ minutes.
Trace sample on new build: no new long spans vs baseline trace.
Document regression test added (integration or load).
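
The 30-minute criterion above is easy to check mechanically. A minimal sketch, assuming one p99 sample per minute exported from your dashboard; the `recovered` helper and the sample series are hypothetical.

```python
def recovered(series, baseline_p99, window_minutes=30):
    """series: list of (minute, p99_seconds), one point per minute, oldest first.

    True only if the most recent `window_minutes` points are all at or
    below the pre-deploy baseline.
    """
    if len(series) < window_minutes:
        return False  # not enough data yet to call it stable
    recent = series[-window_minutes:]
    return all(p99 <= baseline_p99 for _, p99 in recent)

baseline = 0.20
# Hypothetical: 10 slow minutes, then 30 healthy minutes after the rollback.
series = [(m, 0.42) for m in range(10)] + [(m, 0.19) for m in range(10, 40)]

ok = recovered(series, baseline)
too_early = recovered(series[:25], baseline)
```

Requiring the whole window to be clean avoids declaring victory on a single good sample while the rollout is still mixed.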

Interview one-liner

“I align the deploy timestamp with p99 split by service version, compare traces and JVM metrics between old and new pods, and roll back if the new version is clearly slower—then add canary latency gates so error-free regressions cannot reach full traffic.”
