API works locally but fails in production
Scenario
Developers verify the endpoint on localhost and get 200 OK. In production the same endpoint returns 401, 403, 500, timeouts, or silently wrong data, and the failure does not reproduce locally. The gap is almost never “Java is different”; it is the environment, configuration, network, data, and dependencies that differ between laptop and prod.
After reading, you should be able to:
- Run an ordered first-pass checklist before deep JVM debugging.
- Compare the same request against prod and local (status, headers, body, logs).
- Map failures to config, network/TLS, auth/IAM, data, and missing downstream services.
- Shrink the gap with staging parity and smoke tests in CI.
Why — local is not a small copy of prod
Local development optimizes for speed: embedded DB, mocked payments, admin credentials, HTTP not HTTPS, no load balancer. Production adds real policies—secrets, firewalls, WAF, IAM roles, production data shape, and multi-hop networking. The same code path can succeed locally and fail in prod because inputs to the path changed, not because the compiler behaved differently.
Top mismatch categories
| Category | Local typical | Production typical |
|---|---|---|
| Configuration | application-dev.yml, defaults | Env vars, ConfigMaps, secrets, feature flags |
| Networking | localhost, no TLS | DNS, mTLS, security groups, egress proxy |
| Auth | Disabled or test JWT | OAuth, IAM role, API keys, IP allowlist |
| Data | Empty or seed data | Volume, edge cases, time zones, charset |
| Dependencies | Mocks or Docker containers that only you run | Real SQS, Kafka ACLs, partner sandbox vs prod |
| Runtime | IDE classpath, more memory | Container image, JDK flags, read-only filesystem |
| Scale | One user, one pod | Load balancer, multiple replicas, races |
Symptom ≠ root cause. A 500 in prod may be a downstream 403 masked by a generic catch block. Always read prod logs and the actual exception before changing code.
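One way to get from a generic 500 to the underlying exception is to filter the logs by request id. The sketch below assumes the logs are plain text, have been saved to a file, and contain the request id verbatim on the relevant line. The function name and the log layout are illustrative, not taken from any particular logging setup.

```shell
# Print every log line that mentions a given request id.
# Assumption: the id appears verbatim in the log line for that request.
# Optional third argument: number of trailing lines to include after each
# match, which helps when a stack trace follows on lines without the id.
log_lines_for_request() {
  local request_id="$1" log_file="$2" context="${3:-0}"
  if [ "$context" -gt 0 ]; then
    # -F matches the id as a literal string, not a regex.
    grep -F -A "$context" -- "$request_id" "$log_file"
  else
    grep -F -- "$request_id" "$log_file"
  fi
}
```

Called as `log_lines_for_request <request-id> app.log 20`, this would print the matching line plus the 20 lines after it, which is often enough to see the real downstream error rather than the masked 500.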
What — investigate in this order
- Capture the failing prod fact: HTTP status, response body, request id, timestamp, pod name, trace id. Without this you are guessing.
- Reproduce with the same client call:

      curl -sv -X POST 'https://api.prod.example/orders' \
        -H 'Authorization: Bearer …' \
        -H 'Content-Type: application/json' \
        -d @payload.json

  Compare to the local URL, headers, and body byte-for-byte (including trailing slashes and query-parameter order).
- Find the log line in prod for that request id: stack trace, “connection refused,” “Access Denied,” “SSL handshake failure,” validation error.
- Config diff (most common fix):
  - Active Spring profile: SPRING_PROFILES_ACTIVE
  - Missing env var (empty string vs unset)
  - Wrong JDBC URL (host, SSL mode, schema)
  - Feature flag off in prod
  - Clock skew: token “not yet valid”
- Network path from the pod:

      kubectl exec -it <pod> -- curl -sv https://partner-api/health
      kubectl exec -it <pod> -- nslookup db.prod.internal

  Fails in the pod but works on your laptop → security group (SG), network ACL (NACL), egress, private link, or DNS.
- TLS / trust: corporate CA not in container truststore; cert hostname mismatch; TLS 1.2 required by partner.
- Auth & IAM: pod service account lacks S3/SQS permission; OAuth client secret rotated; prod API key scope.
- Data-dependent logic: null FK in prod, duplicate unique key, legacy row breaks new validation; timezone “today” differs.
- Container vs IDE differences: case-sensitive paths on Linux image; file not in JAR; profile-specific bean missing in prod build.
- Only fails under load? Then it is no longer a “works locally” problem; use the pool, GC, or “502 with healthy pods” guides.
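The “empty string vs unset” item in the config diff is easy to miss, because both cases look the same when you echo the variable. A minimal sketch for telling them apart, assuming `printenv` is available in the environment where the check runs; the function name `env_state` is illustrative:

```shell
# Report whether an exported environment variable is unset, empty, or set.
# Only the state is printed, never the value, so it is safe to use on secrets.
# printenv reads the exported environment, which is what a child process
# such as the JVM would inherit.
env_state() {
  local value
  if ! value="$(printenv "$1")"; then
    echo "unset"          # printenv exits non-zero when the name is absent
  elif [ -z "$value" ]; then
    echo "empty"          # defined, but with an empty string
  else
    echo "set"
  fi
}
```

For example, `env_state SPRING_PROFILES_ACTIVE` run in the same environment as the application shows whether the profile variable is missing entirely or present but blank. Those two cases can behave differently depending on how the application reads its configuration.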
Quick decision table
| Prod symptom | Likely first check |
|---|---|
| 401 / 403 | Token, IAM, WAF, IP allowlist |
| Connection timeout | DNS, SG, wrong host/port, dependency down |
| SSL errors | Truststore, cert chain, SNI hostname |
| 500 + NPE in logs | Prod-only null data; fix validation or migration |
| Works for you, not users | CDN, geo, A/B flag, canary pod version skew |
| Intermittent | One bad replica, sticky session, short stalls |
What to paste in the incident ticket
- Exact curl (secrets redacted), prod response, log stack trace
- Config keys involved (names only if secret)
- Whether staging reproduces (yes/no)
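Redacting the curl command by hand is error-prone, so a small filter can help. The sketch below assumes secrets appear only as `Authorization: Bearer <token>` or `Authorization: Basic <credentials>` headers. Anything else (API keys in custom headers, secrets in the request body) would need its own pattern. The function name is illustrative.

```shell
# Read text on stdin and replace Authorization header credentials with a
# placeholder. The credential is taken to end at the next quote or space.
# Assumption: only Bearer and Basic schemes need redacting.
redact_secrets() {
  sed -E "s/(Authorization: (Bearer|Basic) )[^'\" ]+/\1<REDACTED>/g"
}
```

Usage might look like `pbpaste | redact_secrets` or `redact_secrets < curl-command.txt` before the text goes into the ticket. Review the output by eye before pasting it.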
How — fix and prevent the local/prod gap
Immediate fix
- Correct config/secret in prod (with change control).
- Open the network path, or rotate credentials if the failure is auth/TLS.
- Data patch or feature flag off until code handles prod edge case.
- Roll back deploy if regression started at deploy time.
Long-term parity
- Staging ≈ prod — same IAM model, TLS, DB engine/version, feature flags; run smoke tests on every merge.
- 12-factor config — no prod-only magic in code; everything over env with documented keys.
- Contract tests — consumer-driven tests against partner sandboxes.
- Runbooks — “curl from pod” and “compare profiles” steps for on-call.
- Observability — structured logs + trace id on every request — see distributed trace.
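A “compare profiles” runbook step can be as simple as diffing which config keys exist in each environment. The sketch below assumes each environment's config is available as a file of `KEY=VALUE` lines. The file names and function names are hypothetical. It compares key names only, so no secret values are printed.

```shell
# List the key names defined in a KEY=VALUE file, sorted and de-duplicated.
# Comment lines and blank lines do not match the pattern and are skipped.
config_keys() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | sort -u
}

# Print keys that are defined in the first file but absent from the second.
missing_keys() {
  local expected actual
  expected="$(mktemp)"
  actual="$(mktemp)"
  config_keys "$1" > "$expected"
  config_keys "$2" > "$actual"
  # comm -23 keeps only lines unique to the first (sorted) input.
  comm -23 "$expected" "$actual"
  rm -f "$expected" "$actual"
}
```

Running `missing_keys staging.env prod.env` would list keys that staging defines and prod does not. Swapping the arguments shows the reverse. A key that exists in both files but is empty in one will not show up here, so this complements a check for empty values rather than replacing it.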
CI smoke test (sketch)
    # After deploy to staging
    ./scripts/smoke.sh --base-url "$STAGING_URL" --token "$CI_TOKEN"
    # Fails pipeline before prod promotion
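The contents of `smoke.sh` are not shown here, so the following is only a sketch of the kind of check such a script might run. The `/health` path, the bearer-token header, and both function names are assumptions for illustration.

```shell
# Compare an actual HTTP status code with the expected one.
# Prints PASS on stdout, or FAIL on stderr with a non-zero return.
expect_status() {
  local expected="$1" actual="$2" label="$3"
  if [ "$actual" = "$expected" ]; then
    echo "PASS $label ($actual)"
  else
    echo "FAIL $label: expected $expected, got $actual" >&2
    return 1
  fi
}

# Call one endpoint and check its status code (default expectation: 200).
# Assumption: the service accepts the token as a bearer Authorization header.
smoke_check() {
  local base_url="$1" token="$2" path="$3" expected="${4:-200}" actual
  # -w '%{http_code}' prints only the status; the body is discarded.
  # On a connection failure curl reports 000, which fails the comparison.
  actual="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "Authorization: Bearer $token" "$base_url$path")"
  expect_status "$expected" "$actual" "GET $path"
}
```

A script built on this could call `smoke_check "$STAGING_URL" "$CI_TOKEN" /health` for each critical endpoint and exit non-zero on the first failure, which is what stops the pipeline before prod promotion.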
Interview one-liner
“I capture the exact prod status and logs, replay the same request with curl, diff config and network from inside the pod, then check auth, TLS, and prod-only data—before I assume it’s a code bug. Staging parity and smoke tests prevent repeat incidents.”