API works locally but fails in production

Scenario

Developers verify the endpoint on localhost and get 200 OK. In production the same endpoint returns 401, 403, or 500, times out, or silently returns wrong data, and attempts to reproduce the failure locally do not succeed. The gap is almost never “Java is different”; it is the environment, configuration, network, data, and dependencies that differ between laptop and prod.

After reading, you should be able to:

Run an ordered first-pass checklist before deep JVM debugging.
Compare the same request against prod and local (status, headers, body, logs).
Map failures to config, network/TLS, auth/IAM, data, and missing downstream services.
Shrink the gap with staging parity and smoke tests in CI.

Why — local is not a small copy of prod

Local development optimizes for speed: embedded DB, mocked payments, admin credentials, HTTP not HTTPS, no load balancer. Production adds real policies—secrets, firewalls, WAF, IAM roles, production data shape, and multi-hop networking. The same code path can succeed locally and fail in prod because inputs to the path changed, not because the compiler behaved differently.

Top mismatch categories

Category	Local typical	Production typical
Configuration	`application-dev.yml`, defaults	Env vars, ConfigMaps, secrets, feature flags
Networking	`localhost`, no TLS	DNS, mTLS, security groups, egress proxy
Auth	Disabled or test JWT	OAuth, IAM role, API keys, IP allowlist
Data	Empty or seed data	Volume, edge cases, time zones, charset
Dependencies	Mocks / Docker only you run	Real SQS, Kafka ACLs, partner sandbox vs prod
Runtime	IDE classpath, more memory	Container image, JDK flags, read-only filesystem
Scale	One user, one pod	Load balancer, multiple replicas, races

Symptom ≠ root cause. A 500 in prod may be a downstream 403 masked by a generic catch block. Always read prod logs and the actual exception before changing code.
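
One way to get past a masking symptom is to read the innermost cause in the logged stack trace rather than the top-level exception. The sketch below assumes the standard Java stack trace layout, where each nested cause starts with `Caused by:`. The helper name `root_cause` is hypothetical, and the function expects a single extracted stack trace, not a whole log file.

```bash
# Print the innermost "Caused by:" line of one Java stack trace, read from
# a file argument or from stdin. Falls back to the first line when the
# trace has no nested cause. (root_cause is a hypothetical helper name.)
root_cause() {
  awk '
    NR == 1 { first = $0 }
    /^[ \t]*Caused by:/ { last = $0 }
    END { print (last != "" ? last : first) }
  ' "$@"
}
```

Run against a trace where a generic 500 wraps a partner error, this prints only the wrapped exception line, which is usually the better starting point for the investigation.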

What — investigate in this order

Capture the failing prod fact — HTTP status, response body, request id, timestamp, pod name, trace id. Without this you are guessing.

Reproduce with the same client call

curl -sv -X POST 'https://api.prod.example/orders' \
  -H 'Authorization: Bearer …' \
  -H 'Content-Type: application/json' \
  -d @payload.json

Compare to local URL, headers, and body byte-for-byte (including trailing slashes, query order).
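
The comparison is easier when both responses are captured to files and normalized first. The following is a minimal sketch, assuming each header file was written by `curl -D <file>` and holds a single response (no redirects). The helper name `normalize_headers` and the choice to drop `date` and `x-request-id` as volatile headers are assumptions; adjust the list to whatever your stack emits.

```bash
# Normalize a raw header dump so two captures can be diffed line by line.
# Output: the numeric status, then header lines with lowercased names,
# sorted, with per-request headers (date, x-request-id) removed.
normalize_headers() {
  tr -d '\r' < "$1" | awk '
    NR == 1 { print "status: " $2; next }   # "HTTP/1.1 403 Forbidden" -> 403
    /^[^:]+: / {
      name  = tolower(substr($0, 1, index($0, ":") - 1))
      value = substr($0, index($0, ":") + 2)
      if (name != "date" && name != "x-request-id") print name ": " value
    }
  ' | sort
}
```

A possible workflow: run the same curl against both environments with `-D local.hdr -o local.body` and `-D prod.hdr -o prod.body`, write each normalized dump to a file, then use `diff` on the normalized headers and `cmp` on the bodies.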

Find the log line in prod for that request id — stack trace, “connection refused,” “Access Denied,” “SSL handshake failure,” validation error.
Config diff (most common fix)
- Active Spring profile: SPRING_PROFILES_ACTIVE
- Missing env var (empty string vs unset)
- Wrong JDBC URL (host, SSL mode, schema)
- Feature flag off in prod
- Clock skew: token rejected as “not yet valid”
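
The “empty string vs unset” distinction in the list above is easy to miss by eye. Below is a small sketch that could run inside the container at startup or from a debug shell. The helper names `var_state` and `check_required` are hypothetical; `printenv` is used because it inspects the exported environment, which is what the JVM process actually receives.

```bash
# Classify an environment variable as unset, empty, or set.
var_state() {
  if ! printenv "$1" > /dev/null; then
    echo "unset"
  elif [ -z "$(printenv "$1")" ]; then
    echo "empty"
  else
    echo "set"
  fi
}

# Return non-zero and name the offenders if any required key is not set.
check_required() {
  local rc=0 name state
  for name in "$@"; do
    state=$(var_state "$name")
    if [ "$state" != "set" ]; then
      echo "$name is $state" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `check_required SPRING_PROFILES_ACTIVE` exits non-zero and reports whether the profile variable is missing entirely or present but blank, two cases that configuration frameworks may treat differently.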

Network path from pod

kubectl exec -it <pod> -- curl -sv https://partner-api/health
kubectl exec -it <pod> -- nslookup db.prod.internal

Fails in pod but works on laptop → check security groups (SG), network ACLs (NACL), egress rules, private link, and DNS.
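
When the in-pod curl fails, its exit code already points at a layer. The sketch below maps a few well-known curl exit codes to the likely place to look first. The helper name `explain_curl_exit` and the wording of the hints are illustrative, and the mapping is a heuristic, not a diagnosis.

```bash
# Map a curl exit code to the network layer that most likely failed.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok: connection and TLS succeeded; inspect the HTTP status" ;;
    6)  echo "dns: could not resolve host" ;;
    7)  echo "connect: refused or unreachable (security group, NACL, wrong port)" ;;
    28) echo "timeout: packets likely dropped (egress rule, firewall)" ;;
    35) echo "tls: handshake failed (protocol version, cipher, mTLS)" ;;
    60) echo "tls: certificate verification failed (CA bundle, hostname)" ;;
    *)  echo "other: see the EXIT CODES section of the curl man page" ;;
  esac
}
```

A possible use is to run the curl command from the pod and pass `$?` to the helper, assuming `kubectl exec` passes the remote exit code through. Note that curl validates certificates against the image's CA bundle, while the JVM uses its own truststore, so a clean curl does not prove that the Java client trusts the same certificate.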

TLS / trust — corporate CA not in container truststore; cert hostname mismatch; TLS 1.2 required by partner.
Auth & IAM — pod service account lacks S3/SQS permission; OAuth client secret rotated; prod API key scope.
Data-dependent logic — null FK in prod, duplicate unique key, legacy row breaks new validation; timezone “today” differs.
Container vs IDE differences — case-sensitive paths on Linux image; file not in JAR; profile-specific bean missing in prod build.
Only fails under load? Then it is no longer a “works locally” problem; use the connection pool, GC, or “502 with healthy pods” guides instead.
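
For the data-dependent step above, the “today differs” case can be checked without touching application code. This sketch prints the calendar date of one instant in two zones. It assumes GNU `date` (the `-d @EPOCH` form), and the helper name `date_in_zone` is hypothetical. POSIX-style TZ strings such as `UTC0` and `JST-9` (nine hours east of UTC) are used so the example does not depend on the tz database being installed in the image.

```bash
# Print the calendar date of one instant (epoch seconds) in a given zone.
date_in_zone() {
  TZ="$1" date -d "@$2" +%F
}

# 1704151800 is 2024-01-01 23:30:00 UTC.
date_in_zone UTC0  1704151800   # 2024-01-01
date_in_zone JST-9 1704151800   # 2024-01-02
```

If a laptop runs in one zone and the prod container in another, a query filtered on “today” can therefore select different rows for the same request.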

Quick decision table

Prod symptom	Likely first check
401 / 403	Token, IAM, WAF, IP allowlist
Connection timeout	DNS, SG, wrong host/port, dependency down
SSL errors	Truststore, cert chain, SNI hostname
500 + NPE in logs	Prod-only null data; fix validation or migration
Works for you, not users	CDN, geo, A/B flag, canary pod version skew
Intermittent	One bad replica, sticky session, short stalls

What to paste in the incident ticket

Exact curl (secrets redacted), prod response, log stack trace
Config keys involved (names only if secret)
Whether staging reproduces (yes/no)
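
Redacting by hand is error-prone, so it may help to pipe the command through a filter before pasting. The following is a minimal sketch with a hypothetical helper name, `redact`. It only covers `Authorization: Bearer` and `Authorization: Basic` values; API keys in query strings, cookies, or request bodies would need their own patterns.

```bash
# Replace the credential in Authorization headers with a placeholder.
# Reads text on stdin and writes the redacted text to stdout.
redact() {
  sed -E 's#(Authorization: (Bearer|Basic) )[A-Za-z0-9._~+/=-]+#\1[REDACTED]#g'
}
```

For instance, a line containing `-H 'Authorization: Bearer abc.def-123'` comes out as `-H 'Authorization: Bearer [REDACTED]'`, with the rest of the command untouched.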

How — fix and prevent the local/prod gap

Immediate fix

Correct config/secret in prod (with change control).
Open the network path, or rotate credentials if the failure is auth/TLS.
Data patch or feature flag off until code handles prod edge case.
Roll back deploy if regression started at deploy time.

Long-term parity

Staging ≈ prod — same IAM model, TLS, DB engine/version, feature flags; run smoke tests on every merge.
12-factor config — no prod-only magic in code; everything over env with documented keys.
Contract tests — consumer-driven tests against partner sandboxes.
Runbooks — “curl from pod” and “compare profiles” steps for on-call.
Observability — structured logs and a trace id on every request (see the distributed trace guide).

CI smoke test (sketch)

# After deploy to staging
./scripts/smoke.sh --base-url "$STAGING_URL" --token "$CI_TOKEN"
# A non-zero exit fails the pipeline before prod promotion
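
The contents of the smoke script are not specified here. As one possible shape, the sketch below checks a single endpoint and returns non-zero on a mismatch. The function names `http_status` and `smoke_check`, the `/health` path, and the bearer-token header are all assumptions for illustration.

```bash
# Fetch one URL and print only the numeric HTTP status.
# -o /dev/null discards the body; --max-time bounds a hung connection.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "Authorization: Bearer $2" "$1"
}

# Compare the status of BASE_URL + PATH with the expected code.
# Returns 1 on mismatch so the calling CI step fails.
smoke_check() {
  local base_url=$1 token=$2 path=$3 expected=$4 actual
  actual=$(http_status "$base_url$path" "$token") || actual="curl-failed"
  if [ "$actual" = "$expected" ]; then
    echo "PASS $path -> $actual"
  else
    echo "FAIL $path -> $actual (expected $expected)" >&2
    return 1
  fi
}
```

A script built this way could call `smoke_check "$STAGING_URL" "$CI_TOKEN" /health 200` once per critical endpoint and exit on the first failure. Keeping the network call in its own function also makes the comparison logic testable with a stub.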

Interview one-liner

“I capture the exact prod status and logs, replay the same request with curl, diff config and network from inside the pod, then check auth, TLS, and prod-only data—before I assume it’s a code bug. Staging parity and smoke tests prevent repeat incidents.”
