Gateway 502/504 but Kubernetes pods are healthy
Scenario
Users see 502 Bad Gateway or 504 Gateway Timeout from the edge. kubectl get pods shows Running and readiness probes pass. The gap is between “process is up” and “the path through ingress → service → app can complete in time.” You debug by walking outward from the pod and inward from the client until the layer that fails is obvious.
After reading, you should be able to:
- Explain why K8s “healthy” ≠ successful user requests.
- Check endpoints, ingress, timeouts, and direct pod access.
- Distinguish 502 (bad upstream) vs 504 (timeout) vs app 503.
- Fix readiness probes, timeout chains, and pool exhaustion hiding behind fast /health.
Why — probes and user traffic are different paths
Kubernetes marks a pod Ready when the readiness probe succeeds—often a lightweight /actuator/health on a dedicated port.
That path may not touch the database, thread pool, or code that serves POST /checkout.
The gateway only routes to pods in the Endpoints list; if probes pass but app workers are stuck, you get healthy pods and failed customer traffic.
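To make the gap concrete, here is a small self-contained Python simulation. It is only a sketch and not Spring code: the function name `simulate`, the pool size, and the 0.2 s gateway timeout are illustrative assumptions. The business worker pool is saturated by calls that never return (standing in for a hung DB query), while the health check runs on its own thread, the way a management port often does.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def simulate(pool_size=2, gateway_timeout=0.2):
    """Return (health_status, business_status) as an edge proxy would see them."""
    release = threading.Event()  # stands in for a downstream call that hangs
    workers = ThreadPoolExecutor(max_workers=pool_size)  # serves business routes
    management = ThreadPoolExecutor(max_workers=1)       # serves the health route

    # Saturate every business worker with a call that does not return.
    for _ in range(pool_size):
        workers.submit(release.wait)

    # The shallow health check never touches the busy pool, so it stays fast.
    health_status = management.submit(lambda: 200).result(timeout=gateway_timeout)

    # A real request queues behind the stuck workers until the proxy gives up.
    request = workers.submit(lambda: 200)
    try:
        business_status = request.result(timeout=gateway_timeout)
    except FutureTimeout:
        business_status = 504

    release.set()  # unblock the workers so the simulation exits cleanly
    workers.shutdown(wait=True)
    management.shutdown(wait=True)
    return health_status, business_status


if __name__ == "__main__":
    print(simulate())
```

The probe keeps reporting success because it never competes for the saturated pool, which is exactly the situation in which pods stay Ready while users see gateway errors.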
Request path layers
Client → CDN/WAF → Load balancer → Ingress controller → Service → Pod:containerPort → App thread pool → DB/API
A 502/504 is generated at the proxy layer when the upstream fails or does not respond in time—not by your Spring exception handler.
502 vs 504 (typical meanings)
| Code | Often means |
|---|---|
| 502 | Upstream closed connection, invalid response, connection refused, reset |
| 504 | Upstream did not respond before the proxy's read timeout (for example nginx proxy_read_timeout) |
| 503 (app) | Service overloaded, often from app or ingress rate limit |
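
One way to see the 502/504 difference from the proxy's side is to probe an upstream over a raw socket. The sketch below is an illustration only, not how any particular ingress controller is implemented; `probe_upstream` and the 0.3 s timeout are hypothetical names and values.

```python
import socket


def probe_upstream(host, port, timeout=0.3):
    """Map what a proxy observes on the upstream socket to a typical status."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"GET / HTTP/1.0\r\n\r\n")
            data = conn.recv(1024)
    except socket.timeout:
        return 504  # nothing came back before the timeout expired
    except (ConnectionRefusedError, ConnectionResetError):
        return 502  # nothing listening, or the upstream reset the connection
    if not data:
        return 502  # upstream closed the connection without a response
    return 200      # some bytes arrived; a real proxy would parse them


if __name__ == "__main__":
    # A listener that accepts at the kernel level but never replies.
    silent = socket.socket()
    silent.bind(("127.0.0.1", 0))
    silent.listen(1)
    print(probe_upstream("127.0.0.1", silent.getsockname()[1]))
    silent.close()
```

A listener that never answers yields the timeout branch, while a port with nothing listening yields the refused branch, matching the first two rows of the table.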
Common causes with “green” pods
- Gateway timeout < app latency: the app is still working when the proxy gives up and returns 504.
- Thread pool full: health check is fast while API threads are blocked (see the pool guide).
- Wrong Service targetPort: traffic goes to a port nothing listens on (intermittent if config is mixed).
- No endpoints: selector mismatch, or all pods NotReady even though the Deployment shows the desired replica count.
- Readiness too shallow: the DB is down but the probe only reports that the process is up.
- Connection reset: app killed the request, OOM killer, or pod restarting mid-request.
- Ingress / mesh misconfig: wrong backend, max connections, HTTP vs HTTPS.
- NetworkPolicy: ingress controller cannot reach the pod port.
- During deploy: old pods terminated while the LB is still draining, causing a brief 502 burst.
What — investigate layer by layer
- Confirm where the error is generated: check response headers (via, server: nginx, cloud LB request id). App logs empty for that request → failed before app.
- Gateway / ingress access logs: look at upstream_status, upstream_connect_time, upstream_response_time, upstream_addr, status, request_time. upstream_status empty or - → could not connect. High upstream_response_time near timeout → 504.
- Endpoints actually populated:
      kubectl get endpoints my-service -n prod
      kubectl describe svc my-service -n prod
  Empty subsets → no ready pods or selector wrong.
- Bypass ingress, hit pod directly:
      kubectl port-forward pod/my-pod-abc 8080:8080
      curl -sv http://127.0.0.1:8080/api/orders -H '...'
  Works on port-forward but fails via ingress → ingress, Service, or NetworkPolicy.
- Same path as users through Service:
      kubectl run curl --rm -it --image=curlimages/curl -- \
        curl -sv http://my-service.prod.svc.cluster.local/orders
- Compare health vs business route: /health 200 while /orders hangs → deepen readiness; check pools and DB.
- App logs + trace: if the request id never appears, the connection never reached the servlet container. If it appears and then stalls → trace downstream.
- Recent deploy / config: timeout lowered, probe path changed → deploy regression.
- Pod events:
      kubectl describe pod my-pod-abc
  OOMKilled, restarts, failed readiness flapping.
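The access-log step above can be sketched as a small triage helper. This assumes a hypothetical `key=value` access-log format; real ingress log formats differ and would need their own parser, and the labels returned here are made up for illustration.

```python
def parse_fields(line):
    """Split 'key=value' tokens into a dict; tokens without '=' are ignored."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def triage(line):
    """Classify one access-log line using the rules from the checklist."""
    fields = parse_fields(line)
    status = fields.get("status", "")
    upstream_status = fields.get("upstream_status", "")

    if status not in ("502", "504"):
        return "not_gateway_error"
    if upstream_status in ("", "-"):
        return "connect_failed"    # check endpoints, targetPort, NetworkPolicy
    if status == "504":
        return "upstream_timeout"  # compare upstream_response_time to the timeout
    return "bad_upstream"          # check restarts, OOMKilled, resets


if __name__ == "__main__":
    sample = ("status=504 upstream_status=504 upstream_response_time=30.001 "
              "upstream_addr=10.0.0.5:8080 request_time=30.002")
    print(triage(sample))
```

Running a helper like this over a window of log lines gives a quick count per failure class, which tells you whether to look at connectivity or at latency first.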
Timeout chain (must be ordered)
client_timeout ≥ ingress_timeout ≥ app_read_timeout ≥ DB_timeout
Bad: ingress 30s, app default 60s, DB 120s → ingress 504 while app still waiting
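The ordering rule is easy to check mechanically. The following is a minimal sketch; `timeout_violations` is a hypothetical helper, and the values in the "good" chain are made-up examples.

```python
def timeout_violations(chain):
    """Find adjacent layers where the outer one gives up before the inner one.

    chain: list of (layer_name, timeout_seconds), ordered from the client
    side inward (client, ingress, app, DB).
    """
    return [
        (outer, inner)
        for (outer, outer_s), (inner, inner_s) in zip(chain, chain[1:])
        if outer_s < inner_s
    ]


if __name__ == "__main__":
    bad = [("ingress", 30), ("app", 60), ("db", 120)]
    good = [("client", 60), ("ingress", 30), ("app", 20), ("db", 10)]
    print(timeout_violations(bad))   # [('ingress', 'app'), ('app', 'db')]
    print(timeout_violations(good))  # []
```

A check like this could run in CI against the timeout values pulled from ingress annotations and application config, so a lowered ingress timeout is caught before it produces 504s.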
How — fix and harden the path
Fixes by finding
| Finding | Fix |
|---|---|
| 504, slow upstream_response_time | Fix app/DB slowness; or raise ingress timeout temporarily while fixing root cause |
| 502, connection refused | Fix targetPort, container listen address (0.0.0.0), pod not ready |
| Health OK, API hung | Make readiness include the DB; scale the pool; set timeouts (see the pool guide) |
| Deploy blip | preStop sleep, graceful shutdown, PDB, slower roll |
| NetworkPolicy | Allow ingress namespace → app port |
Stronger readiness (Spring example)
# readinessProbe checks dependency, not only ping
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
Implement readiness so that it fails when the DB pool cannot serve traffic, rather than passing whenever the JVM process is up.
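The idea is language-independent, so here is a minimal Python sketch of it rather than Spring code. The pool is modelled as a `queue.Queue` of connection objects, and `readiness_status` and the 0.1 s acquire timeout are hypothetical. The check reports not-ready when no connection can be borrowed in time.

```python
import queue


def readiness_status(pool, acquire_timeout_s=0.1):
    """Return 200 if a connection can be borrowed in time, else 503."""
    try:
        conn = pool.get(timeout=acquire_timeout_s)
    except queue.Empty:
        return 503  # pool exhausted: real requests would queue or fail too
    try:
        # A fuller check would also run a cheap query on conn at this point.
        return 200
    finally:
        pool.put(conn)  # always hand the connection back


if __name__ == "__main__":
    pool = queue.Queue()
    pool.put("conn-1")
    print(readiness_status(pool))
```

One trade-off to keep in mind: the probe itself borrows a connection, so the acquire timeout should stay short and the probe period should not be so aggressive that the check competes with real traffic.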
Graceful shutdown
- server.shutdown=graceful + adequate spring.lifecycle.timeout-per-shutdown-phase.
- preStop hook sleep so Endpoints remove pod before SIGTERM.
- Stop accepting new work before kill.
Monitoring
- Alert on ingress 5xx rate, not only pod Ready count.
- Metric: endpoints_available vs deployment_replicas_desired.
- Synthetic check through public URL (same path as users).
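As a rough sketch of the first two alert conditions, the function below combines the ingress 5xx rate with the endpoints-versus-desired comparison. The function name, the reason labels, and the 1% threshold are assumptions for illustration, not values from any monitoring product.

```python
def alert_reasons(status_counts, ready_endpoints, desired_replicas,
                  max_5xx_rate=0.01):
    """Return the list of alert reasons for one evaluation window.

    status_counts: dict mapping HTTP status code (int) to request count,
    taken from ingress metrics or access logs.
    """
    total = sum(status_counts.values())
    errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    rate = errors / total if total else 0.0

    reasons = []
    if rate > max_5xx_rate:
        reasons.append("ingress_5xx_rate")
    if ready_endpoints < desired_replicas:
        reasons.append("endpoints_below_desired")
    return reasons


if __name__ == "__main__":
    print(alert_reasons({200: 980, 502: 15, 504: 5}, 3, 3))
```

The first example window fires on the 5xx rate alone even though every replica is Ready, which is the case a Ready-count alert would miss.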
Verify
- Public curl returns 200 for representative API.
- Ingress 502/504 rate near zero under load test.
- Readiness fails when DB down (pod removed from Service).
Interview one-liner
“Healthy pods only mean probes passed. I check ingress upstream logs, endpoints, curl via Service and port-forward, and whether /health is cheaper than real traffic. Usually it’s a timeout mismatch, empty endpoints, or thread pool exhaustion while the health check still returns 200.”