Gateway 502/504 but Kubernetes pods are healthy

Scenario

Users see 502 Bad Gateway or 504 Gateway Timeout from the edge. kubectl get pods shows Running and readiness probes pass. The gap is between “process is up” and “the path through ingress → service → app can complete in time.” You debug by walking outward from the pod and inward from the client until the layer that fails is obvious.

After reading, you should be able to:

Explain why K8s “healthy” ≠ successful user requests.
Check endpoints, ingress, timeouts, and direct pod access.
Distinguish 502 (bad upstream) vs 504 (timeout) vs app 503.
Fix readiness probes, timeout chains, and pool exhaustion hiding behind fast /health.

Why — probes and user traffic are different paths

Kubernetes marks a pod Ready when the readiness probe succeeds—often a lightweight /actuator/health on a dedicated port. That path may not touch the database, thread pool, or code that serves POST /checkout. The gateway only routes to pods in the Endpoints list; if probes pass but app workers are stuck, you get healthy pods and failed customer traffic.

Request path layers

Client → CDN/WAF → Load balancer → Ingress controller → Service → Pod:containerPort → App thread pool → DB/API

A 502/504 is generated at the proxy layer when the upstream fails or does not respond in time—not by your Spring exception handler.

502 vs 504 (typical meanings)

Code	Often means
502	Upstream closed connection, invalid response, connection refused, reset
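504	Upstream did not respond before the proxy's read timeout (`proxy_read_timeout` in NGINX)
503 (app)	App reports overload or unavailability; some ingress controllers also return 503 for rate limiting or when a Service has no ready endpoints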

Common causes with “green” pods

Gateway timeout < app latency: the app is still working when the proxy gives up and returns 504.
Thread pool full: the health check stays fast while API threads are blocked (see the pool guide).
Wrong Service targetPort: traffic goes to a port nothing listens on (intermittent if pods have mixed config).
No endpoints: selector mismatch, or every pod is NotReady even though the Deployment reports the desired replica count.
Readiness too shallow: the DB is down but the probe still reports “UP” because it only checks that the process is alive.
Connection reset: the app aborted the request, the OOM killer struck, or the pod restarted mid-request.
Ingress / mesh misconfig: wrong backend, connection limits, HTTP vs HTTPS mismatch.
NetworkPolicy: the ingress controller cannot reach the pod port.
During deploy: old pods are terminated while the LB still routes to them, causing a brief 502 burst.

What — investigate layer by layer

Confirm where the error is generated — response headers (via, server: nginx, cloud LB request id). App logs empty for that request → failed before app.
Gateway / ingress access logs
```
upstream_status, upstream_connect_time, upstream_response_time,
upstream_addr, status, request_time
```
upstream_status empty or - → could not connect. High upstream_response_time near timeout → 504.
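To illustrate how those fields can be read together, here is a minimal sketch that classifies one access-log entry. It assumes the line has already been parsed into a dict keyed by the field names above; the function name, the 60-second default, and the 0.95 threshold are hypothetical choices for illustration, not part of any ingress controller.

```python
def classify_entry(entry, proxy_read_timeout=60.0):
    """Guess which failure mode one ingress access-log entry points at."""
    status = int(entry["status"])
    upstream_status = entry.get("upstream_status", "-").strip()
    raw_time = entry.get("upstream_response_time", "-").strip()

    if status not in (502, 504):
        return "not a gateway error"
    if upstream_status in ("", "-"):
        return "could not connect to upstream"
    try:
        upstream_time = float(raw_time)
    except ValueError:
        # Field may be "-" or a comma-separated list after retries.
        upstream_time = None
    near_timeout = upstream_time is not None and upstream_time >= 0.95 * proxy_read_timeout
    if status == 504 and near_timeout:
        return "upstream timed out"
    if status == 502:
        return "upstream closed or reset the connection"
    return "unclear, inspect upstream_addr and app logs"
```

Running this over a sample of failing requests gives a quick split between connect failures and timeouts before you dig into any single request.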
Endpoints actually populated
```
kubectl get endpoints my-service -n prod
kubectl describe svc my-service -n prod
```
Empty subsets → no ready pods or selector wrong.
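The same check can be scripted against the JSON form of the Endpoints object. The sketch below assumes the output of `kubectl get endpoints my-service -n prod -o json` has been loaded into a dict; the helper name and the sample IP are made up for illustration.

```python
import json

def ready_addresses(endpoints):
    """Return (ready, not_ready) pod IPs from an Endpoints object given as a dict."""
    ready, not_ready = [], []
    for subset in endpoints.get("subsets") or []:
        ready += [a["ip"] for a in subset.get("addresses") or []]
        not_ready += [a["ip"] for a in subset.get("notReadyAddresses") or []]
    return ready, not_ready

# Hypothetical sample: one pod exists but is failing its readiness probe.
sample = json.loads('{"subsets": [{"notReadyAddresses": [{"ip": "10.0.0.7"}]}]}')
ready, not_ready = ready_addresses(sample)
if not ready:
    print("no ready endpoints; not ready:", not_ready)
```

An empty ready list with non-empty not-ready addresses points at failing probes; both lists empty points at a selector mismatch.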
Bypass ingress: hit pod directly
```
kubectl port-forward pod/my-pod-abc 8080:8080 -n prod
curl -sv http://127.0.0.1:8080/api/orders -H '...'
```
Works on port-forward but fails via ingress → ingress, Service, or NetworkPolicy.

Same path as users through Service

```
kubectl run curl --rm -it --image=curlimages/curl -- \
  curl -sv http://my-service.prod.svc.cluster.local/orders
```
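Once you have results from the direct pod check, the in-cluster Service check, and the public URL, the narrowing logic is mechanical. A small sketch, with a hypothetical function name and wording:

```python
def localize_failure(pod_ok, service_ok, ingress_ok):
    """Map three check outcomes to the layer most likely at fault."""
    if not pod_ok:
        return "app or pod: fails even via port-forward"
    if not service_ok:
        return "Service or NetworkPolicy: check selector, targetPort, policies"
    if not ingress_ok:
        return "ingress or a layer in front of it: check backend, timeouts, TLS mode"
    return "path is healthy now: look for intermittent causes such as deploys"

# Pod answers directly, but the Service path fails.
print(localize_failure(pod_ok=True, service_ok=False, ingress_ok=False))
```

The checks are ordered from the innermost layer outward, so the first failing one names the layer to inspect.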

Compare health vs business route — /health 200 while /orders hangs → deepen readiness; check pools and DB.
App logs + trace — if request id never appears, connection never reached servlet container. If appears then stalls → trace downstream.
Recent deploy / config — timeout lowered, probe path changed — deploy regression.
Pod events
```
kubectl describe pod my-pod-abc -n prod
```
OOMKilled, restarts, failed readiness flapping.
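When many pods are involved, reading `describe` output by hand gets slow. The sketch below pulls restart counts and the last termination reason out of a Pod object, assuming the JSON from `kubectl get pod <name> -o json` has been loaded into a dict; the helper name and sample values are illustrative.

```python
def restart_findings(pod):
    """List (container, restart_count, last_exit_reason) for containers that restarted."""
    findings = []
    for cs in pod.get("status", {}).get("containerStatuses") or []:
        terminated = (cs.get("lastState") or {}).get("terminated") or {}
        if cs.get("restartCount", 0) > 0:
            findings.append((cs["name"], cs["restartCount"], terminated.get("reason", "unknown")))
    return findings

# Hypothetical pod status: the app container was OOM-killed three times.
pod = {"status": {"containerStatuses": [
    {"name": "app", "restartCount": 3,
     "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
    {"name": "sidecar", "restartCount": 0, "lastState": {}},
]}}
print(restart_findings(pod))
```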

Timeout chain (must be ordered)

client_timeout ≥ ingress_timeout ≥ app_read_timeout ≥ DB_timeout

Bad: ingress 30s, app default 60s, DB 120s → ingress 504 while app still waiting
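The ordering rule is easy to check mechanically. This is a minimal sketch that takes the timeouts from outermost to innermost layer and reports every adjacent pair that breaks the rule; the function name and the layer labels are assumptions for illustration.

```python
def timeout_violations(chain):
    """chain: list of (layer, seconds) ordered from client side to database."""
    violations = []
    for (outer, outer_s), (inner, inner_s) in zip(chain, chain[1:]):
        if outer_s < inner_s:
            violations.append(f"{outer} ({outer_s}s) < {inner} ({inner_s}s)")
    return violations

# The "bad" configuration from the text.
bad = [("ingress", 30), ("app", 60), ("db", 120)]
print(timeout_violations(bad))
# → ['ingress (30s) < app (60s)', 'app (60s) < db (120s)']
```

A check like this could run in CI against rendered config so that a lowered ingress timeout is caught before deploy.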

How — fix and harden the path

Fixes by finding

Finding	Fix
504, slow upstream_response_time	Fix app/DB slowness; or raise ingress timeout temporarily while fixing root cause
502, connection refused	Fix targetPort; bind the container to 0.0.0.0, not 127.0.0.1; confirm the pod is actually Ready
Health OK, API hung	Make readiness include the DB; resize the pool; add timeouts (see the pool guide)
Deploy blip	preStop sleep, graceful shutdown, PDB, slower roll
NetworkPolicy	Allow ingress namespace → app port

Stronger readiness (Spring example)

```
# readinessProbe checks dependency, not only ping
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
```

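Make readiness fail when the DB pool cannot serve traffic, not merely report that the JVM process is up. In Spring Boot the readiness group contains only readinessState by default; adding the db indicator (management.endpoint.health.group.readiness.include=readinessState,db) pulls the datasource check in.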
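The idea behind a deeper readiness check is independent of the framework: run a cheap dependency check under a deadline and report not-ready if it is slow or fails. The following is a language-neutral sketch in Python, not Spring's actuator API; `readiness_status`, `check_dependency`, and the one-second deadline are hypothetical names and values.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Single worker: a stuck check keeps it busy, so later probes also
# time out and the pod stays not ready until the dependency recovers.
_probe_pool = ThreadPoolExecutor(max_workers=1)

def readiness_status(check_dependency, deadline_s=1.0):
    """Return the HTTP status a readiness endpoint should answer with: 200 or 503."""
    future = _probe_pool.submit(check_dependency)
    try:
        return 200 if future.result(timeout=deadline_s) else 503
    except FutureTimeout:
        return 503  # dependency too slow, treat as not ready
    except Exception:
        return 503  # dependency check raised
```

In practice `check_dependency` would be something like borrowing a connection from the pool and running a trivial query, so the probe fails for the same reason real requests would.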

Graceful shutdown

server.shutdown=graceful + adequate spring.lifecycle.timeout-per-shutdown-phase.
preStop hook sleep so Endpoints remove pod before SIGTERM.
Stop accepting new work before kill.
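These settings share one time budget: the pod's terminationGracePeriodSeconds (30 seconds by default) has to cover the preStop sleep plus the graceful shutdown phase, or the container is killed mid-drain. A small arithmetic sketch, where the function name and the two-second margin are assumptions:

```python
def shutdown_budget_ok(grace_period_s, pre_stop_sleep_s, shutdown_phase_s, margin_s=2):
    """True if preStop sleep plus graceful shutdown fits inside the pod's grace period."""
    return pre_stop_sleep_s + shutdown_phase_s + margin_s <= grace_period_s

# Default 30s grace period, 10s preStop sleep, 30s shutdown phase: does not fit.
print(shutdown_budget_ok(grace_period_s=30, pre_stop_sleep_s=10, shutdown_phase_s=30))
# Raising the grace period to 45s leaves room for both.
print(shutdown_budget_ok(grace_period_s=45, pre_stop_sleep_s=10, shutdown_phase_s=30))
```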

Monitoring

Alert on ingress 5xx rate, not only pod Ready count.
Metric: endpoints_available vs deployment_replicas_desired.
Synthetic check through public URL (same path as users).
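As a rough sketch of the first two alert conditions, the functions below combine an ingress 5xx ratio with a comparison of ready endpoints against desired replicas. The inputs are assumed to come from your metrics system; the function names and the 1% threshold are hypothetical.

```python
def ingress_5xx_ratio(status_counts):
    """status_counts: mapping of HTTP status code (str or int) to request count."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in status_counts.items() if 500 <= int(code) <= 599)
    return errors / total

def should_alert(status_counts, ready_endpoints, desired_replicas, max_ratio=0.01):
    """Alert on user-visible errors or on fewer ready endpoints than desired replicas."""
    too_many_errors = ingress_5xx_ratio(status_counts) > max_ratio
    missing_endpoints = ready_endpoints < desired_replicas
    return too_many_errors or missing_endpoints

# All pods Ready, yet 2% of requests fail at the ingress: still alert.
print(should_alert({"200": 980, "502": 12, "504": 8}, ready_endpoints=3, desired_replicas=3))
```

The example shows the case this guide is about: the pod count looks fine while users see errors, so the alert has to key off the ingress error ratio.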

Verify

Public curl returns 200 for representative API.
Ingress 502/504 rate near zero under load test.
Readiness fails when DB down (pod removed from Service).

Interview one-liner

“Healthy pods only mean probes passed, so I check ingress upstream logs, endpoints, curl via Service and port-forward, and whether /health is cheaper than real traffic. Usually it’s timeout mismatch, empty endpoints, or thread pool exhaustion while /health still returns 200.”