Gateway 502/504 but Kubernetes pods are healthy

Scenario

Users see 502 Bad Gateway or 504 Gateway Timeout from the edge. kubectl get pods shows Running and readiness probes pass. The gap is between “process is up” and “the path through ingress → service → app can complete in time.” You debug by walking outward from the pod and inward from the client until the layer that fails is obvious.

Why — probes and user traffic are different paths

Kubernetes marks a pod Ready when the readiness probe succeeds—often a lightweight /actuator/health on a dedicated port. That path may not touch the database, thread pool, or code that serves POST /checkout. The gateway only routes to pods in the Endpoints list; if probes pass but app workers are stuck, you get healthy pods and failed customer traffic.
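
To make the gap concrete, here is a minimal Python simulation (not Spring). The names simulate_probe_vs_traffic, WORKERS, and GATEWAY_TIMEOUT_S are made up for illustration, and the 0.2 s timeout is shortened for the demo. A small worker pool is saturated by requests blocked on a pretend database, while the health check runs outside that pool and still answers 200.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

WORKERS = 2               # hypothetical size of the app's request thread pool
GATEWAY_TIMEOUT_S = 0.2   # hypothetical proxy read timeout, shortened for the demo


def simulate_probe_vs_traffic():
    """Return (probe_status, user_status) for a pod whose workers are all stuck."""
    db_unblocked = threading.Event()

    def blocked_request():
        db_unblocked.wait()   # stands in for a worker waiting on a slow DB call
        return 200

    def health_probe():
        return 200            # cheap check that never touches the worker pool

    pool = ThreadPoolExecutor(max_workers=WORKERS)
    try:
        # Saturate every worker with a request that cannot finish yet.
        for _ in range(WORKERS):
            pool.submit(blocked_request)
        # A new user request queues behind the stuck workers.
        user_request = pool.submit(lambda: 200)
        try:
            user_status = user_request.result(timeout=GATEWAY_TIMEOUT_S)
        except FutureTimeout:
            user_status = 504  # what the gateway would report after giving up
        probe_status = health_probe()
    finally:
        db_unblocked.set()     # release the workers so the pool can shut down
        pool.shutdown(wait=True)
    return probe_status, user_status


print(simulate_probe_vs_traffic())  # (200, 504)
```

The probe reports 200 and the user request ends as 504 in the same instant, which is the "green pods, failing traffic" picture described above.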

Request path layers

Client → CDN/WAF → Load balancer → Ingress controller → Service → Pod:containerPort → App thread pool → DB/API

A 502/504 is generated at the proxy layer when the upstream fails or does not respond in time—not by your Spring exception handler.

502 vs 504 (typical meanings)

Code       Often means
502        Upstream closed the connection, sent an invalid response, refused the connection, or reset it
504        Upstream did not respond before the proxy's read timeout (proxy_read_timeout in NGINX)
503 (app)  Service overloaded; often returned by the app itself or by an ingress rate limit

Common causes with “green” pods

What — investigate layer by layer

  1. Confirm where the error is generated — response headers (via, server: nginx, cloud LB request id). App logs empty for that request → failed before app.
  2. Gateway / ingress access logs
    upstream_status, upstream_connect_time, upstream_response_time,
    upstream_addr, status, request_time
    upstream_status empty or - → could not connect. High upstream_response_time near timeout → 504.
  3. Endpoints actually populated
    kubectl get endpoints my-service -n prod
    kubectl describe svc my-service -n prod
    Empty subsets → no ready pods or selector wrong.
  4. Bypass ingress: hit pod directly
    kubectl port-forward pod/my-pod-abc 8080:8080 -n prod
    curl -sv http://127.0.0.1:8080/api/orders -H '...'
    Works on port-forward but fails via ingress → ingress, Service, or NetworkPolicy.
  5. Same path as users through Service
    kubectl run curl --rm -it --image=curlimages/curl -- \
      curl -sv http://my-service.prod.svc.cluster.local/orders
  6. Compare health vs business route
    /health returns 200 while /orders hangs → deepen readiness; check pools and DB.
  7. App logs + trace: if the request id never appears, the connection never reached the servlet container. If it appears and then stalls → trace downstream calls.
  8. Recent deploy / config: a lowered timeout or a changed probe path points to a deploy regression.
  9. Pod events
    kubectl describe pod my-pod-abc -n prod
    OOMKilled, restarts, failed readiness flapping.
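
The log-reading heuristics from step 2 can be sketched as a small Python helper. This is only an illustration: the function name, the returned labels, and the near_ratio threshold are invented here, and the 60 s default mirrors NGINX's documented proxy_read_timeout default, so adjust it to whatever your ingress actually uses.

```python
def classify_gateway_error(status, upstream_status, upstream_response_time,
                           read_timeout_s=60.0, near_ratio=0.95):
    """Heuristically label a gateway access-log entry.

    status: HTTP status returned to the client (int).
    upstream_status / upstream_response_time: raw log fields as strings.
    """
    if status not in (502, 504):
        return "not_gateway_error"
    # Heuristic from the steps above: an empty or "-" upstream_status
    # is read as "the proxy never got a connection to a pod".
    if upstream_status in ("", "-"):
        return "no_upstream_connection"
    try:
        elapsed = float(upstream_response_time)
    except (TypeError, ValueError):
        elapsed = None   # field missing or not numeric
    near_timeout = elapsed is not None and elapsed >= near_ratio * read_timeout_s
    if status == 504 or near_timeout:
        return "upstream_timeout"
    return "upstream_closed_or_reset"


for entry in [(502, "-", "-"), (504, "504", "60.001"), (502, "502", "0.003")]:
    print(entry, "->", classify_gateway_error(*entry))
```

A "no_upstream_connection" result points at steps 3 to 5 (endpoints, port-forward, Service), "upstream_timeout" at steps 6 and 7, and "upstream_closed_or_reset" at pod events in step 9.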

Timeout chain (must be ordered)

client_timeout ≥ ingress_timeout ≥ app_read_timeout ≥ DB_timeout

Bad: ingress 30s, app default 60s, DB 120s → ingress 504 while app still waiting
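
The ordering rule is easy to check mechanically. The following Python sketch assumes you have collected each layer's timeout by hand; find_timeout_violations and the layer names are illustrative, not part of any tool.

```python
def find_timeout_violations(chain):
    """Return descriptions of adjacent layers whose timeouts are out of order.

    chain: list of (layer_name, timeout_seconds), outermost layer first.
    Each outer layer should wait at least as long as the layer it calls.
    """
    violations = []
    for (outer, outer_s), (inner, inner_s) in zip(chain, chain[1:]):
        if outer_s < inner_s:
            violations.append(f"{outer} ({outer_s}s) < {inner} ({inner_s}s)")
    return violations


bad = [("ingress", 30), ("app", 60), ("db", 120)]
good = [("client", 60), ("ingress", 30), ("app", 20), ("db", 10)]

print(find_timeout_violations(bad))   # ['ingress (30s) < app (60s)', 'app (60s) < db (120s)']
print(find_timeout_violations(good))  # []
```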

How — fix and harden the path

Fixes by finding

Finding                           Fix
504, slow upstream_response_time  Fix app/DB slowness, or raise the ingress timeout temporarily while fixing the root cause
502, connection refused           Fix targetPort and the container listen address (0.0.0.0); check that the pod is actually ready
Health OK, API hung               Make readiness include the DB; scale the pool; set pool timeouts
Deploy blip                       preStop sleep, graceful shutdown, PDB, slower rollout
NetworkPolicy                     Allow ingress namespace → app port

Stronger readiness (Spring example)

# readinessProbe checks dependency, not only ping
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080

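Implement readiness so that it fails when the DB pool cannot serve traffic, rather than passing merely because the JVM process is up. In Spring Boot the readiness group contains only readinessState by default, so dependency checks such as db have to be added to it explicitly (for example with management.endpoint.health.group.readiness.include=readinessState,db).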
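
The decision itself is small. Below is a language-neutral sketch in Python rather than Spring code; PoolStats, its fields, and readiness_status are hypothetical names, and in a Spring app the same logic would more likely live in a health indicator included in the readiness group.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    """Hypothetical snapshot of a DB connection pool."""
    active: int    # connections currently in use
    maximum: int   # configured pool size
    pending: int   # callers waiting for a connection


def readiness_status(db_ping_ok, pool, max_pending=0):
    """Return (http_status, reason) for a readiness endpoint."""
    if not db_ping_ok:
        return 503, "db unreachable"
    # A full pool with callers still queued means new requests would stall.
    if pool.active >= pool.maximum and pool.pending > max_pending:
        return 503, "connection pool exhausted"
    return 200, "ready"


print(readiness_status(True, PoolStats(active=2, maximum=10, pending=0)))
print(readiness_status(True, PoolStats(active=10, maximum=10, pending=5)))
print(readiness_status(False, PoolStats(active=0, maximum=10, pending=0)))
```

Returning 503 here removes the pod from the Service endpoints while leaving liveness untouched, so a struggling pod is taken out of rotation instead of being restarted.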

Graceful shutdown

Monitoring

Verify

  1. Public curl returns 200 for representative API.
  2. Ingress 502/504 rate near zero under load test.
  3. Readiness fails when DB down (pod removed from Service).
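
For check 2, "near zero" is easier to judge with a number. This is a minimal Python sketch assuming you can export the client-facing status codes from a load-test run as a list of integers; gateway_error_rate is an illustrative name and any pass/fail threshold is yours to choose.

```python
def gateway_error_rate(statuses):
    """Fraction of responses that were 502 or 504; 0.0 for an empty run."""
    if not statuses:
        return 0.0
    errors = sum(1 for s in statuses if s in (502, 504))
    return errors / len(statuses)


# Hypothetical load-test result: 997 successes and 3 gateway errors.
run = [200] * 997 + [502, 504, 504]
print(f"{gateway_error_rate(run):.3%}")  # 0.300%
```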

Interview one-liner

“Healthy pods only mean probes passed—I check ingress upstream logs, endpoints, curl via Service and port-forward, and whether /health is cheaper than real traffic. Usually it’s timeout mismatch, empty endpoints, or thread pool exhaustion while liveness still returns 200.”
