Application crashes without clear errors

Scenario

The process or pod disappears: Kubernetes restarts it, users see 502s, but application logs show no stack trace—maybe a line about “shutdown” or nothing at all. Exit code 137 or 143 shows up in events. How do you find what killed it?

After reading, you should be able to:

Related: OutOfMemoryError (when the JVM logs before dying), slow after hours (often precedes silent kill).

Why — silent exits are usually “killed from outside”

A normal Java exception prints a stack trace to stderr or the log, and the JVM keeps running (at most, one thread unwinds). A silent crash means the OS or orchestrator terminated the process, or the JVM hit a fatal error that bypassed your logging pipeline. Either your log appender never flushed, or the killer gave the JVM no time to log.

Exit codes and meanings

Exit code         | Signal           | Typical cause
137               | SIGKILL (128+9)  | Kubernetes OOMKilled (cgroup memory limit), node OOM killer, manual kubectl delete --force
143               | SIGTERM (128+15) | Graceful pod termination, deploy rollout; if a preStop hook or shutdown outlasts the grace period, the follow-up SIGKILL is reported as 137
1                 | none             | Uncaught exception on main thread (usually has stack trace)
134               | SIGABRT (128+6)  | JVM abort, failed assertion, some native library crashes
Non-zero, not 137 | varies           | JVM fatal error → check hs_err_pid*.log
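
The 128+N convention in the table can be checked mechanically. The following is a minimal sketch, assuming a POSIX-style shell whose kill builtin accepts a signal number after -l; the function name describe_exit_code is invented for illustration.

```shell
# Hypothetical helper: map a container exit code to the signal it implies.
describe_exit_code() {
  code=$1
  if [ "$code" -gt 128 ] && [ "$code" -le 159 ]; then
    sig=$((code - 128))
    # kill -l N prints the signal name without the SIG prefix, e.g. KILL for 9
    echo "SIG$(kill -l "$sig") (128+$sig)"
  else
    echo "no signal implied (plain exit status $code)"
  fi
}
```

For example, describe_exit_code 137 prints SIGKILL (128+9). Treat the result as a strong hint rather than proof: a process can also call exit with a value above 128 on its own.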

Top causes in production

137 is not a Java bug. It means something external sent SIGKILL. Your job is to find who—cgroup, node, or human—and why memory or health failed.

What — investigate outside the app log (in order)

  1. Kubernetes: previous container state
    kubectl describe pod <pod> -n <ns>
    # Last State: Terminated, Reason: OOMKilled, Exit Code: 137
    kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState}'
  2. Restart count and timing
    RESTARTS column in kubectl get pods; correlate with deploys, traffic, cron.
  3. Container memory vs limit (metrics)
    Was the memory working set at the limit just before death? In Prometheus: container_memory_working_set_bytes ≈ limit.
  4. Compare -Xmx to cgroup limit
    kubectl exec <pod> -- java -XX:+PrintFlagsFinal -version | grep -i heapsize
    # limit must exceed Xmx + headroom (~25–40%)
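  5. Events: liveness / readiness / eviction
    Look for Liveness probe failed, Evicted, NodeHasInsufficientMemory.
  6. Find JVM hs_err_pid file
    By default the JVM writes it to its working directory, falling back to /tmp. Also check any volume mount, and host /var/log if configured. Set the location explicitly with -XX:ErrorFile=/dumps/hs_err_%p.log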
  7. Previous pod logs (short window)
    kubectl logs <pod> --previous --tail=200
    Last lines may show OOM, GC overhead, or probe timeout—not always in current log stream.
  8. Node-level OOM (if not OOMKilled on container)
    On node (if permitted): dmesg | grep -i 'killed process' shows the Linux OOM killer acting at host level.
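
One way to read the order above is as a decision on the reason and exit code that step 1 reports. The sketch below only encodes the rules of thumb from this section and is not exhaustive; the function name next_step and the message wording are assumptions made for illustration.

```shell
# Hypothetical helper: suggest where to look next, given the terminated
# container's reason and exit code as shown by kubectl describe pod.
next_step() {
  reason=$1
  code=$2
  case "$reason:$code" in
    OOMKilled:*) echo "container hit its cgroup limit: compare memory limit with -Xmx (step 4)" ;;
    *:137)       echo "SIGKILL without OOMKilled: check events and node dmesg (steps 5 and 8)" ;;
    *:143)       echo "SIGTERM: check events for probe failures, rollouts, deletions (steps 2 and 5)" ;;
    *)           echo "look for hs_err_pid*.log and kubectl logs --previous (steps 6 and 7)" ;;
  esac
}
```

A call such as next_step OOMKilled 137 points at heap sizing, while next_step Error 137 points away from the container limit and toward the node or the kubelet.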

Liveness probe false positives

If Reason is Killing with “Container failed liveness probe”:

What to collect for the ticket

Pod name / node:
Exit code & reason:
Container memory limit & peak usage:
-Xmx -Xms MaxMetaspaceSize:
Restart count in last 24h:
Last 50 lines kubectl logs --previous:
hs_err_pid present (Y/N):
Recent deploy SHA:
Liveness/readiness probe config:
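
Several of these fields can be pre-filled from the pod object. The following is a rough sketch, assuming the Java container is the first container in the pod (index 0, as in the jsonpath query earlier) and that kubectl is already pointed at the right cluster; the function name ticket_skeleton is hypothetical. Note that restartCount is the total since pod creation, not the last-24h figure the template asks for, so that line still needs manual correlation.

```shell
# Hypothetical helper: pre-fill the ticket fields that kubectl can answer.
# Assumes the Java container is containers[0] / containerStatuses[0].
ticket_skeleton() {
  pod=$1
  ns=$2
  cs='.status.containerStatuses[0]'
  node=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')
  code=$(kubectl get pod "$pod" -n "$ns" -o jsonpath="{$cs.lastState.terminated.exitCode}")
  reason=$(kubectl get pod "$pod" -n "$ns" -o jsonpath="{$cs.lastState.terminated.reason}")
  limit=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[0].resources.limits.memory}')
  restarts=$(kubectl get pod "$pod" -n "$ns" -o jsonpath="{$cs.restartCount}")
  echo "Pod name / node: $pod / $node"
  echo "Exit code & reason: $code $reason"
  echo "Container memory limit: $limit"
  echo "Restart count (total since pod creation): $restarts"
}
```

If the container has never been restarted, lastState.terminated is absent and the exit code and reason come back empty.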

How — fix, harden, alert

Fix OOMKilled (most common)

  1. Lower -Xmx or raise the container memory limit; never set -Xmx equal to the limit with no headroom.
  2. Fix memory leak if usage climbs over hours (slow after hours).
  3. Cap direct memory if Netty/gRPC: -XX:MaxDirectMemorySize.
  4. Validate with soak test; watch working set stay below ~85% of limit.
resources:
  limits:
    memory: "3Gi"
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xms2g -Xmx2g -XX:MaxMetaspaceSize=256m"
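
The arithmetic behind this sizing can be sketched as follows. The helper names are hypothetical, sizes are in MiB, and headroom is expressed as a percentage of -Xmx, which is one possible reading of the ~25–40% guideline rather than a fixed definition.

```shell
# Hypothetical helpers; all sizes are in MiB and use integer arithmetic.

# Headroom above the heap, as a percentage of -Xmx (assumed interpretation).
heap_headroom_pct() {
  limit_mib=$1
  xmx_mib=$2
  echo $(( (limit_mib - xmx_mib) * 100 / xmx_mib ))
}

# Working-set ceiling at 85% of the container limit.
working_set_ceiling_mib() {
  limit_mib=$1
  echo $(( limit_mib * 85 / 100 ))
}
```

With a 3Gi limit (3072 MiB) and a 2g heap (2048 MiB), heap_headroom_pct 3072 2048 prints 50 and working_set_ceiling_mib 3072 prints 2611, so the soak test should keep the working set under roughly 2.6 GiB.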

Fix liveness-kill loop

Fix JVM fatal errors

Logging and dumps on fatal exit

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/dumps
-XX:+ExitOnOutOfMemoryError
-XX:ErrorFile=/dumps/hs_err_%p.log
-XX:-OmitStackTraceInFastThrow   # keep full stack traces for frequently thrown exceptions (the + form, the default, omits them)

Mount /dumps on an emptyDir with a size limit, and have a sidecar or job copy the files out before the pod is deleted.
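
A sidecar or job first needs to know which files are worth copying. A minimal sketch, assuming the flags above (fatal error logs named hs_err_<pid>.log, heap dumps ending in .hprof) and a find that supports -maxdepth, as GNU, BSD and BusyBox find do; the function name list_dumps is hypothetical.

```shell
# Hypothetical helper: list JVM crash artifacts in the dump directory.
# Defaults to /dumps, matching -XX:HeapDumpPath and -XX:ErrorFile above.
list_dumps() {
  dir=${1:-/dumps}
  find "$dir" -maxdepth 1 -type f \( -name 'hs_err_*.log' -o -name '*.hprof' \)
}
```

The output can be fed to whatever copy mechanism is in use; an empty result after a restart is itself a data point, since it suggests the process was killed from outside rather than crashing inside the JVM.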

Alerts

Interview one-liner

“Exit 137 means SIGKILL—usually cgroup OOMKilled, not a missing log line. I check describe pod, memory limit vs peak, logs --previous, and hs_err_pid; then fix heap sizing or leak, or fix liveness probes killing the container during long GC.”

Related scenarios