Application crashes without clear errors

Scenario

The process or pod disappears: Kubernetes restarts it, users see 502s, but application logs show no stack trace—maybe a line about “shutdown” or nothing at all. Exit code 137 or 143 shows up in events. How do you find what killed it?

After reading, you should be able to:

Map exit codes and K8s Reason to kill mechanism.
Find evidence outside app logs: events, hs_err_pid, node OOM killer, probes.
Distinguish cgroup OOMKilled from in-JVM OutOfMemoryError.
Fix sizing, probes, and native-memory issues with verification.

Related: OutOfMemoryError (when the JVM logs before dying), slow after hours (often precedes silent kill).

Why — silent exits are usually “killed from outside”

A normal Java exception prints a stack trace to stderr or the log, and the JVM keeps running: at worst the exception terminates the one thread that threw it. A silent crash means either the OS or orchestrator terminated the process, or the JVM hit a fatal error that bypassed your logging pipeline. Either the log appender never flushed, or the killer gave the JVM no time to log.
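
To make the difference between a catchable and an uncatchable exit concrete, here is a minimal sketch of a "last words" logger. The class name LastWords and the thread name are hypothetical. It assumes plain stderr logging; a real service would flush its logging framework instead.

```java
import java.time.Instant;

/** Logs on orderly shutdown and on uncaught exceptions; it cannot observe SIGKILL. */
final class LastWords {
    /** Registers the handlers and returns the shutdown hook so callers can remove it again. */
    static Thread install() {
        // A thread that dies from an uncaught exception reports it before unwinding.
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            System.err.println("Uncaught in " + thread.getName() + ": " + error);
            error.printStackTrace();
        });
        // Runs on normal exit, System.exit, and SIGTERM; never on SIGKILL.
        Thread hook = new Thread(() -> {
            System.err.println("JVM shutting down at " + Instant.now());
            System.err.flush();
        }, "last-words");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        install();
        System.out.println("Exiting normally, so the shutdown hook fires.");
    }
}
```

In a long-running service, kill <pid> (SIGTERM) triggers the same hook, while kill -9 <pid> (SIGKILL) skips it entirely. That is why a 137 exit usually leaves nothing in the app log.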

Exit codes and meanings

Exit code	Signal	Typical cause
137	SIGKILL (128+9)	Kubernetes OOMKilled (cgroup memory limit), node OOM killer, grace period expired after SIGTERM (including a slow preStop hook), manual `kubectl delete --force`
143	SIGTERM (128+15)	Graceful pod termination or deploy rollout, handled within the grace period
1	none	Uncaught exception on main thread (usually has stack trace)
134	SIGABRT (128+6)	JVM abort, failed assertion, some native library crashes
Other non-zero (not 137)	varies	JVM fatal error → check `hs_err_pid*.log`
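
The 128 + signal convention in the table can be sketched as a small decoder. The class ExitCodes and its method names are hypothetical. The signal numbers are the standard Linux ones for the signals mentioned in this guide.

```java
import java.util.Map;

/** Decodes container exit codes using the convention 128 + signal number. */
final class ExitCodes {
    // Linux signal numbers for the signals discussed in this guide.
    private static final Map<Integer, String> SIGNALS = Map.of(
            6, "SIGABRT",
            9, "SIGKILL",
            11, "SIGSEGV",
            15, "SIGTERM");

    /** Returns the signal number encoded in an exit code, or -1 if the code is not signal-derived. */
    static int signalNumber(int exitCode) {
        return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : -1;
    }

    /** Human-readable description of an exit code. */
    static String describe(int exitCode) {
        if (exitCode == 0) {
            return "clean exit";
        }
        int sig = signalNumber(exitCode);
        if (sig < 0) {
            return "application exit code " + exitCode;
        }
        return SIGNALS.getOrDefault(sig, "signal " + sig) + " (128+" + sig + ")";
    }

    public static void main(String[] args) {
        for (int code : new int[] {137, 143, 134, 1}) {
            System.out.println(code + " -> " + describe(code));
        }
    }
}
```

The decoder only tells you which signal ended the process. Who sent it still has to come from events, kubelet state, or the node's kernel log.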

Top causes in production

Container memory limit exceeded: -Xmx + metaspace + thread stacks + direct memory + JVM overhead > cgroup limit. The kernel sends SIGKILL, so no Java OOM appears in the logs. See OOM guide (K8s section).
-XX:+ExitOnOutOfMemoryError: the JVM exits on the first OOM; you might see only one OOM line, or lose it if the log buffer was not flushed.
Liveness probe failure loop: the kubelet kills the “unhealthy” container after failureThreshold consecutive failures; under long GC pauses this can look like random restarts.
Node memory pressure: the kubelet evicts pods; the reason (Evicted, or OOM on the node) shows up in events, not in your app log.
JVM fatal error: writes hs_err_pid<pid>.log (SIGSEGV in JNI, corrupted metaspace, JDK bug). May not reach the application logger.
StackOverflowError: deep recursion; sometimes logged, sometimes fatal depending on thread and state.
Native OOM: malloc fails outside the Java heap; the process is killed or aborts.
PID / thread limit: cgroup pids.max is hit, the JVM reports “unable to create native thread”, and the process is left in an unstable state.

137 is not a Java bug. It means something external sent SIGKILL. Your job is to find who—cgroup, node, or human—and why memory or health failed.

What — investigate outside the app log (in order)

Kubernetes: previous container state

kubectl describe pod <pod> -n <ns>
# Last State: Terminated, Reason: OOMKilled, Exit Code: 137
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState}'

Restart count and timing

Check the RESTARTS column in kubectl get pods; correlate with deploys, traffic, cron.

Container memory vs limit (metrics)

Was the memory working set at the limit before death? Prometheus: container_memory_working_set_bytes ≈ limit.

Compare -Xmx to cgroup limit

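kubectl exec <pod> -n <ns> -- java -XX:+PrintFlagsFinal -version | grep -i heapsize
# This starts a fresh JVM: it reflects JAVA_TOOL_OPTIONS and container defaults, not flags passed only on the app's command line.
# The limit must exceed -Xmx plus headroom (~25–40%).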
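
The same comparison can be made from inside the running JVM. The sketch below assumes the standard cgroup v2 and v1 file locations and treats headroom as the fraction of the limit not claimed by the heap, which is one reading of the percentage above. HeapVsLimit is a hypothetical name. Runtime.maxMemory() is only approximately equal to -Xmx.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;

/** Compares the JVM's max heap with the container memory limit. */
final class HeapVsLimit {
    // Usual cgroup v2 and v1 locations inside a container (assumed mount points).
    private static final Path CGROUP_V2 = Path.of("/sys/fs/cgroup/memory.max");
    private static final Path CGROUP_V1 = Path.of("/sys/fs/cgroup/memory/memory.limit_in_bytes");

    /** Parses a cgroup limit value; "max" (cgroup v2) means no limit. */
    static OptionalLong parseLimit(String raw) {
        String value = raw.trim();
        if (value.isEmpty() || value.equals("max")) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Long.parseLong(value));
    }

    /** Reads the first readable cgroup limit file, if any. */
    static OptionalLong readLimit() {
        for (Path p : new Path[] {CGROUP_V2, CGROUP_V1}) {
            try {
                if (Files.isReadable(p)) {
                    return parseLimit(Files.readString(p));
                }
            } catch (IOException | NumberFormatException e) {
                // Unreadable or unexpected content: try the next location.
            }
        }
        return OptionalLong.empty();
    }

    /** Fraction of the limit left for non-heap memory. */
    static double headroom(long limitBytes, long maxHeapBytes) {
        return 1.0 - (double) maxHeapBytes / limitBytes;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        OptionalLong limit = readLimit();
        if (limit.isEmpty()) {
            System.out.println("No cgroup memory limit found; max heap = " + maxHeap + " bytes");
            return;
        }
        System.out.printf("max heap=%d limit=%d headroom=%.0f%%%n",
                maxHeap, limit.getAsLong(), 100 * headroom(limit.getAsLong(), maxHeap));
    }
}
```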

Events: liveness / readiness / eviction

Look for Liveness probe failed, Evicted, NodeHasInsufficientMemory.

Find JVM hs_err_pid file

By default the JVM writes it to its working directory, falling back to /tmp; also check any volume mount, or host /var/log if configured. Set an explicit location with -XX:ErrorFile=/dumps/hs_err_%p.log.

Previous pod logs (short window)
```
kubectl logs <pod> --previous --tail=200
```

The last lines may show OOM, GC overhead, or a probe timeout that never reaches the current log stream.

Node-level OOM (if the container was not OOMKilled)

On the node (if permitted): dmesg | grep -i 'killed process' shows the Linux OOM killer acting at host level.

Liveness probe false positives

If Reason is Killing with “Container failed liveness probe”:

GC pause > probe timeoutSeconds → increase timeout or use startup probe during warm-up.
Probe hits dependency (DB) that is slow → probe should check app heartbeat only, not full stack.
failureThreshold: 3 × periodSeconds: 10 = container killed after about 30s of slow responses.
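
The probe arithmetic above can be sketched as two small checks. ProbeBudget and its method names are hypothetical. The kill window is the same approximation used in the list, not an exact model of kubelet timing.

```java
/** Rough liveness-probe arithmetic; an approximation, not kubelet's exact scheduling. */
final class ProbeBudget {
    /** Approximate seconds of continuous failure before the kubelet restarts the container. */
    static int killWindowSeconds(int failureThreshold, int periodSeconds) {
        return failureThreshold * periodSeconds;
    }

    /** A single probe attempt fails if the app stalls longer than timeoutSeconds. */
    static boolean pauseFailsProbe(long pauseMillis, int timeoutSeconds) {
        return pauseMillis > timeoutSeconds * 1000L;
    }

    public static void main(String[] args) {
        System.out.println("kill window: " + killWindowSeconds(3, 10) + "s");
        System.out.println("1.5s pause fails a 1s-timeout probe: " + pauseFailsProbe(1500, 1));
    }
}
```

Feeding in the GC pause p99 from your GC logs shows quickly whether timeoutSeconds leaves any margin.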

What to collect for the ticket

Pod name / node:
Exit code & reason:
Container memory limit & peak usage:
-Xmx -Xms MaxMetaspaceSize:
Restart count in last 24h:
Last 50 lines kubectl logs --previous:
hs_err_pid present (Y/N):
Recent deploy SHA:
Liveness/readiness probe config:

How — fix, harden, alert

Fix OOMKilled (most common)

Lower -Xmx or raise the container memory limit. Never set -Xmx equal to the limit, which leaves no headroom for non-heap memory.
Fix the memory leak if usage climbs over hours (slow after hours).
Cap direct memory if you use Netty/gRPC: -XX:MaxDirectMemorySize=<size>.
Validate with a soak test; watch the working set stay below ~85% of the limit.

resources:
  limits:
    memory: "3Gi"
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xms2g -Xmx2g -XX:MaxMetaspaceSize=256m"
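
To sanity-check a sizing like this one, it can help to add up the budget explicitly. In the sketch below only the 2 GiB heap, the 256 MiB metaspace cap, the 3 GiB limit, and the ~85% target come from this guide. The thread count, stack size, direct memory, and overhead figures are placeholder assumptions to replace with your own measurements. MemoryBudget is a hypothetical name.

```java
/** Adds up a JVM memory budget in MiB and checks it against a container limit. */
final class MemoryBudget {
    /** Sums the main JVM memory consumers, all in MiB. */
    static long estimateMiB(long heap, long metaspace, int threads, long stackMiBPerThread,
                            long direct, long otherOverhead) {
        return heap + metaspace + threads * stackMiBPerThread + direct + otherOverhead;
    }

    /** True if the estimate stays within the given fraction of the limit. */
    static boolean fits(long estimateMiB, long limitMiB, double maxFraction) {
        return estimateMiB <= limitMiB * maxFraction;
    }

    public static void main(String[] args) {
        long limitMiB = 3072; // memory: "3Gi"
        // Placeholder assumptions: 100 threads x 1 MiB stack, 128 MiB direct, 64 MiB other overhead.
        long sized = estimateMiB(2048, 256, 100, 1, 128, 64);
        long heapEqualsLimit = estimateMiB(3072, 256, 100, 1, 128, 64);
        System.out.println("-Xmx2g: " + sized + " MiB, fits 85%: " + fits(sized, limitMiB, 0.85));
        System.out.println("-Xmx3g: " + heapEqualsLimit + " MiB, fits 85%: "
                + fits(heapEqualsLimit, limitMiB, 0.85));
    }
}
```

With these placeholder numbers the 2 GiB heap only just fits, and a heap equal to the limit does not fit at all. A static estimate like this does not replace checking the real working set under load.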

Fix liveness-kill loop

Add startupProbe with generous failureThreshold during warm-up.
Liveness: lightweight endpoint (e.g. /health/live) that does not query DB.
Align timeouts with GC pause p99 + margin.
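
A lightweight liveness endpoint might look like the sketch below, which uses the JDK's built-in com.sun.net.httpserver. The class LivenessEndpoint and the selfCheck helper are hypothetical; most services would expose the same path through their existing web framework instead.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Minimal liveness endpoint that proves the process can serve HTTP and nothing more. */
final class LivenessEndpoint {
    /** Starts a server exposing /health/live; port 0 picks a free port. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health/live", exchange -> {
            // Deliberately no database or downstream calls here.
            byte[] body = "OK".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    /** Starts the endpoint on a free port, probes it once, stops it; returns the status or -1. */
    static int selfCheck() {
        HttpServer server = null;
        try {
            server = start(0);
            int port = server.getAddress().getPort();
            URL url = URI.create("http://localhost:" + port + "/health/live").toURL();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            return conn.getResponseCode();
        } catch (IOException e) {
            return -1;
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("GET /health/live -> " + selfCheck());
    }
}
```

Dependency checks (database, downstream services) belong on a readiness endpoint, where a failure removes the pod from traffic instead of restarting it.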

Fix JVM fatal errors

Read hs_err_pid — faulting frame in JDK vs your JNI library.
Upgrade JDK patch release; reproduce with native symbols if custom JNI.
Disable risky agents (old APM native agents) as isolation test.

Logging and dumps on fatal exit

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/dumps
-XX:+ExitOnOutOfMemoryError
-XX:ErrorFile=/dumps/hs_err_%p.log
-XX:-OmitStackTraceInFastThrow   # keep full stack traces for repeated exceptions (the + form is the default and omits them)

Mount /dumps on an emptyDir with a size limit; have a sidecar or job copy the files out before the pod is deleted.
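
Flags set through an environment variable are easy to lose in a deploy. A startup check along these lines can confirm they reached the JVM. CrashFlagCheck is a hypothetical name, and the sketch assumes HotSpot, where RuntimeMXBean.getInputArguments() is expected to include options picked up from JAVA_TOOL_OPTIONS.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

/** Reports which crash-diagnostics flags are absent from the JVM's input arguments. */
final class CrashFlagCheck {
    // Flag prefixes taken from the list above.
    static final List<String> EXPECTED = List.of(
            "-XX:+HeapDumpOnOutOfMemoryError",
            "-XX:+ExitOnOutOfMemoryError",
            "-XX:ErrorFile=");

    /** Returns the expected flag prefixes that no JVM argument starts with. */
    static List<String> missing(List<String> jvmArgs) {
        return EXPECTED.stream()
                .filter(flag -> jvmArgs.stream().noneMatch(arg -> arg.startsWith(flag)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Missing crash-diagnostics flags: " + missing(jvmArgs));
    }
}
```

Logging the result once at startup puts the answer to "were the dump flags active?" in the ticket, even when the crash itself leaves no trace.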

Alerts

Kube-state-metrics: pod restart rate > N per hour.
Alert on OOMKilled reason in events (event exporter).
Memory working set > 90% of limit for 5 min.

Interview one-liner

“Exit 137 means SIGKILL—usually cgroup OOMKilled, not a missing log line. I check describe pod, memory limit vs peak, logs --previous, and hs_err_pid; then fix heap sizing or leak, or fix liveness probes killing the container during long GC.”

Related scenarios

OutOfMemoryError — when JVM logs before exit.
Memory leak — leads to OOMKilled without Java stack.
RSS grows, heap stable — native memory and NMT.
Kubernetes debugging — events and probes.