Application crashes without clear errors
Scenario
The process or pod disappears: Kubernetes restarts it, users see 502s, but application logs show no stack trace—maybe a line about “shutdown” or nothing at all. Exit code 137 or 143 shows up in events. How do you find what killed it?
After reading, you should be able to:
- Map exit codes and K8s Reason to kill mechanism.
- Find evidence outside app logs: events, hs_err_pid, node OOM killer, probes.
- Distinguish cgroup OOMKilled from in-JVM OutOfMemoryError.
- Fix sizing, probes, and native-memory issues with verification.
Related: OutOfMemoryError (when the JVM logs before dying), slow after hours (often precedes silent kill).
Why — silent exits are usually “killed from outside”
A normal Java exception prints to stderr/log and the JVM keeps running (or unwinds one thread). A silent crash means the OS or orchestrator terminated the process, or the JVM hit a fatal error that bypassed your logging pipeline. Your log appender never flushed—or the killer did not give the JVM time to log.
Exit codes and meanings
| Exit code | Signal | Typical cause |
|---|---|---|
| 137 | SIGKILL (128+9) | Kubernetes OOMKilled (cgroup memory limit), node OOM killer, manual kubectl delete --force |
| 143 | SIGTERM (128+15) | Graceful pod termination, deploy rollout, preStop hook timeout then kill |
| 1 | — | Uncaught exception on main thread (usually has stack trace) |
| 134 | SIGABRT (128+6) | JVM abort, failed assertion, some native library crashes |
| Non-zero, not 137 | — | JVM fatal error → check hs_err_pid*.log |
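The 128+N convention in the table can be checked mechanically. The following is a minimal sketch, assuming a Linux or macOS host where Python's `signal` module knows the standard signal numbers; the function name `describe_exit_code` is a hypothetical helper, not part of any Kubernetes tooling.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short hint.

    Exit codes above 128 conventionally mean "terminated by signal (code - 128)".
    """
    if code == 0:
        return "clean exit"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number unknown on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    # 1..128: the process exited on its own (uncaught exception, System.exit, etc.).
    return "application or JVM exited on its own; check logs and hs_err_pid*.log"


for code in (137, 143, 134, 1):
    print(code, "->", describe_exit_code(code))
```

This only identifies the signal. It does not say who sent it, which is what the rest of the investigation is for.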
Top causes in production
- Container memory limit exceeded: -Xmx + metaspace + threads + direct memory + JVM overhead > cgroup limit. Linux sends SIGKILL. No Java OOM in logs. See OOM guide (K8s section).
- -XX:+ExitOnOutOfMemoryError: JVM exits quickly on OOM; you might see only one OOM line, or lose it if the log buffer is not flushed.
- Liveness probe failure loop: kubelet kills the “unhealthy” container after failureThreshold; can look like random restarts under GC pauses.
- Node memory pressure: kubelet evicts pods; reason Evicted / OOM on node, not your app log.
- JVM fatal error: writes hs_err_pid<pid>.log (SIGSEGV in JNI, corrupted metaspace, bug in JDK). May not reach application logger.
- StackOverflowError: deep recursion; sometimes logged, sometimes fatal depending on thread/state.
- Native OOM: malloc fails outside Java heap; process killed or aborts.
- PID / thread limit: "unable to create native thread", then unstable state; cgroup pids.max hit.
137 is not a Java bug. It means something external sent SIGKILL. Your job is to find who—cgroup, node, or human—and why memory or health failed.
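To make the "heap plus everything else exceeds the limit" arithmetic concrete, here is a rough sketch that adds up the components named above. The thread count, direct-memory size, and the flat 256 MiB allowance for code cache, GC structures, and similar overhead are illustrative assumptions, not measured values; 1 MiB per thread stack is the usual default on 64-bit Linux but can differ with -Xss.

```python
def estimate_footprint_mib(xmx_mib, metaspace_mib, thread_count, direct_mib,
                           stack_mib=1, other_overhead_mib=256):
    """Rough upper bound on JVM process memory, in MiB.

    other_overhead_mib is a hypothetical flat allowance (code cache, GC
    structures, native allocations); measure it with NMT for a real service.
    """
    thread_stacks_mib = thread_count * stack_mib
    return xmx_mib + metaspace_mib + thread_stacks_mib + direct_mib + other_overhead_mib


def fits_in_limit(footprint_mib, limit_mib):
    """True when the estimated footprint stays within the cgroup limit."""
    return footprint_mib <= limit_mib


limit_mib = 3 * 1024  # 3Gi container limit

ok = estimate_footprint_mib(xmx_mib=2048, metaspace_mib=256, thread_count=200, direct_mib=128)
tight = estimate_footprint_mib(xmx_mib=2816, metaspace_mib=256, thread_count=200, direct_mib=128)

print(ok, fits_in_limit(ok, limit_mib))        # 2888 True
print(tight, fits_in_limit(tight, limit_mib))  # 3656 False
```

The second case shows how a heap that looks safely below the limit can still push the whole process over it once non-heap memory is counted.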
What — investigate outside the app log (in order)
- Kubernetes: previous container state
    kubectl describe pod <pod> -n <ns>
    # Last State: Terminated, Reason: OOMKilled, Exit Code: 137
    kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState}'
- Restart count and timing
  RESTARTS column in kubectl get pods; correlate with deploys, traffic, cron.
- Container memory vs limit (metrics)
  Was the memory working set at the limit just before death? Prometheus: container_memory_working_set_bytes ≈ limit.
- Compare -Xmx to cgroup limit
    kubectl exec <pod> -- java -XX:+PrintFlagsFinal -version | grep -i heapsize
    # limit must exceed Xmx + headroom (~25–40%)
- Events: liveness / readiness / eviction
  Liveness probe failed, Evicted, NodeHasInsufficientMemory.
- Find JVM hs_err_pid file
  By default the JVM writes it to its working directory and falls back to /tmp; also check volume mounts, or host /var/log if configured. Set the location explicitly with:
    -XX:ErrorFile=/dumps/hs_err_%p.log
- Previous pod logs (short window)
    kubectl logs <pod> --previous --tail=200
  Last lines may show OOM, GC overhead, or probe timeout; these do not always appear in the current log stream.
- Node-level OOM (if not OOMKilled on container)
  On node (if permitted):
    dmesg | grep -i 'killed process'
  This shows the Linux OOM killer acting at host level.
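The first two checks can be scripted when many pods need triage. The sketch below assumes the input is the JSON printed by `kubectl get pod <pod> -n <ns> -o json` and reads the `lastState.terminated` fields of each container status. The embedded sample is a hand-written, abbreviated stand-in with hypothetical values, and `summarize_last_terminations` is an illustrative name.

```python
import json


def summarize_last_terminations(pod):
    """Return one summary dict per container that has a previous terminated state."""
    summaries = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if not terminated:
            continue  # container has not been restarted
        summaries.append({
            "container": status.get("name"),
            "restarts": status.get("restartCount", 0),
            "reason": terminated.get("reason"),
            "exit_code": terminated.get("exitCode"),
            "finished_at": terminated.get("finishedAt"),
        })
    return summaries


# Abbreviated, hypothetical pod status for illustration only.
SAMPLE_POD_JSON = """
{
  "status": {
    "containerStatuses": [
      {
        "name": "app",
        "restartCount": 4,
        "lastState": {
          "terminated": {
            "reason": "OOMKilled",
            "exitCode": 137,
            "finishedAt": "2024-01-01T00:00:00Z"
          }
        }
      }
    ]
  }
}
"""

for row in summarize_last_terminations(json.loads(SAMPLE_POD_JSON)):
    print(row)
```

In real use the sample string would be replaced by `json.load(sys.stdin)` with the kubectl output piped in.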
Liveness probe false positives
If Reason is Killing with “Container failed liveness probe”:
- GC pause > probe timeoutSeconds → increase the timeout or use a startup probe during warm-up.
- Probe hits a dependency (DB) that is slow → probe should check app heartbeat only, not the full stack.
- failureThreshold 3 × periodSeconds 10s = container killed after about 30s of slow responses.
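The probe arithmetic above is simple enough to encode as a sanity check for a probe config. This sketch follows the same approximation as the text (failureThreshold × periodSeconds) and ignores the extra delay added by the probe timeout itself; both function names are hypothetical.

```python
def seconds_until_kill(failure_threshold, period_seconds):
    """Approximate time of continuous probe failures before kubelet restarts the container."""
    return failure_threshold * period_seconds


def pause_fails_probe(gc_pause_seconds, timeout_seconds):
    """True when a single stop-the-world pause is long enough to time out one probe."""
    return gc_pause_seconds > timeout_seconds


print(seconds_until_kill(failure_threshold=3, period_seconds=10))       # 30
print(pause_fails_probe(gc_pause_seconds=2.5, timeout_seconds=1))       # True
print(pause_fails_probe(gc_pause_seconds=0.2, timeout_seconds=1))       # False
```

A single long pause fails one probe; the container is only killed if failures persist for the whole window, so both numbers matter when tuning.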
What to collect for the ticket
Pod name / node:
Exit code & reason:
Container memory limit & peak usage:
-Xmx / -Xms / MaxMetaspaceSize:
Restart count in last 24h:
Last 50 lines of kubectl logs --previous:
hs_err_pid present (Y/N):
Recent deploy SHA:
Liveness/readiness probe config:
How — fix, harden, alert
Fix OOMKilled (most common)
- Lower -Xmx or raise the container memory limit; never set -Xmx equal to the limit with no headroom.
- Fix memory leak if usage climbs over hours (slow after hours).
- Cap direct memory if Netty/gRPC: -XX:MaxDirectMemorySize.
- Validate with soak test; watch working set stay below ~85% of limit.
# container spec fragment
resources:
  limits:
    memory: "3Gi"
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xms2g -Xmx2g -XX:MaxMetaspaceSize=256m"
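The limit and heap settings in the fragment above can be compared automatically, for example in a CI check. The following is a minimal sketch under two stated simplifications: it only understands the binary Kubernetes suffixes Ki/Mi/Gi and an -Xmx value with a k/m/g suffix, and it measures headroom as a fraction of -Xmx (the source does not say which base the 25–40% refers to). All function names are hypothetical.

```python
import re

_UNIT_BYTES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_k8s_memory(value):
    """Parse a Kubernetes memory quantity such as '3Gi' into bytes (Ki/Mi/Gi only)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", value)
    if not match:
        raise ValueError(f"unsupported memory quantity: {value!r}")
    return int(match.group(1)) * _UNIT_BYTES[match.group(2)[0].lower()]


def parse_xmx(java_options):
    """Extract -Xmx from a JVM options string; returns bytes, or None if absent."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_options)
    if not match:
        return None
    return int(match.group(1)) * _UNIT_BYTES[match.group(2).lower()]


def headroom_ratio(limit_bytes, xmx_bytes):
    """Non-heap headroom expressed as a fraction of -Xmx (an assumed definition)."""
    return (limit_bytes - xmx_bytes) / xmx_bytes


limit = parse_k8s_memory("3Gi")
xmx = parse_xmx("-Xms2g -Xmx2g -XX:MaxMetaspaceSize=256m")
print(headroom_ratio(limit, xmx))  # 0.5
```

With a 3Gi limit and a 2g heap the headroom is 50% of the heap, which sits above the ~25–40% guideline mentioned earlier.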
Fix liveness-kill loop
- Add startupProbe with generous failureThreshold during warm-up.
- Liveness: lightweight endpoint (e.g. /health/live) that does not query DB.
- Align timeouts with GC pause p99 + margin.
Fix JVM fatal errors
- Read hs_err_pid: is the faulting frame in the JDK or in your JNI library?
- Upgrade JDK patch release; reproduce with native symbols if custom JNI.
- Disable risky agents (old APM native agents) as isolation test.
Logging and dumps on fatal exit
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/dumps
-XX:+ExitOnOutOfMemoryError
-XX:ErrorFile=/dumps/hs_err_%p.log
-XX:-OmitStackTraceInFastThrow   # avoid hiding repeated stack traces in logs
Mount /dumps on emptyDir with size limit; sidecar or job copies files before pod deleted.
Alerts
- Kube-state-metrics: pod restart rate > N per hour.
- Alert on OOMKilled reason in events (event exporter).
- Memory working set > 90% of limit for 5 min.
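The "above 90% of limit for 5 min" condition is normally expressed in the alerting system itself (for example with a `for: 5m` clause in a Prometheus alerting rule). As an illustration of the logic only, here is a sketch that evaluates the same condition over a list of samples; the sample format and the function name are assumptions.

```python
def sustained_above(samples, limit_bytes, ratio=0.90, window_seconds=300):
    """Return True if usage stays above ratio * limit for at least window_seconds.

    samples: iterable of (unix_timestamp_seconds, working_set_bytes), sorted by time.
    """
    threshold = ratio * limit_bytes
    breach_start = None
    for timestamp, used in samples:
        if used > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= window_seconds:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the window
    return False


limit = 1000
steady = [(t, 950) for t in range(0, 301, 60)]
with_dip = [(0, 950), (60, 950), (120, 800), (180, 950), (240, 950), (300, 950)]

print(sustained_above(steady, limit))    # True
print(sustained_above(with_dip, limit))  # False
```

Requiring a sustained breach rather than a single sample keeps short allocation spikes from paging anyone.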
Interview one-liner
“Exit 137 means SIGKILL—usually cgroup OOMKilled, not a missing log line. I check describe pod, memory limit vs peak, logs --previous, and hs_err_pid; then fix heap sizing or leak, or fix liveness probes killing the container during long GC.”
Related scenarios
- OutOfMemoryError — when JVM logs before exit.
- Memory leak — leads to OOMKilled without Java stack.
- RSS grows, heap stable — native memory and NMT.
- Kubernetes debugging — events and probes.