Container Runtime & Lifecycle
A container is a running instance of an image—a namespaced, cgroup-bounded process tree managed by containerd and runc. Every production incident eventually lands here: wrong docker run flags, PID 1 signal handling, missing resource limits, or debugging blind spots. Master the lifecycle and the CLI becomes a precision instrument, not a black box.
Container lifecycle
A container is not a VM—it is a stateful object in Docker's metadata store backed by a live (or frozen) process tree. Understanding the state machine is the foundation for graceful shutdowns, restart policies, and why docker create exists separately from docker run.
State machine
Docker tracks each container through a well-defined lifecycle. The writable container layer (the "upperdir" in OverlayFS) persists until the container is removed, even after stop. Paused containers have every process frozen by the cgroup freezer; stopped containers have exited but retain their filesystem and metadata (docker ps -a and docker inspect report this state as exited).
stateDiagram-v2
[*] --> created: docker create
created --> running: docker start / docker run
running --> paused: docker pause
paused --> running: docker unpause
running --> stopped: exit / docker stop / docker kill
paused --> stopped: docker kill
stopped --> running: docker start
stopped --> removed: docker rm
running --> removed: docker rm -f
created --> removed: docker rm
removed --> [*]
| State | Process tree | Writable layer | Typical next action |
|---|---|---|---|
| created | Not started; configuration recorded by the daemon | Allocated, empty | docker start |
| running | Active; PID 1 (or init) executing | Accepting writes | docker stop, docker exec |
| paused | Frozen (cgroup freezer) | Preserved | docker unpause |
| stopped | Exited; exit code recorded | Preserved (diff vs image) | docker start, docker rm |
| removed | Gone | Deleted (unless volume) | — |
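The transitions in the table can be sketched as a small lookup. This is an illustrative model only: next_state is a hypothetical helper, not a Docker command, and it uses this section's state names rather than the strings Docker prints.

```shell
# Hypothetical model of the lifecycle table; it never talks to Docker.
# Usage: next_state <current-state> <action>
next_state() {
  case "$1:$2" in
    created:start)             echo running ;;
    running:pause)             echo paused ;;
    paused:unpause)            echo running ;;
    running:stop|running:kill) echo stopped ;;
    paused:kill)               echo stopped ;;
    stopped:start)             echo running ;;
    created:rm|stopped:rm)     echo removed ;;
    running:rm-f)              echo removed ;;
    *) echo "invalid transition: $1 -> $2" >&2; return 1 ;;
  esac
}

next_state running pause   # prints: paused
```

A plain rm on a running container falls through to the invalid branch, which mirrors Docker refusing to remove a running container without -f.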
docker create vs docker run
docker create records the container's configuration (env, mounts, network settings, resource limits) and allocates its writable layer, but starts no process. The container sits in the created state until docker start, which is when the runtime sets up namespaces and cgroups and launches PID 1. This split is useful for pre-provisioning containers, attaching to custom networks before start, or orchestrators that separate "define" from "launch."
docker run is create + start in one command. With -d it returns immediately after backgrounding; without -d it attaches to stdout/stderr and blocks until the container exits.
# Create without starting — inspect config before launch
docker create --name api-cache -e REDIS_PASSWORD=secret redis:7-alpine
docker inspect api-cache --format '{{.State.Status}}' # created
# Start when ready
docker start api-cache
# Equivalent one-liner
docker run -d --name api-cache -e REDIS_PASSWORD=secret redis:7-alpine
Start, stop, and kill
| Command | Signal | Behavior | Use when |
|---|---|---|---|
| docker stop | SIGTERM → wait → SIGKILL | Graceful shutdown; default 10s timeout (-t) | Normal shutdown; apps handle SIGTERM |
| docker kill | SIGKILL (default) or custom | Immediate; no cleanup hooks | Stuck container, runaway process |
| docker start | — | Re-runs container with same ID, layer, config | Restart after stop (not a fresh create) |
| docker restart | stop + start | Same container ID; processes re-exec entrypoint | Config reload that needs full process restart |
docker stop sends the stop signal (SIGTERM unless the image sets STOPSIGNAL or the container was started with --stop-signal) to PID 1 in the container's PID namespace; the same process appears under a different, host-level PID in docker top. The daemon waits for the stop timeout, then escalates to SIGKILL. If PID 1 never acts on SIGTERM (the shell-form CMD trap), you'll wait the full timeout on every deploy.
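The TERM, wait, KILL sequence can be tried against an ordinary local process. The sketch below is not Docker's implementation: stop_with_grace is a hypothetical helper that mimics the escalation for a PID started by the current shell.

```shell
# Sketch only: mimics the `docker stop` escalation for a plain local process.
# Call it from the shell that started <pid>; a child that has exited but has
# not been reaped yet still answers `kill -0`.
# Usage: stop_with_grace <pid> <grace-seconds>
# Returns 0 if the process exited on SIGTERM, 1 if SIGKILL was needed.
stop_with_grace() {
  pid=$1
  grace=$2
  kill -TERM "$pid" 2>/dev/null || { echo "not running"; return 0; }
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$grace" ]; then
      kill -KILL "$pid" 2>/dev/null
      echo "escalated to SIGKILL"
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "exited after SIGTERM"
}

# Example: sleep 30 & stop_with_grace "$!" 5
```

A process that ignores SIGTERM always costs the full grace period before it dies, which is the deploy delay described above.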
Pause and unpause
docker pause freezes every process in the container using the cgroup freezer controller. Memory stays allocated; network connections may time out on peers. Rare in app deploys, but useful for consistent filesystem snapshots or debugging race conditions without tearing down the process tree.
Remove (docker rm)
Removing a stopped container deletes its writable layer and frees the name. Passing -v to docker rm also deletes the anonymous volumes attached to that container. Named volumes survive unless explicitly removed with docker volume rm. docker rm -f kills a running container first, which is dangerous in production without orchestrator coordination.
Restart policies (--restart)
Restart policies tell the Docker daemon whether to automatically restart a container when it exits or when the daemon itself restarts. Set at docker run or docker update; stored in container config.
| Policy | On exit | On daemon restart | Typical use |
|---|---|---|---|
| no (default) | Stays stopped | Stays stopped | CI jobs, one-off tasks, docker run --rm |
| always | Always restart (any exit code) | Restarts the container, even if it was manually stopped | Long-lived daemons; beware restart loops on crash |
| on-failure | Restart only on non-zero exit | Not restarted, per Docker's restart-policy documentation | Batch workers; optional :max-retries |
| unless-stopped | Restart unless manually stopped | Restarts unless you explicitly stopped it | Production single-host; survives reboot |
# Restart up to 3 times on failure only
docker run -d --restart on-failure:3 --name worker myapp:latest
# Survive host reboot — won't restart if you docker stop it manually
docker run -d --restart unless-stopped --name edge-proxy nginx:alpine
# Check restart count after crash loop
docker inspect --format '{{.RestartCount}}' worker
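The "On exit" column of the policy table can be written out as a decision function. This is a hypothetical helper modeled on the table, not Docker source code, and it covers container exit only, not daemon restarts.

```shell
# Hypothetical model of the "On exit" column; it never talks to Docker.
# Usage: should_restart <policy> <exit-code> <manually-stopped: yes|no> \
#                       [retries-so-far] [max-retries, 0 = unlimited]
should_restart() {
  policy=$1
  code=$2
  manual=$3
  tries=${4:-0}
  max=${5:-0}
  # A manual docker stop is never undone by the policy on its own.
  if [ "$manual" = yes ]; then echo no; return 0; fi
  case "$policy" in
    no)                    echo no ;;
    always|unless-stopped) echo yes ;;
    on-failure)
      if [ "$code" -eq 0 ]; then
        echo no
      elif [ "$max" -gt 0 ] && [ "$tries" -ge "$max" ]; then
        echo no
      else
        echo yes
      fi ;;
    *) echo "unknown policy: $policy" >&2; return 2 ;;
  esac
}

should_restart on-failure 1 no 3 3   # prints: no (retry cap reached)
```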
Restart loop masking bugs: --restart always on a container that exits immediately (bad config, missing env var) creates a tight crash loop. Disk fills with logs, CPU churns, and the real error scrolls past. Fix the root cause; use on-failure:N caps and alert on RestartCount.
"Difference between docker stop and docker kill?" — stop = SIGTERM with grace period for cleanup; kill = immediate SIGKILL (or custom signal). Follow up: "What receives SIGTERM?" → PID 1 in the container namespace; hence the PID 1 problem for shell-form entrypoints.
Running containers
docker run is the most powerful command in the Docker CLI—every flag maps to OCI runtime spec fields, cgroup writes, or network namespace setup. Treat this table as your field reference; production correctness lives in the flags you don't forget.
docker run flags reference
| Flag | Purpose | Example | Internals note |
|---|---|---|---|
| -d | Detach; run in background | docker run -d nginx | containerd starts process; CLI returns container ID |
| -i | Keep STDIN open | docker run -i alpine cat | Interactive input without TTY allocation |
| -t | Allocate pseudo-TTY | docker run -it bash | Sets TERM; required for shells, vim |
| --name | Human-readable name (unique) | --name api-gateway | DNS on user-defined networks resolves by name |
| --rm | Auto-remove on exit | docker run --rm alpine echo hi | Deletes writable layer; great for CI, bad for post-mortem |
| -p | Publish port (host:container) | -p 8080:80, -p 127.0.0.1:8080:80 | iptables DNAT via docker-proxy or hairpin NAT |
| -P | Publish all EXPOSE'd ports to random host ports | docker run -P nginx | Reads image config ExposedPorts |
| -v | Bind mount or named volume (legacy syntax) | -v data:/var/lib/mysql | Direct mount into MNT namespace |
| --mount | Explicit mount (preferred) | --mount type=bind,src=/host,dst=/app,ro | Clearer semantics: bind, volume, tmpfs |
| -e | Set environment variable | -e JAVA_OPTS=-Xmx512m | Injected into OCI process env block |
| --env-file | Load env vars from file | --env-file .env.prod | KEY=VALUE lines; supports # comments |
| --network | Attach to network | --network backend | Creates veth pair, assigns IP on bridge/overlay |
| --network-alias | Extra DNS names on network | --network-alias db | Embedded DNS (127.0.0.11) resolves aliases |
| --user | Run as UID:GID or name | --user 1001:1001 | Sets the process UID/GID in the OCI spec; separate from user-namespace remapping |
| --read-only | Root filesystem read-only | --read-only | Upperdir still exists but mounted ro; pair with tmpfs for writes |
| --tmpfs | RAM-backed writable mount | --tmpfs /tmp:rw,noexec,nosuid,size=64m | Ephemeral; contents are lost when the container stops |
| --security-opt | Security options | --security-opt no-new-privileges | Sets prctl PR_SET_NO_NEW_PRIVS; blocks setuid escalation |
| --cap-drop / --cap-add | Linux capabilities | --cap-drop ALL --cap-add NET_BIND_SERVICE | Default Docker drops many caps; never add SYS_ADMIN casually |
| --memory | Hard memory limit | --memory 512m | Writes memory.max in cgroup v2 |
| --cpus | CPU quota (cores as float) | --cpus 1.5 | cpu.max = quota/period (default period 100ms) |
| --pids-limit | Max processes in container | --pids-limit 200 | Prevents fork bombs; default -1 (unlimited) on some hosts |
Production-shaped example
$ docker run -d \
    --name payments-api \
    --restart unless-stopped \
    --network backend \
    --network-alias payments \
    -p 127.0.0.1:8080:8080 \
    -e SPRING_PROFILES_ACTIVE=prod \
    --env-file /etc/payments.env \
    --user 10001:10001 \
    --read-only \
    --tmpfs /tmp:rw,noexec,nosuid,size=128m \
    --security-opt no-new-privileges \
    --cap-drop ALL \
    --memory 1g \
    --cpus 2 \
    --pids-limit 512 \
    --init \
    myregistry/payments-api:sha256-abc123
f3a8c2d91e04b7...
Default-deny capabilities: Start with --cap-drop ALL and add only what the app needs (typically NET_BIND_SERVICE if binding <1024). Combine with --read-only, no-new-privileges, and non-root --user. Mounting /var/run/docker.sock effectively grants host root, so treat any such mount as a critical security risk.
Prefer --mount over -v for clarity: type=bind,src=/data,dst=/data,readonly is unambiguous. For named volumes: --mount type=volume,source=pgdata,target=/var/lib/postgresql/data. Bind mounts to host paths break portability—use named volumes or external storage drivers in prod.
-p 0.0.0.0:8080:8080 vs 127.0.0.1:8080:8080: Binding all interfaces exposes the port on every host NIC—including public ones. Loopback-only is safer for dev; production should use a reverse proxy or ingress controller, not raw Docker port publish on 0.0.0.0.
PID 1 problem
Inside a container, your main process is PID 1. The Linux kernel treats PID 1 of a PID namespace differently: a signal reaches it only if the process has installed a handler for that signal (SIGKILL and SIGSTOP sent from the host are the exceptions), it must reap zombie children, and it is the process Docker signals during docker stop. Get PID 1 wrong and graceful deploys become hard kills.
Why PID 1 is special
- Signal handling — PID 1 ignores SIGTERM/SIGINT unless the process explicitly installs handlers. Shells and JVMs behave differently.
- Zombie reaping — Orphaned children are reparented to PID 1. If PID 1 doesn't call wait(), zombies accumulate (defunct in ps).
- Shutdown path — docker stop → SIGTERM to PID 1 → grace period → SIGKILL. Your app must handle SIGTERM to drain connections and flush state.
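A process that wants a graceful shutdown has to install its own SIGTERM handler. The sketch below shows the usual shell pattern; serve and drain are hypothetical names, and drain stands in for real cleanup work.

```shell
# Sketch of a long-running script that handles SIGTERM itself.
drain() { echo "draining connections"; }   # placeholder for real cleanup

serve() {
  trap 'drain; exit 0' TERM
  while :; do
    sleep 1 &   # do the waiting in a background child...
    wait $!     # ...so the trap runs as soon as SIGTERM arrives
  done
}

# Example: serve & pid=$!; sleep 1; kill -TERM "$pid"; wait "$pid"
```

Waiting on a background sleep matters here: a shell defers trap handlers until the current foreground command finishes, so one long foreground command would delay the shutdown by its full duration.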
Shell form trap
Dockerfile shell form (CMD ./start.sh) wraps your command in /bin/sh -c. The shell becomes PID 1—not your app. The shell often does not forward signals to child processes, so Java, Node, or nginx never see SIGTERM.
# BAD — shell is PID 1; SIGTERM may not reach java
FROM eclipse-temurin:21-jre
COPY app.jar /app.jar
CMD java -jar /app.jar
# BETTER — exec form: java is PID 1
CMD ["java", "-jar", "/app.jar"]
# BEST for shell scripts — exec replaces shell with script's final exec
CMD ["/start.sh"]
In a shell script used as entrypoint, end with exec "$@" or exec java -jar ... so the application binary replaces the shell and becomes PID 1.
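The effect of exec can be observed without Docker. In this sketch both helpers (hypothetical names) print two PIDs: the launching shell's, then the launched command's.

```shell
# With exec, the command replaces the shell, so both PIDs are identical.
pids_with_exec()    { sh -c 'echo $$; exec sh -c "echo \$\$"'; }

# Without exec, the command runs as a child process with its own PID.
# The trailing ":" keeps the inner sh from being the last command, which
# some shells would otherwise exec implicitly.
pids_without_exec() { sh -c 'echo $$; sh -c "echo \$\$"; :'; }

pids_with_exec      # two identical PIDs
pids_without_exec   # two different PIDs
```

Inside a container the first PID would be 1, so only the exec variant leaves the application itself as PID 1.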
Exec form fix
Exec form (CMD ["java", "-jar", "app.jar"]) invokes the binary directly—no shell wrapper. Signals go straight to your application. This is the minimum bar for production JVM, Node, and Go images.
tini and docker run --init
When you must run a shell script or a process that spawns children (nginx master/worker, supervisord), use a minimal init system. tini (bundled as docker-init) becomes PID 1: it forwards signals and reaps zombies. Enable it with docker run --init or a Dockerfile ENTRYPOINT such as ["/usr/bin/tini", "--"]. The path depends on the package: /usr/bin/tini on Debian and Ubuntu, /sbin/tini on Alpine.
# Dockerfile with tini
RUN apt-get update && apt-get install -y tini && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["nginx", "-g", "daemon off;"]
# Or at runtime without image change
# docker run --init nginx
Spring Boot SIGTERM
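Spring Boot registers a JVM shutdown hook, which runs when the JVM receives SIGTERM. That only happens if java is PID 1 (exec form) or an init process forwards the signal. With graceful shutdown enabled, the application stops accepting new requests, completes in-flight work (within the timeout), closes the application context, and exits. Align the grace periods: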
- server.shutdown=graceful + spring.lifecycle.timeout-per-shutdown-phase=30s
- docker stop -t 35 <container> — stop timeout must exceed Spring's shutdown phase
- Kubernetes terminationGracePeriodSeconds must exceed both
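The three timeouts in the list above can be checked mechanically before a deploy. check_grace is a hypothetical helper that takes all values in whole seconds; it only encodes the ordering stated in the list.

```shell
# Hypothetical sanity check for the three grace periods (whole seconds).
# Usage: check_grace <spring-phase> <docker-stop-timeout> <k8s-termination-grace>
check_grace() {
  spring=$1
  stop=$2
  k8s=$3
  if [ "$stop" -le "$spring" ]; then
    echo "docker stop timeout (${stop}s) must exceed the Spring phase (${spring}s)"
    return 1
  fi
  if [ "$k8s" -le "$spring" ] || [ "$k8s" -le "$stop" ]; then
    echo "terminationGracePeriodSeconds (${k8s}s) must exceed both other timeouts"
    return 1
  fi
  echo ok
}

check_grace 30 35 40   # prints: ok
```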
sequenceDiagram
participant D as docker stop
participant C as containerd
participant I as init (tini / app)
participant A as app process
D->>C: Stop container (timeout T)
C->>I: SIGTERM to PID 1
alt exec form / tini
I->>A: Forward SIGTERM
A->>A: Drain requests, flush state
A->>C: exit 0
else shell form without exec
I->>I: Shell exits or ignores
Note over C: Wait T seconds
C->>I: SIGKILL
end
Shell-form CMD java -jar app.jar in production: Deployments look fine until the first rolling update, when connections drop, transactions abort, and logs show SIGKILL after 10s. Always verify what runs as PID 1, for example with docker exec <container> cat /proc/1/comm: it should print java, not sh. docker top lists host PIDs, so it cannot show PID 1 directly.
Kubernetes sends SIGTERM to PID 1 inside the pod container—the same rules apply. Platform teams at scale mandate exec-form ENTRYPOINT, STOPSIGNAL SIGTERM, and documented stop timeouts in Helm charts. A 10s default docker stop breaks Spring apps with 30s graceful shutdown.
"Why do containers need an init process?" — PID 1 must reap zombies and forward signals. Shell-form CMD makes sh PID 1, which fails both jobs. Solutions: exec form, exec in scripts, tini/--init, or distroless with direct binary entrypoint.
Inspecting & debugging
Production debugging is a loop: observe state, compare to expected, narrow the blast radius. Docker's inspection commands expose containerd metadata, cgroup stats, and filesystem diffs—use them before reaching for ssh on the host.
Command reference
| Command | What it shows | Key flags |
|---|---|---|
| docker logs | stdout/stderr from container process | -f follow, --tail 100, --since 1h, -t timestamps |
| docker exec | Run command in running container's namespaces | -it interactive shell, -u root user override |
| docker inspect | Full JSON config + state (IP, mounts, env, health) | --format '{{.State.OOMKilled}}' |
| docker stats | Live CPU, memory, net I/O, block I/O | --no-stream one-shot; reads cgroup counters |
| docker top | Processes in the container, listed with their host PIDs | Accepts ps options after the container name |
| docker diff | Files changed in writable layer vs image | A/C/D markers for added/changed/deleted |
| docker cp | Copy files host ↔ container filesystem | Works on stopped containers too |
Logs — first stop for runtime errors
# Stream logs with timestamps
docker logs -f -t --tail 200 payments-api
# Logs since last deploy (approximate)
docker logs --since 2026-06-05T10:00:00 payments-api
# Exit code of stopped container
docker inspect --format '{{.State.ExitCode}}' payments-api
Logs are stored by the logging driver (default json-file on disk under /var/lib/docker/containers/<id>/). They survive container stop but not docker rm unless forwarded to a centralized driver (fluentd, awslogs, etc.).
Exec — interactive debugging
$ docker exec -it payments-api sh
/app $ ps aux
PID   USER     COMMAND
    1 10001    java -jar app.jar
   47 10001    sh
/app $ curl -s localhost:8080/actuator/health
{"status":"UP"}
Distroless images have no shell: docker exec -it app sh fails. Use debug variants (:debug tags with busybox), docker debug (shipped with recent Docker Desktop releases, not the standalone Engine), or sidecar diagnostic containers that share the PID namespace (--pid=container:app).
Inspect — the source of truth
# High-value inspect templates
docker inspect payments-api --format 'Status={{.State.Status}} OOM={{.State.OOMKilled}} Exit={{.State.ExitCode}}'
docker inspect payments-api --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'
docker inspect payments-api --format '{{json .HostConfig.Memory}}'
# Pretty-print full state
docker inspect payments-api | jq '.[0].State'
Stats and top — live resource view
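docker stats reads the container's cgroup counters. On Linux the CLI subtracts page cache from the memory usage it reports, so the figure can sit below what the kernel charges against the --memory limit. docker top lists processes with host PIDs, bridging container and host views for signal debugging.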
Diff and cp — filesystem forensics
docker diff shows what the writable layer changed—useful when an app writes config at runtime or malware is suspected. docker cp extracts heap dumps, thread dumps, or config files without installing tools in production images.
Host-level debugging: nsenter and /proc/<pid>/root
When docker exec isn't enough, pivot to the host. Every container process has a host PID (from docker inspect --format '{{.State.Pid}}'). From there:
- /proc/<pid>/root — view the container's filesystem root from the host
- nsenter -t <pid> -m -u -i -n -p sh — enter mount, UTS, IPC, net, and PID namespaces
- /proc/<pid>/ns/ — compare namespace inodes with other containers or host
PID=$(docker inspect --format '{{.State.Pid}}' payments-api)
# Browse container rootfs from host
sudo ls /proc/$PID/root/app
# Enter all namespaces (requires root on host)
sudo nsenter -t $PID -m -u -i -n -p -- ps aux
# Compare network namespace with host
readlink /proc/$PID/ns/net
readlink /proc/1/ns/net
docker exec creates a new process inside the container's namespaces via runc—it does not SSH into a VM. The exec'd process shares the network stack, mount table, and (by default) PID namespace with the running container. nsenter does the same from outside dockerd, useful when the Docker socket is unavailable.
Build a personal cheat sheet of docker inspect --format one-liners for Status, OOMKilled, RestartCount, IP, and Memory limit. In incidents, these five fields answer 80% of "why did it die?" before you open the full JSON blob.
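One possible shape for that cheat sheet is a single wrapper that prints all five fields on one line. why_did_it_die is a hypothetical name; the template fields are the ones used elsewhere in this section, and the function needs the docker CLI and a container name or ID.

```shell
# Hypothetical wrapper around docker inspect; requires the docker CLI.
# Usage: why_did_it_die <container>
why_did_it_die() {
  docker inspect "$1" --format \
    'Status={{.State.Status}} OOM={{.State.OOMKilled}} Exit={{.State.ExitCode}} Restarts={{.RestartCount}} MemLimit={{.HostConfig.Memory}} IP={{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}'
}

# Example: why_did_it_die payments-api
```

MemLimit is reported in bytes, with 0 meaning no limit was set.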
Container resource limits
Without cgroup limits, a container is just a process group on the host—free to consume all RAM, all CPU, and fork without bound. Limits are not optional in shared environments; they are the contract between platform and application teams.
Memory limits
--memory sets the hard ceiling (memory.max in cgroup v2). When the container's memory usage (anon + mapped file cache charged to cgroup) exceeds the limit, the kernel OOM killer selects a process in the cgroup—usually your app—and sends SIGKILL.
| Flag | Effect | Notes |
|---|---|---|
| --memory=512m | Hard RAM limit | Exceeding → OOMKill |
| --memory-swap=1g | RAM + swap combined cap | Must be ≥ memory; -1 = unlimited swap (dangerous) |
| --memory-reservation=256m | Soft limit (best-effort) | Set below --memory; the kernel reclaims toward it when the host is under memory pressure |
| --oom-kill-disable | Disable OOM killer for container | Rare; can hang host under memory pressure |
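The size strings in the table use binary units, so 512m means 512 MiB. The converter below is an illustrative sketch, not Docker's own parser: mem_to_bytes and swap_bytes are hypothetical helpers that accept whole numbers with an optional b, k, m, or g suffix.

```shell
# Illustrative converter for Docker-style sizes (powers of 1024).
# Usage: mem_to_bytes <size>   e.g. 512m, 1g, 4096
mem_to_bytes() {
  num=${1%[bkmgBKMG]}
  case "$num" in
    ''|*[!0-9]*) echo "unsupported size: $1" >&2; return 1 ;;
  esac
  case "$1" in
    *[kK]) echo $((num * 1024)) ;;
    *[mM]) echo $((num * 1024 * 1024)) ;;
    *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
    *)     echo "$num" ;;
  esac
}

# Swap available to the container = --memory-swap minus --memory.
# Usage: swap_bytes <memory> <memory-swap>
swap_bytes() {
  echo $(( $(mem_to_bytes "$2") - $(mem_to_bytes "$1") ))
}

mem_to_bytes 512m    # 536870912
swap_bytes 512m 1g   # 536870912
```

With --memory=512m and --memory-swap=1g the container can use 512 MiB of RAM plus 512 MiB of swap, because the swap flag is the combined cap.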
CPU limits
--cpus=1.5 sets CFS quota: 150% of one core (or spread across cores). --cpu-shares (default 1024) only matters when CPUs are contended—it is relative weight, not a guarantee. For latency-sensitive services, prefer explicit --cpus over shares alone.
# Verify cgroup limits from host (cgroup v2, systemd cgroup driver)
# With the cgroupfs driver the directory layout differs.
CID=$(docker inspect --format '{{.Id}}' payments-api)   # full container ID
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/cpu.max
# Live usage
docker stats --no-stream payments-api
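The value to expect in cpu.max follows from the quota arithmetic: quota = cpus × period, written as "quota period" in microseconds. The helper below is a hypothetical sketch of that arithmetic, assuming the default 100ms period.

```shell
# Illustrative arithmetic only: quota = cpus * period, rounded to whole
# microseconds. The period defaults to 100000 us (100 ms).
# Usage: cpus_to_cpu_max <cpus> [period-us]
cpus_to_cpu_max() {
  awk -v cpus="$1" -v period="${2:-100000}" \
    'BEGIN { printf "%d %d\n", cpus * period + 0.5, period }'
}

cpus_to_cpu_max 1.5   # 150000 100000
```

A container started with --cpus 1.5 may therefore run for 150ms of CPU time in every 100ms window, summed across all cores.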
OOMKilled — diagnosis
$ docker inspect payments-api --format 'OOM={{.State.OOMKilled}} Exit={{.State.ExitCode}}'
OOM=true Exit=137
$ dmesg | tail -5 | grep -i oom
Memory cgroup out of memory: Killed process 91234 (java) total-vm:2147484kB, anon-rss:524288kB
$ docker logs --tail 20 payments-api
# Often no graceful message: SIGKILL is not catchable
Exit code 137 = 128 + 9 (SIGKILL)—strong hint for OOM or docker kill. Java without container-aware heap sizing sets -Xmx from host RAM, not cgroup limit—use -XX:+UseContainerSupport (default Java 10+) and prefer -XX:MaxRAMPercentage over hard-coded -Xmx.
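The 128 + N convention can be decoded with the shell's own signal table. explain_exit_code is a hypothetical helper; it assumes the code follows the convention, which holds for signal deaths but not for an application that deliberately exits with a value above 128.

```shell
# Decode an exit code using the 128 + signal-number convention.
# Usage: explain_exit_code <exit-code>
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # kill -l <number> prints the signal name without the SIG prefix.
    name=$(kill -l "$sig" 2>/dev/null) || name=unknown
    echo "killed by signal $sig (SIG$name)"
  else
    echo "application error (exit $code)"
  fi
}

explain_exit_code 137   # killed by signal 9 (SIGKILL)
explain_exit_code 143   # killed by signal 15 (SIGTERM)
```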
Danger of no limits
| Risk | Without limits | With limits |
|---|---|---|
| Memory hog | Container consumes host RAM → host OOM kills random processes | OOM isolated to container cgroup |
| CPU starvation | Batch job pegs all cores; latency spikes on co-hosted services | --cpus caps burn; shares allocate fairly |
| Fork bomb | Unbounded fork() exhausts host PIDs | --pids-limit stops cascade |
| Blast radius | One bad deploy takes down the node | Failed container; neighbors survive |
DoS via resource exhaustion is a security issue, not just performance. Always set --memory, --cpus, and --pids-limit on untrusted or multi-tenant workloads. Pair them with --ulimit at run time and a read-only rootfs to reduce attack surface.
Tight limits vs headroom: Too-low memory causes OOM thrashing under legitimate spikes; too-high limits waste cluster capacity and increase noisy-neighbor risk. Size from load tests at P99, add 20–30% headroom, and alert on docker stats memory >80% sustained—not just on OOM events.
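The sizing rule is simple arithmetic. The two helpers below are hypothetical and work in whole MiB; the 25% headroom and the P99 figure in the example are made-up inputs, not measurements.

```shell
# Illustrative sizing arithmetic in MiB (integer math).
# Usage: limit_from_p99 <p99-mib> <headroom-percent>   (rounds up)
limit_from_p99() { echo $(( ($1 * (100 + $2) + 99) / 100 )); }

# Usage: alert_threshold <limit-mib> [percent, default 80]
alert_threshold() { echo $(( $1 * ${2:-80} / 100 )); }

limit_from_p99 400 25   # 500
alert_threshold 500     # 400
```

With a 400 MiB P99 and 25% headroom the limit becomes 500 MiB, and the sustained-usage alert fires at 400 MiB.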
Kubernetes translates resources.limits into the same cgroup settings as docker run --memory and --cpus; CPU requests become a relative cgroup weight, like --cpu-shares. Platform SREs often enforce limit ratios in admission webhooks. A Java service OOMing at a 512Mi limit but "working fine" locally on a 16Gi laptop is the most common container resource ticket.
"What happens when a container exceeds its memory limit?" — Kernel OOM killer targets a process in the cgroup; Docker sets OOMKilled=true, exit 137. Not the same as JVM OutOfMemoryError (which is catchable)—cgroup OOM is SIGKILL. Mention Java UseContainerSupport and sizing -Xmx below cgroup limit.