Container Runtime & Lifecycle

Container lifecycle

A container is not a VM—it is a stateful object in Docker's metadata store backed by a live (or frozen) process tree. Understanding the state machine is the foundation for graceful shutdowns, restart policies, and why docker create exists separately from docker run.

State machine

Docker tracks each container through a well-defined lifecycle. The writable container layer (the "upperdir" in OverlayFS) persists until the container is removed, even after stop. Paused containers freeze all processes via the cgroup freezer; stopped containers have exited but retain their filesystem and metadata (docker ps and docker inspect report this state as exited).

stateDiagram-v2
  [*] --> created: docker create
  created --> running: docker start / docker run
  running --> paused: docker pause
  paused --> running: docker unpause
  running --> stopped: exit / docker stop / docker kill
  paused --> stopped: docker kill
  stopped --> running: docker start
  stopped --> removed: docker rm
  running --> removed: docker rm -f
  created --> removed: docker rm
  removed --> [*]

State	Process tree	Writable layer	Typical next action
created	Not started; configuration recorded by the daemon	Allocated, empty	docker start
running	Active; PID 1 (or init) executing	Accepting writes	docker stop, docker exec
paused	Frozen (cgroup freezer)	Preserved	docker unpause
stopped	Exited; exit code recorded	Preserved (diff vs image)	docker start, docker rm
removed	Gone	Deleted (unless volume)	—
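
As a quick self-check, the edges in the diagram above can be encoded as a lookup. This is only a sketch of the diagram as drawn here: next_states and can_transition are hypothetical helper names, not Docker commands, and the real daemon tracks a few more statuses (for example restarting and dead).

```shell
#!/bin/sh
# next_states STATE: print the states reachable from STATE, one per line.
# "stopped" is this document's name; docker inspect reports it as "exited".
next_states() {
  case "$1" in
    created)        printf '%s\n' running removed ;;
    running)        printf '%s\n' paused stopped removed ;;
    paused)         printf '%s\n' running stopped ;;
    stopped|exited) printf '%s\n' running removed ;;
    removed)        ;;
    *)              echo "unknown state: $1" >&2; return 1 ;;
  esac
}

# can_transition FROM TO: succeed only if the diagram has an edge FROM -> TO.
can_transition() {
  next_states "$1" | grep -qx "$2"
}
```

For example, can_transition paused removed fails because the diagram has no such edge; the helper encodes only what is drawn above.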

docker create vs docker run

docker create records the container's configuration (image, command, env, mounts, network settings) and allocates its writable layer, but starts no process. Namespaces, cgroups, and the network endpoint are set up when docker start hands the container to containerd and runc. The container sits in the created state until then. This split is useful for pre-provisioning containers, attaching to custom networks before start, or orchestrators that separate "define" from "launch."

docker run is create + start in one command. With -d it returns immediately after backgrounding; without -d it attaches to stdout/stderr and blocks until the container exits.

# Create without starting — inspect config before launch
docker create --name api-cache -e REDIS_PASSWORD=secret redis:7-alpine
docker inspect api-cache --format '{{.State.Status}}'   # created

# Start when ready
docker start api-cache

# Equivalent one-liner
docker run -d --name api-cache -e REDIS_PASSWORD=secret redis:7-alpine

Start, stop, and kill

Command	Signal	Behavior	Use when
docker stop	SIGTERM → wait → SIGKILL	Graceful shutdown; default 10s timeout (-t)	Normal shutdown; apps handle SIGTERM
docker kill	SIGKILL (default) or custom	Immediate; no cleanup hooks	Stuck container, runaway process
docker start	—	Re-runs container with same ID, layer, config	Restart after stop (not a fresh create)
docker restart	stop + start	Same container ID; processes re-exec entrypoint	Config reload that needs full process restart

🔬 Under the Hood

docker stop sends SIGTERM (or the image's STOPSIGNAL) to the process that is PID 1 inside the container's PID namespace; on the host, and in docker top, the same process has a different PID. The daemon waits for the stop timeout, then escalates to SIGKILL. If PID 1 never acts on SIGTERM (the shell-form CMD trap), every deploy waits out the full timeout.
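
The escalation is easy to reproduce against any ordinary process. The sketch below assumes a POSIX shell and a target that is a child of the calling shell; stop_with_grace is a hypothetical helper that mimics the SIGTERM, wait, SIGKILL sequence and is not how the daemon itself is implemented.

```shell
#!/bin/sh
# stop_with_grace PID TIMEOUT
# Send SIGTERM, poll for up to TIMEOUT seconds, then fall back to SIGKILL.
stop_with_grace() {
  pid=$1
  timeout=$2
  kill -TERM "$pid" 2>/dev/null || { echo "not running"; return 0; }
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    sleep 1
    waited=$((waited + 1))
    # kill -0 sends no signal; it only tests whether the PID still exists.
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "exited after SIGTERM (${waited}s)"
      return 0
    fi
  done
  kill -KILL "$pid" 2>/dev/null || true
  echo "escalated to SIGKILL after ${timeout}s"
}
```

Run it against a process that ignores SIGTERM (for example sh -c 'trap "" TERM; exec sleep 300') to see the full timeout elapse before the kill, which is the same delay a shell-form container adds to every deploy.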

Pause and unpause

docker pause freezes every process in the container using the cgroup freezer controller. Memory stays allocated; network connections may time out on peers. Rare in app deploys, but useful for consistent filesystem snapshots or debugging race conditions without tearing down the process tree.

Remove (docker rm)

Removing a stopped container deletes its writable layer and frees the name. Passing -v to docker rm also deletes the anonymous volumes attached to that container. Named volumes survive unless explicitly removed with docker volume rm. docker rm -f kills a running container first (SIGKILL, no grace period), which is dangerous in production without orchestrator coordination.

Restart policies (--restart)

Restart policies tell the Docker daemon whether to automatically restart a container when it exits or when the daemon itself restarts. Set at docker run or docker update; stored in container config.

Policy	On exit	On daemon restart	Typical use
no (default)	Stays stopped	Stays stopped	CI jobs, one-off tasks, docker run --rm
always	Always restart (any exit code)	Starts the container, even if it was manually stopped	Long-lived daemons; beware restart loops on crash
on-failure	Restart only on non-zero exit	Not restarted by a daemon restart (per Docker's documentation)	Batch workers; optional :max-retries
unless-stopped	Restart unless manually stopped	Restarts unless you explicitly stopped it	Production single-host; survives reboot

# Restart up to 3 times on failure only
docker run -d --restart on-failure:3 --name worker myapp:latest

# Survive host reboot — won't restart if you docker stop it manually
docker run -d --restart unless-stopped --name edge-proxy nginx:alpine

# Check restart count after crash loop
docker inspect --format '{{.RestartCount}}' worker
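
The "On exit" column of the policy table can be read as a small decision function. This is a sketch under stated assumptions: should_restart is a hypothetical helper, its manual-stop flag stands in for state the daemon tracks internally, and daemon-restart behaviour is not modelled.

```shell
#!/bin/sh
# should_restart POLICY EXIT_CODE MANUALLY_STOPPED RESTART_COUNT
# POLICY: no | always | unless-stopped | on-failure | on-failure:N
# MANUALLY_STOPPED: 1 if the container was stopped with docker stop, else 0
# Returns 0 (restart) or 1 (stay stopped); 2 for an unknown policy.
should_restart() {
  policy=$1 exit_code=$2 manual=$3 count=$4
  # A manual stop suppresses automatic restarts for every policy.
  [ "$manual" -eq 1 ] && return 1
  case "$policy" in
    no)                    return 1 ;;
    always|unless-stopped) return 0 ;;
    on-failure)            [ "$exit_code" -ne 0 ] ;;
    on-failure:*)
      max=${policy#on-failure:}
      [ "$exit_code" -ne 0 ] && [ "$count" -lt "$max" ] ;;
    *) echo "unknown policy: $policy" >&2; return 2 ;;
  esac
}
```

With on-failure:3, should_restart on-failure:3 1 0 2 succeeds (third retry still allowed) and should_restart on-failure:3 1 0 3 fails, which is the point where the worker above stays stopped.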

⚠️ Pitfall

Restart loop masking bugs: --restart always on a container that exits immediately (bad config, missing env var) creates a tight crash loop. Disk fills with logs, CPU churns, and the real error scrolls past. Fix the root cause; use on-failure:N caps and alert on RestartCount.

🎯 Interview Tip

"Difference between docker stop and docker kill?" — stop = SIGTERM with grace period for cleanup; kill = immediate SIGKILL (or custom signal). Follow up: "What receives SIGTERM?" → PID 1 in the container namespace; hence the PID 1 problem for shell-form entrypoints.

Running containers

docker run is the most powerful command in the Docker CLI—every flag maps to OCI runtime spec fields, cgroup writes, or network namespace setup. Treat this table as your field reference; production correctness lives in the flags you don't forget.

docker run flags reference

Flag	Purpose	Example	Internals note
-d	Detach; run in background	docker run -d nginx	containerd starts process; CLI returns container ID
-i	Keep STDIN open	docker run -i alpine cat	Interactive input without TTY allocation
-t	Allocate pseudo-TTY	docker run -it bash	Sets TERM; required for shells, vim
--name	Human-readable name (unique)	--name api-gateway	DNS on user-defined networks resolves by name
--rm	Auto-remove on exit	docker run --rm alpine echo hi	Deletes writable layer; great for CI, bad for post-mortem
-p	Publish port (host:container)	-p 8080:80, -p 127.0.0.1:8080:80	iptables DNAT rule plus the docker-proxy userland proxy
-P	Publish all EXPOSE'd ports to random host ports	docker run -P nginx	Reads image config ExposedPorts
-v	Bind mount or named volume (legacy syntax)	-v data:/var/lib/mysql	Direct mount into MNT namespace
--mount	Explicit mount (preferred)	--mount type=bind,src=/host,dst=/app,ro	Clearer semantics: bind, volume, tmpfs
-e	Set environment variable	-e JAVA_OPTS=-Xmx512m	Injected into OCI process env block
--env-file	Load env vars from file	--env-file .env.prod	KEY=VALUE lines; supports # comments
--network	Attach to network	--network backend	Creates veth pair, assigns IP on bridge/overlay
--network-alias	Extra DNS names on network	--network-alias db	Embedded DNS (127.0.0.11) resolves aliases
--user	Run as UID:GID or name	--user 1001:1001	Sets the user in the OCI process spec; independent of user-namespace remapping
--read-only	Root filesystem read-only	--read-only	Upperdir still exists but mounted ro; pair with tmpfs for writes
--tmpfs	RAM-backed writable mount	--tmpfs /tmp:rw,noexec,nosuid,size=64m	Ephemeral; contents are lost when the container stops
--security-opt	Security options	--security-opt no-new-privileges	Sets prctl PR_SET_NO_NEW_PRIVS; blocks setuid escalation
--cap-drop / --cap-add	Linux capabilities	--cap-drop ALL --cap-add NET_BIND_SERVICE	Default Docker drops many caps; never add SYS_ADMIN casually
--memory	Hard memory limit	--memory 512m	Writes memory.max in cgroup v2
--cpus	CPU quota (cores as float)	--cpus 1.5	cpu.max = quota/period (default period 100ms)
--pids-limit	Max processes in container	--pids-limit 200	Prevents fork bombs; default -1 (unlimited) on some hosts

Production-shaped example

$ docker run -d \
  --name payments-api \
  --restart unless-stopped \
  --network backend \
  --network-alias payments \
  -p 127.0.0.1:8080:8080 \
  -e SPRING_PROFILES_ACTIVE=prod \
  --env-file /etc/payments.env \
  --user 10001:10001 \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=128m \
  --security-opt no-new-privileges \
  --cap-drop ALL \
  --memory 1g \
  --cpus 2 \
  --pids-limit 512 \
  --init \
  myregistry/payments-api:sha256-abc123
f3a8c2d91e04b7...

🔒 Security

Default-deny capabilities: Start with --cap-drop ALL and add only what the app needs (typically NET_BIND_SERVICE if binding <1024). Combine with --read-only, no-new-privileges, and non-root --user. Mounting /var/run/docker.sock effectively grants host root; treat any container that has it as fully privileged.

💡 Pro Tip

Prefer --mount over -v for clarity: type=bind,src=/data,dst=/data,readonly is unambiguous. For named volumes: --mount type=volume,source=pgdata,target=/var/lib/postgresql/data. Bind mounts to host paths break portability—use named volumes or external storage drivers in prod.

⚖️ Trade-off

-p 0.0.0.0:8080:8080 vs 127.0.0.1:8080:8080: Binding all interfaces exposes the port on every host NIC—including public ones. Loopback-only is safer for dev; production should use a reverse proxy or ingress controller, not raw Docker port publish on 0.0.0.0.
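
A review script can flag risky publish specs before they reach a host. A minimal sketch: publish_scope is a hypothetical helper that only understands the IPv4 forms HOST_PORT:CONTAINER_PORT and IP:HOST_PORT:CONTAINER_PORT, and it assumes the daemon's default bind address has not been changed.

```shell
#!/bin/sh
# publish_scope SPEC: classify a -p argument by the host address it binds.
# No handling for IPv6 brackets, port ranges, or protocol suffixes.
publish_scope() {
  case "$1" in
    127.*:*:*)   echo "loopback only" ;;
    0.0.0.0:*:*) echo "all interfaces" ;;
    *:*:*)       echo "single address ${1%%:*}" ;;
    *)           echo "all interfaces" ;;  # no IP given: binds every interface by default
  esac
}

publish_scope 8080:80
publish_scope 127.0.0.1:8080:8080
```

Note that the short form 8080:80 lands in the same bucket as an explicit 0.0.0.0, which is the case that tends to slip through review.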

PID 1 problem

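Inside a container, your main process is PID 1. The Linux kernel treats PID 1 of a PID namespace differently: a signal is delivered to it only if it has installed a handler for that signal, so the default action (such as terminate) is never applied. The exceptions are SIGKILL and SIGSTOP sent from outside the namespace, which is why the final SIGKILL of docker stop always works. PID 1 must also reap orphaned children, and it is the process Docker signals on docker stop. Get PID 1 wrong and graceful deploys become hard kills.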

Why PID 1 is special

Signal handling: PID 1 ignores SIGTERM/SIGINT unless the process explicitly installs handlers. The JVM installs a SIGTERM handler by default; a plain sh -c wrapper does not.
Zombie reaping: Orphaned children are reparented to PID 1. If PID 1 doesn't call wait(), zombies accumulate (defunct in ps).
Shutdown path: docker stop → SIGTERM to PID 1 → grace period → SIGKILL. Your app must handle SIGTERM to drain connections and flush state.

Shell form trap

Dockerfile shell form (CMD ./start.sh) wraps your command in /bin/sh -c. The shell becomes PID 1—not your app. The shell often does not forward signals to child processes, so Java, Node, or nginx never see SIGTERM.

# BAD: shell form, so /bin/sh -c is PID 1 and SIGTERM may not reach java
FROM eclipse-temurin:21-jre
COPY app.jar /app.jar
CMD java -jar /app.jar

# BETTER: exec form, so java is PID 1
CMD ["java", "-jar", "/app.jar"]

# For wrapper scripts: exec form, and the script must end with exec
# so the application replaces the shell as PID 1
CMD ["/start.sh"]

In a shell script used as entrypoint, end with exec "$@" or exec java -jar ... so the application binary replaces the shell and becomes PID 1.
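
To see that exec really hands the PID over, here is a minimal sketch of such a wrapper. The script name start.sh comes from the example above; its contents, the scratch directory, and the stand-in "application" (a shell that prints its own PID) are assumptions for illustration.

```shell
#!/bin/sh
# Write a hypothetical entrypoint wrapper to a scratch directory
# so the sketch runs on its own, outside any image.
workdir=$(mktemp -d)
cat > "$workdir/start.sh" <<'EOF'
#!/bin/sh
set -e
# One-time setup would go here (render config, check required env vars, ...).
echo "wrapper pid: $$"
# exec replaces this shell with the command passed as arguments,
# so the application keeps the wrapper's PID (PID 1 inside a container).
exec "$@"
EOF

# Run the wrapper with a stand-in application that prints its own PID.
sh "$workdir/start.sh" sh -c 'echo "app pid: $$"'
```

Both lines print the same number. Drop the exec and the application gets a new PID as a child of the shell, which is the situation where SIGTERM stops at the wrapper.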

Exec form fix

Exec form (CMD ["java", "-jar", "app.jar"]) invokes the binary directly—no shell wrapper. Signals go straight to your application. This is the minimum bar for production JVM, Node, and Go images.

tini and docker run --init

When you must run a shell script or a process that spawns children (nginx master/worker, supervisord), use a minimal init system. tini (bundled as docker-init) becomes PID 1: it forwards signals and reaps zombies. Enable it with docker run --init, or install tini in the image and make it the ENTRYPOINT, e.g. ENTRYPOINT ["/usr/bin/tini", "--"]. The path depends on the package: Debian's installs /usr/bin/tini, Alpine's /sbin/tini.

# Dockerfile with tini
RUN apt-get update && apt-get install -y tini && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["nginx", "-g", "daemon off;"]

# Or at runtime without image change
# docker run --init nginx
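
To make the signal-forwarding half of that job concrete, here is a rough shell approximation. It is a sketch for understanding, not a replacement for tini: forward_and_wait is a hypothetical name, and the function does not reap orphaned grandchildren.

```shell
#!/bin/sh
# forward_and_wait CMD [ARGS...]
# Run CMD as a child, relay SIGTERM to it, and return the child's exit status.
forward_and_wait() {
  "$@" &
  child=$!
  got_signal=0
  trap 'got_signal=1; kill -TERM "$child" 2>/dev/null' TERM
  status=0
  wait "$child" || status=$?
  # A trapped signal interrupts wait before the child has exited,
  # so wait once more to collect the child's real exit status.
  if [ "$got_signal" -eq 1 ]; then
    status=0
    wait "$child" || status=$?
  fi
  trap - TERM
  return "$status"
}
```

A child terminated by the relayed SIGTERM yields status 143 (128 + 15). The second wait is the detail hand-rolled wrappers usually miss, and one reason to prefer tini or --init over writing your own.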

Spring Boot SIGTERM

Spring Boot registers a JVM shutdown hook, which runs when the JVM receives SIGTERM; with exec form the JVM is PID 1 and actually receives it. With graceful shutdown enabled, the app stops accepting new requests, completes in-flight work (within the timeout), closes the application context, and exits. Align the grace periods:

server.shutdown=graceful + spring.lifecycle.timeout-per-shutdown-phase=30s
docker stop -t 35 <container> — stop timeout must exceed Spring's shutdown phase
Kubernetes terminationGracePeriodSeconds must exceed both
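
The ordering above is simple enough to lint in CI. A minimal sketch, assuming all three values are whole seconds; check_grace is a hypothetical helper, and how you extract the values from your config files is left out.

```shell
#!/bin/sh
# check_grace SPRING_PHASE DOCKER_STOP K8S_GRACE
# Succeeds only if docker stop exceeds Spring's shutdown phase and the
# Kubernetes grace period exceeds both.
check_grace() {
  spring=$1 docker_stop=$2 k8s=$3
  rc=0
  if [ "$docker_stop" -le "$spring" ]; then
    echo "docker stop -t ${docker_stop} does not exceed the ${spring}s shutdown phase"
    rc=1
  fi
  if [ "$k8s" -le "$docker_stop" ] || [ "$k8s" -le "$spring" ]; then
    echo "terminationGracePeriodSeconds ${k8s} does not exceed both other timeouts"
    rc=1
  fi
  return "$rc"
}
```

check_grace 30 35 40 passes silently; check_grace 30 10 40 fails, which is the default 10s docker stop colliding with a 30s Spring shutdown phase.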

sequenceDiagram
  participant D as docker stop
  participant C as containerd
  participant I as init (tini / app)
  participant A as app process
  D->>C: Stop container (timeout T)
  C->>I: SIGTERM to PID 1
  alt exec form / tini
    I->>A: Forward SIGTERM
    A->>A: Drain requests, flush state
    A->>C: exit 0
  else shell form without exec
    I->>I: Shell exits or ignores
    Note over C: Wait T seconds
    C->>I: SIGKILL
  end

⚠️ Pitfall

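Shell-form CMD java -jar app.jar in production: Deployments look fine until the first rolling update, when connections drop, transactions abort, and the container is SIGKILLed after 10s. Always verify with docker top <container>: the top-level process should be java, not sh -c. docker top shows host PIDs, so check the command and parent columns rather than looking for a literal PID 1.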

📦 Real World

Kubernetes sends SIGTERM to PID 1 inside the pod container—the same rules apply. Platform teams at scale mandate exec-form ENTRYPOINT, STOPSIGNAL SIGTERM, and documented stop timeouts in Helm charts. A 10s default docker stop breaks Spring apps with 30s graceful shutdown.

🎯 Interview Tip

"Why do containers need an init process?" — PID 1 must reap zombies and forward signals. Shell-form CMD makes sh PID 1, which fails both jobs. Solutions: exec form, exec in scripts, tini/--init, or distroless with direct binary entrypoint.

Inspecting & debugging

Production debugging is a loop: observe state, compare to expected, narrow the blast radius. Docker's inspection commands expose containerd metadata, cgroup stats, and filesystem diffs—use them before reaching for ssh on the host.

Command reference

Command	What it shows	Key flags
docker logs	stdout/stderr from container process	-f follow, --tail 100, --since 1h, -t timestamps
docker exec	Run command in running container's namespaces	-it interactive shell, -u root user override
docker inspect	Full JSON config + state (IP, mounts, env, health)	--format '{{.State.OOMKilled}}'
docker stats	Live CPU, memory, net I/O, block I/O	--no-stream one-shot; reads cgroup counters
docker top	Processes in the container, listed with host PIDs	Optional ps options after the container name
docker diff	Files changed in writable layer vs image	A/C/D markers for added/changed/deleted
docker cp	Copy files host ↔ container filesystem	Works on stopped containers too

Logs — first stop for runtime errors

# Stream logs with timestamps
docker logs -f -t --tail 200 payments-api

# Logs since last deploy (approximate)
docker logs --since 2026-06-05T10:00:00 payments-api

# Exit code of stopped container
docker inspect --format '{{.State.ExitCode}}' payments-api

Logs are stored by the logging driver (default json-file on disk under /var/lib/docker/containers/<id>/). They survive container stop but not docker rm unless forwarded to a centralized driver (fluentd, awslogs, etc.).

Exec — interactive debugging

$ docker exec -it payments-api sh
/app $ ps aux
PID   USER     COMMAND
    1 10001    java -jar app.jar
   47 10001    sh

/app $ curl -s localhost:8080/actuator/health
{"status":"UP"}

⚠️ Pitfall

Distroless images have no shell: docker exec -it app sh fails. Use debug variants (:debug tags with busybox), docker debug (a Docker Desktop feature, not part of Docker Engine; check that your version includes it), or sidecar diagnostic containers with shared PID namespace (--pid=container:app).

Inspect — the source of truth

# High-value inspect templates
docker inspect payments-api --format 'Status={{.State.Status}} OOM={{.State.OOMKilled}} Exit={{.State.ExitCode}}'
docker inspect payments-api --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'
docker inspect payments-api --format '{{json .HostConfig.Memory}}'

# Pretty-print full state
docker inspect payments-api | jq '.[0].State'

Stats and top — live resource view

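docker stats reads the container's cgroup counters. On Linux the CLI subtracts page cache from the reported memory usage, so the figure is closer to the working set than to the raw cgroup total; compare it against the --memory limit. docker top lists processes with host PIDs, bridging container and host views for signal debugging.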

Diff and cp — filesystem forensics

docker diff shows what the writable layer changed—useful when an app writes config at runtime or malware is suspected. docker cp extracts heap dumps, thread dumps, or config files without installing tools in production images.

Host-level debugging: nsenter and /proc/<pid>/root

When docker exec isn't enough, pivot to the host. Every container process has a host PID (from docker inspect --format '{{.State.Pid}}'). From there:

/proc/<pid>/root — view the container's filesystem root from the host
nsenter -t <pid> -m -u -i -n -p sh — enter mount, UTS, IPC, net, and PID namespaces
/proc/<pid>/ns/ — compare namespace inodes with other containers or host

PID=$(docker inspect --format '{{.State.Pid}}' payments-api)

# Browse container rootfs from host
sudo ls /proc/$PID/root/app

# Enter the mount, UTS, IPC, network, and PID namespaces (requires root on host)
sudo nsenter -t "$PID" -m -u -i -n -p -- ps aux

# Compare network namespace with host (reading another user's ns links needs root)
sudo readlink /proc/"$PID"/ns/net
sudo readlink /proc/1/ns/net

🔬 Under the Hood

docker exec creates a new process inside the container's namespaces via runc—it does not SSH into a VM. The exec'd process shares the network stack, mount table, and (by default) PID namespace with the running container. nsenter does the same from outside dockerd, useful when the Docker socket is unavailable.

💡 Pro Tip

Build a personal cheat sheet of docker inspect --format one-liners for Status, OOMKilled, RestartCount, IP, and Memory limit. In incidents, these five fields answer 80% of "why did it die?" before you open the full JSON blob.
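
One possible starting point for that cheat sheet is a single function that prints all of those fields (plus the exit code) on one line. The name why_did_it_die is made up for this sketch; the template fields are the ones used in the inspect examples earlier in this section.

```shell
#!/bin/sh
# why_did_it_die CONTAINER: one-line summary of the fields that explain most exits.
why_did_it_die() {
  docker inspect "$1" --format \
    'Status={{.State.Status}} OOM={{.State.OOMKilled}} Exit={{.State.ExitCode}} Restarts={{.RestartCount}} MemLimit={{.HostConfig.Memory}} IPs={{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}'
}
```

Call it as why_did_it_die payments-api. MemLimit is printed in bytes, and 0 means no limit was set.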

Container resource limits

Without cgroup limits, a container is just a process group on the host—free to consume all RAM, all CPU, and fork without bound. Limits are not optional in shared environments; they are the contract between platform and application teams.

Memory limits

--memory sets the hard ceiling (memory.max in cgroup v2). When the container's charged memory (anonymous memory plus page cache) reaches the limit, the kernel first tries to reclaim cache; if that does not free enough, the OOM killer selects a process in the cgroup, usually your app, and sends SIGKILL.

Flag	Effect	Notes
--memory=512m	Hard RAM limit	Exceeding → OOMKill
--memory-swap=1g	RAM + swap combined cap	Must be ≥ memory; -1 = unlimited swap (dangerous)
--memory-reservation=256m	Soft limit (best-effort)	Eviction pressure before hard limit on some systems
--oom-kill-disable	Disable OOM killer for container	Rare; can hang host under memory pressure
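
The suffixes in these flags are binary units, which matters when you compare them with the byte values in cgroup files. A small sketch for whole-number sizes; to_bytes is a hypothetical helper, not part of the Docker CLI.

```shell
#!/bin/sh
# to_bytes SIZE: convert a Docker-style size (optional b, k, m, g suffix) to bytes.
# Docker treats these suffixes as binary units: 1m = 1024 * 1024 bytes.
# Whole numbers only; fractional sizes are not handled in this sketch.
to_bytes() {
  num=${1%[bkmgBKMG]}
  case "$1" in
    *[kK]) echo $((num * 1024)) ;;
    *[mM]) echo $((num * 1024 * 1024)) ;;
    *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
    *)     echo "$num" ;;
  esac
}

to_bytes 512m                                     # value to expect in memory.max
echo $(( $(to_bytes 1g) - $(to_bytes 512m) ))     # swap left by --memory=512m --memory-swap=1g
```

Both lines print 536870912: --memory=512m is 536870912 bytes, and because --memory-swap is the combined cap, a 1g total leaves the same amount for swap.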

CPU limits

--cpus=1.5 sets CFS quota: 150% of one core (or spread across cores). --cpu-shares (default 1024) only matters when CPUs are contended—it is relative weight, not a guarantee. For latency-sensitive services, prefer explicit --cpus over shares alone.

# Verify cgroup limits from host (cgroup v2; path assumes the systemd cgroup
# driver and differs with the cgroupfs driver)
CID=$(docker inspect --format '{{.Id}}' payments-api)   # full container ID
cat "/sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.max"
cat "/sys/fs/cgroup/system.slice/docker-${CID}.scope/cpu.max"

# Live usage
docker stats --no-stream payments-api
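
The cpu.max file holds two numbers, quota and period in microseconds, so the value to expect can be computed from --cpus. A sketch assuming the default 100ms period and that awk is available; cpus_to_cpu_max is a hypothetical helper.

```shell
#!/bin/sh
# cpus_to_cpu_max CPUS [PERIOD_US]
# Print "QUOTA PERIOD" as it should appear in cpu.max for a given --cpus value.
cpus_to_cpu_max() {
  awk -v cpus="$1" -v period="${2:-100000}" \
    'BEGIN { printf "%d %d\n", cpus * period + 0.5, period }'
}

cpus_to_cpu_max 1.5   # 150000 100000
```

In other words, --cpus 1.5 lets the container run for 150ms of CPU time in every 100ms window, summed across all cores.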

OOMKilled — diagnosis

$ docker inspect payments-api --format 'OOM={{.State.OOMKilled}} Exit={{.State.ExitCode}}'
OOM=true Exit=137

$ dmesg | grep -i 'out of memory' | tail -5
Memory cgroup out of memory: Killed process 91234 (java) total-vm:2147484kB, anon-rss:524288kB

$ docker logs --tail 20 payments-api
# Often no graceful message — SIGKILL is not catchable

Exit code 137 = 128 + 9 (SIGKILL)—strong hint for OOM or docker kill. Java without container-aware heap sizing sets -Xmx from host RAM, not cgroup limit—use -XX:+UseContainerSupport (default Java 10+) and prefer -XX:MaxRAMPercentage over hard-coded -Xmx.
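
The same 128 + N rule decodes other signal exits. A small sketch with a hypothetical explain_exit helper; the signal numbers are the common Linux ones, and anything else is reported by number.

```shell
#!/bin/sh
# explain_exit CODE: describe a container exit code.
explain_exit() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      2)  name=SIGINT ;;
      6)  name=SIGABRT ;;
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name="signal" ;;
    esac
    echo "killed by $name ($sig)"
  else
    echo "application error ($code)"
  fi
}

explain_exit 137   # killed by SIGKILL (9)
explain_exit 143   # killed by SIGTERM (15)
```

137 alone does not tell you who sent the SIGKILL; check OOMKilled in docker inspect to separate a cgroup OOM from docker kill or a stop timeout.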

Danger of no limits

Risk	Without limits	With limits
Memory hog	Container consumes host RAM → host OOM kills random processes	OOM isolated to container cgroup
CPU starvation	Batch job pegs all cores; latency spikes on co-hosted services	--cpus caps burn; shares allocate fairly
Fork bomb	Unbounded fork() exhausts host PIDs	--pids-limit stops cascade
Blast radius	One bad deploy takes down the node	Failed container; neighbors survive

🔒 Security

DoS via resource exhaustion is a security issue, not just performance. Always set --memory, --cpus, and --pids-limit on untrusted or multi-tenant workloads. Pair with --ulimit at run time and a read-only rootfs to reduce attack surface.

⚖️ Trade-off

Tight limits vs headroom: Too-low memory causes OOM thrashing under legitimate spikes; too-high limits waste cluster capacity and increase noisy-neighbor risk. Size from load tests at P99, add 20–30% headroom, and alert on docker stats memory >80% sustained—not just on OOM events.
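
The sizing rule is plain arithmetic. A sketch in whole MiB, with hypothetical helper names; the P99 figure is whatever your load test measured, and 400 MiB below is only an example input.

```shell
#!/bin/sh
# recommend_limit P99_MIB HEADROOM_PCT: P99 plus headroom, rounded up to a whole MiB.
recommend_limit() {
  echo $(( ($1 * (100 + $2) + 99) / 100 ))
}

# alert_threshold LIMIT_MIB [PCT]: usage in MiB at which to alert (default 80%).
alert_threshold() {
  echo $(( $1 * ${2:-80} / 100 ))
}

limit=$(recommend_limit 400 25)   # 400 MiB at P99 plus 25% headroom
echo "--memory ${limit}m, alert above $(alert_threshold "$limit") MiB sustained"
```

With these inputs the limit comes out at 500 MiB and the alert line at 400 MiB, so the alert fires roughly where the load test said normal peak usage sits.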

📦 Real World

Kubernetes resources.limits land in the same cgroup fields as docker run --memory and --cpus; requests mainly drive scheduling and relative CPU weight. Platform SREs often enforce limit ratios in admission webhooks. A Java service OOMing at a 512Mi limit but "working fine" locally on a 16Gi laptop is the most common container resource ticket.

🎯 Interview Tip

"What happens when a container exceeds its memory limit?" — Kernel OOM killer targets a process in the cgroup; Docker sets OOMKilled=true, exit 137. Not the same as JVM OutOfMemoryError (which is catchable)—cgroup OOM is SIGKILL. Mention Java UseContainerSupport and sizing -Xmx below cgroup limit.