Container Internals & Linux Primitives
Docker is not magic—it is a thin orchestration layer over kernel features that have existed for years. Every docker run ends in clone(), cgroup writes, and an OverlayFS mount. Understand these primitives and production debugging becomes mechanical instead of mystical.
Linux namespaces
Namespaces isolate a process's view of system resources. The process still shares the host kernel, but believes it has its own PID tree, network stack, mount table, and hostname. Think of namespaces as passports—each container carries credentials that say "you live in world X," even though many worlds share one physical machine.
What namespaces do
Without namespaces, ps aux inside a container would list every process on the host. With PID namespaces, the container sees only its own process tree. NET namespaces give each container its own eth0, routing table, and iptables rules. MNT namespaces control which filesystem paths are visible—this is how the container's root filesystem differs from the host's.
The eight namespace types (Linux 5.6+)
| Namespace | Isolates | Docker relevance | Pitfall |
|---|---|---|---|
| PID | Process tree; PID 1 inside container | Container init = PID 1; orphan reaping responsibility | Host PIDs visible with --pid=host |
| NET | Interfaces, routes, iptables, sockets | Bridge networking, port publishing via NAT | --net=host shares host stack—no isolation |
| MNT | Filesystem mount points | Container rootfs, volume/bind mounts | Bind mounts can leak host paths into container view |
| UTS | Hostname and NIS domain name | --hostname flag | Cosmetic only—does not affect DNS resolution by itself |
| IPC | Shared memory, semaphores, message queues | Isolates /dev/shm between containers | --ipc=host needed for some legacy SHM apps |
| USER | UID/GID mapping (rootless containers) | Container root maps to unprivileged host user | File ownership on bind mounts must match mapped IDs |
| CGROUP | View of cgroup hierarchy | Container sees its own cgroup subtree (v2) | Misconfigured delegation breaks resource limits |
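| TIME (5.6+) | CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets | Not created by Docker by default; niche uses such as checkpoint/restore | Does not virtualize wall-clock time (CLOCK_REALTIME); setting offsets requires CAP_SYS_TIME or appropriate delegation |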
How Docker creates namespaces
runc calls the clone() syscall (or unshare() for namespace-only changes) with CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC flags. The new process enters fresh namespaces before executing your container entrypoint.
# Run a container and capture its PID on the host
CID=$(docker run -d nginx:alpine)
PID=$(docker inspect --format '{{.State.Pid}}' $CID)
# List namespace links for the container process
ls -la /proc/$PID/ns/
# Compare host init namespace vs container (different inode numbers = isolated)
readlink /proc/1/ns/pid
readlink /proc/$PID/ns/pid
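The flag combination passed to clone() is just a bitmask, which can be made concrete in a few lines. This is a minimal sketch, assuming the flag values from the kernel header linux/sched.h; the helper names build_mask and decode_mask are illustrative and not part of runc.

```python
# Namespace flag values as defined in the kernel header <linux/sched.h>.
CLONE_FLAGS = {
    "CLONE_NEWTIME": 0x00000080,
    "CLONE_NEWNS": 0x00020000,      # mount namespace (the oldest one, hence the generic name)
    "CLONE_NEWCGROUP": 0x02000000,
    "CLONE_NEWUTS": 0x04000000,
    "CLONE_NEWIPC": 0x08000000,
    "CLONE_NEWUSER": 0x10000000,
    "CLONE_NEWPID": 0x20000000,
    "CLONE_NEWNET": 0x40000000,
}


def build_mask(names):
    """OR the named flags together, as a caller of clone() or unshare() would."""
    mask = 0
    for name in names:
        mask |= CLONE_FLAGS[name]
    return mask


def decode_mask(mask):
    """Return the sorted names of every namespace flag set in the mask."""
    return sorted(name for name, bit in CLONE_FLAGS.items() if mask & bit)


# The five flags named in the paragraph above.
requested = ["CLONE_NEWPID", "CLONE_NEWNET", "CLONE_NEWNS", "CLONE_NEWUTS", "CLONE_NEWIPC"]
mask = build_mask(requested)
print(hex(mask))
print(decode_mask(mask))
```

Decoding a mask this way can help when reading strace output for a container start, where the flags appear as a single OR-ed argument.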
Namespace inodes in /proc/<pid>/ns/ are persistent handles. Two processes sharing the same inode number for net are in the same network namespace—useful when debugging sidecar patterns or docker run --network=container:<name>.
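The inode comparison can be scripted. The sketch below reads the same symlinks that readlink shows; the function names and the proc_root parameter are assumptions added for illustration (proc_root only exists so the helper can be pointed at a test directory).

```python
import os


def namespace_id(pid, ns_type, proc_root="/proc"):
    """Return the namespace identifier, e.g. 'net:[4026531840]', for a process."""
    return os.readlink(os.path.join(proc_root, str(pid), "ns", ns_type))


def same_namespace(pid_a, pid_b, ns_type="net", proc_root="/proc"):
    """True when both processes hold the same namespace of the given type."""
    return namespace_id(pid_a, ns_type, proc_root) == namespace_id(pid_b, ns_type, proc_root)
```

Calling same_namespace with two host PIDs obtained from docker inspect should return True for a sidecar started with --network=container:<name>, and False for two containers on separate networks. Reading another user's /proc/<pid>/ns entries usually requires root.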
"What's the difference between a container and a VM?" — VMs virtualize hardware and run a guest kernel. Containers virtualize the process view via namespaces on a shared kernel. Container escape = breaking out of namespace/cgroup/seccomp confinement to affect the host.
cgroups (control groups)
If namespaces are passports, cgroups are resource budgets. They limit, account for, and isolate CPU, memory, I/O, and process count. Without cgroups, a container could fork-bomb the host or consume all RAM.
cgroups v1 vs cgroups v2
| Aspect | cgroups v1 (legacy) | cgroups v2 (unified hierarchy) |
|---|---|---|
| Hierarchy | Multiple trees per controller (cpu, memory, …) | Single unified tree at /sys/fs/cgroup |
| Default on | RHEL 7/8, Ubuntu 21.04 and older | Fedora 31+, Ubuntu 21.10+, RHEL 9; supported by Docker 20.10+ |
| Resource delegation | Complex edge cases with systemd | Cleaner delegation for rootless containers |
| Memory + swap | memory.limit_in_bytes plus memory.memsw.limit_in_bytes (RAM + swap combined) | memory.max plus memory.swap.max (swap only) |
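Which hierarchy a host uses can be checked from the filesystem. A minimal sketch, assuming the usual mount point /sys/fs/cgroup: the unified hierarchy exposes a cgroup.controllers file at its root, while v1 exposes one directory per controller. The function name and the root parameter are illustrative.

```python
from pathlib import Path


def cgroup_version(root="/sys/fs/cgroup"):
    """Return 2 for the unified hierarchy, 1 for per-controller trees, else None."""
    base = Path(root)
    if (base / "cgroup.controllers").is_file():
        return 2
    if (base / "memory").is_dir() or (base / "cpu").is_dir():
        return 1
    return None


print(cgroup_version())
```

Hybrid setups that mount v2 at a subdirectory are reported as v1 by this check, which matches how most tooling treats them.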
Resource controllers
| Controller | Purpose | Docker flag | Kernel file (v2) |
|---|---|---|---|
| cpu | CPU bandwidth limit | --cpus=1.5 | cpu.max (quota/period) |
| cpu.weight | Relative CPU share when contended | --cpu-shares=512 | cpu.weight |
| memory | Hard memory limit | --memory=512m | memory.max |
| memory.swap | Swap limit; the Docker flag sets RAM + swap combined, while the v2 file holds swap only | --memory-swap | memory.swap.max |
| io | Block I/O weight and limits | --device-read-bps | io.max |
| pids | Max number of processes | --pids-limit=100 | pids.max |
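The size suffix on --memory is a binary multiple, so the value that lands in memory.max is easy to predict. This is a rough sketch of that conversion, not Docker's actual parser; the function name is made up and only the b, k, m, and g suffixes are handled.

```python
import re

# Binary multiples: 1k = 1024 bytes, 1m = 1024 * 1024 bytes, and so on.
_UNITS = {"": 1, "b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def memory_flag_to_bytes(value):
    """Convert a size such as '512m' into the byte count written to memory.max."""
    match = re.fullmatch(r"(\d+)([bkmg]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


print(memory_flag_to_bytes("512m"))
```

Comparing this number with the contents of memory.max for a running container is a quick way to confirm that a limit was actually applied.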
What happens when memory limit is exceeded
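The kernel's OOM killer selects a process in the cgroup and sends SIGKILL. Docker reports this as OOMKilled in docker inspect. If the victim is the container's PID 1, the whole container exits; if it is a child process, the container may keep running in a degraded state. JVMs that predate container support (before JDK 10 and 8u191, where -XX:+UseContainerSupport became the default) size their heap from host RAM rather than the cgroup limit and get OOM-killed unexpectedly.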
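# Named (illustrative name) instead of --rm, so the container can still be inspected after it exits
$ docker run --name oom-demo --memory 32m progrium/stress --vm 1 --vm-bytes 64M
stress: FAIL: cannot allocate 67108864 bytes
Killed
$ docker inspect --format '{{.State.OOMKilled}}' oom-demo
true
# cgroup v2 with the systemd driver; the scope exists only while the container is running
$ cat /sys/fs/cgroup/system.slice/docker-*.scope/memory.events
low 0
high 0
max 1
oom 1
oom_kill 1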
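The memory.events counters shown above are plain "name value" pairs, so they are easy to check from a monitoring script. A minimal sketch; the function name is illustrative and the sample text simply repeats the output above.

```python
def parse_memory_events(text):
    """Parse cgroup v2 memory.events content into a dict of integer counters."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


sample = "low 0\nhigh 0\nmax 1\noom 1\noom_kill 1\n"
events = parse_memory_events(sample)
# A non-zero oom_kill counter means the kernel killed at least one process in this cgroup.
print(events["oom_kill"] > 0)
```

The counters are cumulative for the life of the cgroup, so alerting on an increase between two reads is usually more useful than alerting on the absolute value.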
Unbounded container resources in a shared environment are a production incident waiting to happen. One memory leak can take down the node. Always set --memory and --cpus, including in Docker Compose via the deploy.resources block.
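--cpus=1.5 maps to a CFS bandwidth quota of 150000 µs per 100000 µs period (150% of one core). On cgroups v1 this is cpu.cfs_quota_us = 150000 with cpu.cfs_period_us = 100000; on cgroups v2 the same pair is written to cpu.max as "150000 100000". For JVM containers, pair memory limits with -XX:MaxRAMPercentage=75.0.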
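The quota arithmetic is a single multiplication. The sketch below assumes the default 100000 µs period and formats the result the way cgroup v2 stores it in cpu.max; the function name is made up for this example.

```python
def cpus_to_cpu_max(cpus, period_us=100_000):
    """Return the 'quota period' string that a --cpus value corresponds to."""
    quota_us = int(round(cpus * period_us))
    return f"{quota_us} {period_us}"


print(cpus_to_cpu_max(1.5))
print(cpus_to_cpu_max(0.5))
```

A container limited this way can still use several cores at once; it is throttled when its total runtime within one period reaches the quota.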
Union filesystems
Image layers are read-only diffs stacked like git commits. The container adds a thin writable layer on top. OverlayFS merges them into one coherent filesystem view—the container believes it has a normal root filesystem.
OverlayFS (default storage driver)
OverlayFS was merged into the mainline kernel in Linux 3.18, and Docker's overlay2 driver (which needs kernel 4.0 or newer) is the default on current Linux installs. Four directories matter:
- lowerdir: read-only image layers (in the mount option, listed with the top-most layer first)
- upperdir: container's read-write layer (all modifications land here)
- workdir: OverlayFS internal scratch space (not visible to container)
- merged: unified view mounted as container rootfs
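These directories come together in the options string of the overlay mount. A minimal sketch of how that string is assembled, assuming layers are supplied in build order (base image first); the paths and the function name are hypothetical. The kernel expects lowerdir entries separated by colons with the top-most layer first, hence the reversal.

```python
def overlay_mount_options(layers_bottom_to_top, upperdir, workdir):
    """Build the -o option string for an overlay mount from layer directories."""
    # The kernel reads lowerdir left to right as top to bottom.
    lowerdir = ":".join(reversed(layers_bottom_to_top))
    return f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"


# Hypothetical paths, for illustration only.
options = overlay_mount_options(
    ["/layers/base", "/layers/app"],
    upperdir="/containers/c1/upper",
    workdir="/containers/c1/work",
)
print(options)
```

The resulting string has the same shape as the options shown by mount | grep overlay on a Docker host, where the layer directories live under the storage driver's data root.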
Copy-on-write (CoW)
Reads traverse the stack: upperdir first, then lower layers from top to bottom until the file is found. A write to a file that exists only in a lower layer first copies the entire file into upperdir (copy-up), then the write proceeds against that copy. This first-write penalty grows with file size, which is why databases should use volumes, not the container layer.
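The lookup and copy-up rules can be modelled with plain dictionaries. This toy model is a sketch for intuition only: it ignores whiteouts, directories, and metadata, and the class and method names are invented for this example.

```python
class OverlayView:
    """Toy model of an overlay: read-only lower layers plus one writable upper layer."""

    def __init__(self, lower_layers_top_first):
        self.lowers = lower_layers_top_first  # list of {path: content}, never modified
        self.upper = {}                       # the container's writable layer

    def read(self, path):
        """Return (content, source), searching upperdir first and then each lower layer."""
        if path in self.upper:
            return self.upper[path], "upperdir"
        for index, layer in enumerate(self.lowers):
            if path in layer:
                return layer[path], f"lowerdir[{index}]"
        raise FileNotFoundError(path)

    def append(self, path, data):
        """Append to a file; return True when the write triggered a copy-up."""
        copied_up = False
        if path not in self.upper:
            for layer in self.lowers:
                if path in layer:
                    self.upper[path] = layer[path]  # the whole file is copied first
                    copied_up = True
                    break
            else:
                self.upper[path] = ""  # brand-new file, nothing to copy
        self.upper[path] += data
        return copied_up


view = OverlayView([{"/etc/app.conf": "a=1\n"}, {"/bin/sh": "binary"}])
print(view.read("/etc/app.conf"))
print(view.append("/etc/app.conf", "b=2\n"))
print(view.read("/etc/app.conf"))
```

Only the first write to a lower-layer file pays the copy; later writes go straight to the copy in upperdir, and the image layer itself is never changed.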
Layer sharing
Ten containers from the same image share identical lower layers on disk—only their upperdirs differ. This is why pulling nginx:alpine once benefits every subsequent container using that image.
# Find the merged mount point for a running container
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' mycontainer
# See overlay mount on the host
mount | grep overlay
# Storage driver in use
docker info --format '{{.Driver}}'
Other storage drivers
| Driver | Status | Notes |
|---|---|---|
| overlay2 | Default (recommended) | OverlayFS on an xfs (ftype=1) or ext4 backing filesystem; mature and fast for reads |
| devicemapper | Legacy (RHEL 7) | Thin provisioning; deprecated, avoid for new deployments |
| btrfs / zfs | Niche | Native CoW filesystems; operational complexity limits adoption |
| vfs | Testing only | No CoW—full copy per layer; extremely slow, no layer sharing |
Container layer vs volume: container layer is convenient but ephemeral and CoW-expensive for heavy writes. Volumes bypass OverlayFS upperdir overhead and survive docker rm. Bind mounts add host path coupling but enable dev hot-reload.
runc & the OCI runtime
The Open Container Initiative (OCI) defines portable specs. runc is the reference implementation: a small Go binary that reads a bundle and spawns an isolated process. Docker, containerd, Kubernetes, and Podman all end up invoking an OCI runtime such as runc.
Container lifecycle
- runc create — set up namespaces, cgroups, rootfs; create container but don't start process
- runc start — execute the configured process (your ENTRYPOINT/CMD)
- Process runs — PID 1 inside container executes application logic
- runc delete — tear down cgroups, unmount, cleanup after exit
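The four steps above form a small state machine. The sketch below is a simplification under stated assumptions: the state names "created", "running", and "stopped" follow the OCI runtime spec, "absent" is a label invented here, and edge cases such as forced deletion are left out.

```python
# Allowed transitions for the simplified lifecycle described in the list above.
TRANSITIONS = {
    ("absent", "create"): "created",
    ("created", "start"): "running",
    ("running", "exit"): "stopped",
    ("stopped", "delete"): "absent",
}


def apply(state, event):
    """Return the next state, or raise ValueError for an out-of-order operation."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"cannot {event!r} a container in state {state!r}") from None


def run_lifecycle(events, state="absent"):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = apply(state, event)
    return state


print(run_lifecycle(["create", "start", "exit", "delete"]))
```

Splitting create from start is what lets a higher-level manager set up networking or hooks after the namespaces exist but before the application process runs.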
The OCI bundle: config.json
An OCI bundle is a directory containing rootfs/ and config.json. The config specifies the process (args, env, cwd), mounts, Linux namespaces, cgroups path, capabilities, and seccomp profile.
{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": ["nginx", "-g", "daemon off;"],
    "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
    "cwd": "/"
  },
  "root": { "path": "rootfs", "readonly": false },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" },
      { "type": "ipc" },
      { "type": "uts" }
    ],
    "cgroupsPath": "/docker/abc123"
  }
}
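A config.json can be read programmatically to see which namespaces a container receives. A minimal sketch, assuming the OCI convention that an entry with only a type requests a new namespace while an entry with a path joins an existing one; the function name, the sample document, and the PID in the path are illustrative.

```python
import json


def namespace_plan(config_text):
    """Map each namespace type in an OCI config to 'new' or the path it joins."""
    config = json.loads(config_text)
    plan = {}
    for entry in config.get("linux", {}).get("namespaces", []):
        plan[entry["type"]] = entry.get("path", "new")
    return plan


# Illustrative config: the network namespace is joined rather than created.
SAMPLE = """
{
  "linux": {
    "namespaces": [
      {"type": "pid"},
      {"type": "network", "path": "/proc/1234/ns/net"},
      {"type": "mount"}
    ]
  }
}
"""

for ns_type, source in namespace_plan(SAMPLE).items():
    print(f"{ns_type}: {source}")
```

Namespace types missing from the list are not isolated at all: the process inherits them from the runtime that launched it.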
Docker architecture layers
flowchart TB
  CLI["docker CLI"]
  D["dockerd\nREST API · images · networks"]
  C["containerd\npull · snapshot · shim"]
  S["containerd-shim\nstdio · exit status"]
  R["runc\nclone · cgroups · exec"]
  K["Linux kernel\nnamespaces · OverlayFS"]
  CLI --> D
  D --> C
  C --> S
  S --> R
  R --> K
containerd sits above runc—it manages image content, creates snapshots (OverlayFS mounts), and supervises shims. containerd-shim keeps the container alive after dockerd disconnects and reports exit codes. This separation is why Kubernetes can use containerd directly without dockerd.
On a running host you can find OCI bundles under /run/containerd/io.containerd.runtime.v2.task/ (path varies by version). The bundle's config.json is exactly what runc consumed—inspect it to see real namespace and seccomp settings, not just Dockerfile intent.
Docker, Inc. created containerd and donated it to the CNCF in 2017. Kubernetes 1.24+ removed the built-in dockershim, so nodes use containerd or CRI-O directly. Your Dockerfile is unchanged; only the node runtime differs.
"Docker vs containerd vs runc?": runc spawns one container. containerd manages container lifecycle and images. dockerd adds developer UX (build, compose, bridge networks). Kubernetes needs only a CRI-compatible runtime such as containerd or CRI-O.
Rootless Docker
Running the Docker daemon as root means a container escape could mean host root. Rootless mode maps container UID 0 to an unprivileged host user via USER namespaces—dramatically shrinking blast radius.
Why rootless matters
- Docker daemon compromise ≠ immediate host root access
- Developers can run containers without sudo on shared machines
- Aligns with least-privilege and CIS hardening guidance
How it works
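A user namespace maps UIDs inside the container to different UIDs on the host. In rootless mode, container root (UID 0) maps to the unprivileged user who runs the daemon, and the remaining container UIDs map into that user's subordinate range (e.g. starting at 100000). Rootful Docker with userns-remap instead maps container root itself to a high subordinate UID such as 100000. Either way, processes inside the container believe they are root while the host kernel sees an unprivileged user. The subordinate ranges are defined in /etc/subuid and /etc/subgid.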
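The translation the kernel performs can be reproduced from /proc/<pid>/uid_map, where each line holds the first ID inside the namespace, the first ID outside it, and the length of the range. This is a small sketch; the function names and both sample maps are illustrative, not taken from a real host.

```python
def parse_uid_map(text):
    """Parse uid_map content into (inside_start, outside_start, length) tuples."""
    return [tuple(int(field) for field in line.split())
            for line in text.splitlines() if line.strip()]


def to_host_uid(container_uid, uid_map):
    """Translate a UID seen inside the container to the UID the host kernel sees."""
    for inside_start, outside_start, length in uid_map:
        if inside_start <= container_uid < inside_start + length:
            return outside_start + (container_uid - inside_start)
    return None  # unmapped IDs have no host equivalent


# Illustrative map: container root is remapped to a high subordinate UID.
remapped = parse_uid_map("0 100000 65536\n")
# Illustrative map: container root is the invoking user (UID 1000), the rest is subordinate.
rootless = parse_uid_map("0 1000 1\n1 100000 65536\n")

print(to_host_uid(0, remapped), to_host_uid(0, rootless), to_host_uid(33, rootless))
```

The same arithmetic explains bind-mount ownership surprises: a file written by UID 33 inside the container appears on the host under the translated UID, not 33.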
Limitations
| Limitation | Reason | Workaround |
|---|---|---|
| Cannot bind ports < 1024 | Unprivileged users cannot bind privileged ports | Use ports ≥ 1024, or sysctl net.ipv4.ip_unprivileged_port_start=80 |
| Some storage drivers unsupported | Requires root for device setup | Use overlay2 (kernel 5.11+) or fuse-overlayfs |
| AppArmor/SELinux edge cases | Profile loading may need root | Use default profiles; test before prod |
| Performance overhead | Additional namespace mapping layers | Usually negligible for app workloads |
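The port restriction in the first row depends on one sysctl, which can be read from /proc. A minimal sketch, assuming the common kernel default of 1024 when the file is unavailable; the function names are made up for this example.

```python
def read_unprivileged_port_start(
    path="/proc/sys/net/ipv4/ip_unprivileged_port_start", default=1024
):
    """Return the lowest port an unprivileged process may bind on this host."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return default  # assumed default when the sysctl cannot be read


def needs_privilege(port, unprivileged_port_start=1024):
    """True when binding the port requires extra privilege."""
    return port < unprivileged_port_start


start = read_unprivileged_port_start()
for port in (80, 443, 8080):
    print(port, needs_privilege(port, start))
```

Lowering the sysctl affects every unprivileged process on the host, so publishing on a high port and fronting it with a proxy is often the less invasive choice.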
Podman: rootless-first alternative
Podman runs daemonless: each podman run is a direct fork/exec of an OCI runtime (crun or runc, supervised by conmon) under your user session, with no containerd involved. Rootless is the default. podman-docker provides a docker CLI alias. RHEL 8+ ships Podman in place of Docker Engine, and it is a common choice in OpenShift developer environments.
| Aspect | Docker (rootful) | Podman (rootless default) |
|---|---|---|
| Daemon | Central dockerd (root) | None—fork/exec per container |
| Rootless | Opt-in setup | Default |
| Compose | docker compose | podman compose (largely compatible) |
| Kubernetes | Not used on nodes since dockershim was removed in 1.24 | Generates pod YAML via podman generate kube |
| OCI images | Fully compatible | Fully compatible—same registries |
Even in rootless mode, never mount sensitive host paths (/etc, /var/run/docker.sock) into containers. A malicious container can still read anything the mapped host user can access.
Docker vs Podman vs containerd: Docker wins on developer ergonomics and ecosystem tooling. Podman wins on rootless security and daemonless ops. containerd wins as the shared production runtime under Kubernetes. Architects often standardize on OCI images + one builder, with runtime choice per environment.