Container Internals & Linux Primitives

Docker is not magic—it is a thin orchestration layer over kernel features that have existed for years. Every docker run ends in clone(), cgroup writes, and an OverlayFS mount. Understand these primitives and production debugging becomes mechanical instead of mystical.

Linux namespaces

Namespaces isolate a process's view of system resources. The process still shares the host kernel, but believes it has its own PID tree, network stack, mount table, and hostname. Think of namespaces as passports—each container carries credentials that say "you live in world X," even though many worlds share one physical machine.

What namespaces do

Without namespaces, ps aux inside a container would list every process on the host. With PID namespaces, the container sees only its own process tree. NET namespaces give each container its own eth0, routing table, and iptables rules. MNT namespaces control which filesystem paths are visible—this is how the container's root filesystem differs from the host's.

The eight namespace types (Linux 5.6+)

Namespace | Isolates | Docker relevance | Pitfall
PID | Process tree; PID 1 inside container | Container init = PID 1; orphan reaping responsibility | Host PIDs visible with --pid=host
NET | Interfaces, routes, iptables, sockets | Bridge networking, port publishing via NAT | --net=host shares host stack: no isolation
MNT | Filesystem mount points | Container rootfs, volume/bind mounts | Bind mounts can leak host paths into container view
UTS | Hostname and NIS domain name | --hostname flag | Cosmetic only: does not affect DNS resolution by itself
IPC | Shared memory, semaphores, message queues | Isolates /dev/shm between containers | --ipc=host needed for some legacy SHM apps
USER | UID/GID mapping (rootless containers) | Container root maps to unprivileged host user | File ownership on bind mounts must match mapped IDs
CGROUP | View of cgroup hierarchy | Container sees its own cgroup subtree (v2) | Misconfigured delegation breaks resource limits
TIME (5.6+) | Boot-time and monotonic clock offsets | Niche: testing against shifted monotonic/boot clocks | Wall-clock time (CLOCK_REALTIME) is not virtualized; setting offsets requires CAP_SYS_TIME or appropriate delegation

How Docker creates namespaces

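runc creates the container's namespaces with the clone() and unshare() syscalls, passing CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC (plus CLONE_NEWCGROUP when a private cgroup namespace is requested, which is Docker's default on cgroup v2 hosts). The new process is already inside its fresh namespaces when it executes your container entrypoint.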
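The same set of namespaces can be created by hand with unshare(1) from util-linux. This is a minimal sketch rather than what runc does internally: the helper name in_new_namespaces is hypothetical, and the --user --map-root-user pair is an added assumption so the command can run without sudo on hosts that allow unprivileged user namespaces.

```shell
# Hypothetical helper: run a command in fresh USER, PID, NET, MNT, UTS and IPC
# namespaces, roughly the CLONE_NEW* flags listed above plus a user namespace.
#   --fork       : fork before exec so the command becomes PID 1 of the new PID namespace
#   --mount-proc : mount a fresh /proc (implies a new mount namespace)
in_new_namespaces() {
  unshare --user --map-root-user \
          --pid --fork --mount-proc \
          --net --uts --ipc \
          "$@"
}

# Example (not run here): the inner shell should report PID 1, its own
# hostname, and only a loopback interface.
#   in_new_namespaces sh -c 'hostname demo; hostname; echo "pid=$$"; cat /proc/net/dev'
```

Nothing here sets up cgroups, a root filesystem, or seccomp, so the result is an isolated process, not a container.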

bash
# Run a container and capture its PID on the host
CID=$(docker run -d nginx:alpine)
PID=$(docker inspect --format '{{.State.Pid}}' "$CID")

# List namespace links for the container process
# (root is needed to read /proc/<pid>/ns of processes you do not own)
sudo ls -la "/proc/$PID/ns/"

# Compare host init namespace vs container (different inode numbers = isolated)
sudo readlink /proc/1/ns/pid
sudo readlink "/proc/$PID/ns/pid"

🔬 Under the Hood

Each entry in /proc/<pid>/ns/ is a symlink that reads as type:[inode], and holding it open (or bind-mounting it) keeps the namespace alive. Two processes that show the same inode number for net are in the same network namespace, which is useful when debugging sidecar patterns or docker run --network=container:<name>.
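The inode comparison described above is easy to script. A minimal sketch, assuming a Linux /proc and enough privilege to read both processes' ns entries; the helper name same_ns is hypothetical.

```shell
# Hypothetical helper: succeed if two PIDs share the namespace of a given type
# (pid, net, mnt, uts, ipc, user, cgroup, time). Returns 2 if either link is unreadable.
same_ns() {
  a=$(readlink "/proc/$1/ns/$3") || return 2   # e.g. net:[4026531840]
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$a" = "$b" ]
}

# A process always shares every namespace with itself.
if same_ns "$$" "$$" net; then
  echo "same network namespace"
fi
```

To check a sidecar, pass the two host PIDs reported by docker inspect --format '{{.State.Pid}}' for each container.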

🎯 Interview Tip

"What's the difference between a container and a VM?" — VMs virtualize hardware and run a guest kernel. Containers virtualize the process view via namespaces on a shared kernel. Container escape = breaking out of namespace/cgroup/seccomp confinement to affect the host.

cgroups (control groups)

If namespaces are passports, cgroups are resource budgets. They limit, account for, and isolate CPU, memory, I/O, and process count. Without cgroups, a container could fork-bomb the host or consume all RAM.

cgroups v1 vs cgroups v2

Aspect | cgroups v1 (legacy) | cgroups v2 (unified hierarchy)
Hierarchy | Separate tree per controller (cpu, memory, …) | Single unified tree at /sys/fs/cgroup
Default on | RHEL 7, older Ubuntu | Modern Fedora, Ubuntu 22.04+ (supported by Docker 20.10+)
Resource delegation | Complex edge cases with systemd | Cleaner delegation for rootless containers
Memory + swap | memory.limit_in_bytes plus memory.memsw.limit_in_bytes (RAM + swap combined) | memory.max plus memory.swap.max (swap only)
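Which hierarchy a host uses can be read from the filesystem type mounted at /sys/fs/cgroup. The sketch below assumes GNU stat; the helper name cgroup_version_for is hypothetical.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version.
# cgroup2fs = unified v2 hierarchy; tmpfs = v1 or hybrid (per-controller mounts underneath).
cgroup_version_for() {
  case "$1" in
    cgroup2fs) echo v2 ;;
    tmpfs)     echo v1 ;;
    *)         echo unknown ;;
  esac
}

# Report the version on this host; prints "unknown" if the path or GNU stat is missing.
cgroup_version_for "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
```

Docker exposes the same information through docker info --format '{{.CgroupVersion}}'.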

Resource controllers

Controller | Purpose | Docker flag | Kernel file (v2)
cpu | CPU bandwidth limit | --cpus=1.5 | cpu.max (quota and period)
cpu | Relative CPU share when contended | --cpu-shares=512 | cpu.weight
memory | Hard memory limit | --memory=512m | memory.max
memory | Swap limit | --memory-swap (RAM + swap combined) | memory.swap.max (swap only)
io | Block I/O weight and limits | --device-read-bps | io.max
pids | Max number of processes | --pids-limit=100 | pids.max
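One mapping in this table is easy to get wrong: Docker's --memory-swap counts RAM plus swap, while the v2 file memory.swap.max counts swap alone, so the runtime has to subtract. A minimal sketch of that arithmetic, with byte values as inputs; the helper name swap_max_for is hypothetical and the code illustrates the conversion rather than reproducing runc's source.

```shell
# Hypothetical helper: derive a cgroup v2 memory.swap.max value (bytes)
# from Docker-style limits given in bytes.
#   $1 = --memory      (RAM limit)
#   $2 = --memory-swap (RAM + swap limit, or -1 for unlimited swap)
swap_max_for() {
  if [ "$2" -eq -1 ]; then
    echo max                 # cgroup v2 spells "no limit" as the literal string max
  else
    echo $(( $2 - $1 ))      # swap-only allowance
  fi
}

swap_max_for 536870912 1073741824   # --memory=512m --memory-swap=1g  -> 536870912
swap_max_for 536870912 536870912    # equal values: no swap allowed   -> 0
swap_max_for 536870912 -1           # unlimited swap                  -> max
```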

What happens when memory limit is exceeded

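When a cgroup reaches its memory limit and the kernel cannot reclaim enough pages, the OOM killer selects a process in that cgroup and sends SIGKILL. Docker reports this as OOMKilled in docker inspect. The victim is not necessarily PID 1: if PID 1 is killed, the container exits (exit code 137), while a killed child process can leave the container running but broken. JVMs that predate container support (before JDK 10 and 8u191, where -XX:+UseContainerSupport became the default) size their heap from host RAM rather than the cgroup limit and get OOM-killed unexpectedly.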

terminal — OOMKilled
$ docker run --name oom-demo --memory 32m progrium/stress --vm 1 --vm-bytes 64M
stress: FAIL: cannot allocate 67108864 bytes
Killed

$ docker inspect --format '{{.State.OOMKilled}}' oom-demo
true

# Readable only while the container's cgroup still exists (cgroup v2, systemd cgroup driver)
$ cat /sys/fs/cgroup/system.slice/docker-*.scope/memory.events
low 0
high 0
max 1
oom 1
oom_kill 1

⚠️ Pitfall

Running containers without resource limits in a shared environment is a production incident waiting to happen: one memory leak can take down the node. Always set --memory and --cpus, including in Docker Compose via the deploy.resources.limits block.
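The oom_kill counter in memory.events, shown in the terminal output above, is a convenient signal to alert on. A minimal sketch of extracting it; the helper name oom_kills is hypothetical, and the sample file stands in for a real cgroup path.

```shell
# Hypothetical helper: print the oom_kill counter from a cgroup v2 memory.events file
# (prints 0 if the key is absent).
oom_kills() {
  awk '$1 == "oom_kill" { print $2; found = 1 } END { if (!found) print 0 }' "$1"
}

# Example against a saved copy of the counters shown above.
events=$(mktemp)
printf 'low 0\nhigh 0\nmax 1\noom 1\noom_kill 1\n' > "$events"
oom_kills "$events"    # prints 1
rm -f "$events"
```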

💡 Pro Tip

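--cpus=1.5 is enforced as a CFS bandwidth quota of 150000 µs per 100000 µs period (150% of one core). On cgroups v2 that is the single line 150000 100000 in cpu.max; on v1 it is cpu.cfs_quota_us = 150000 with cpu.cfs_period_us = 100000. For JVM containers, pair memory limits with -XX:MaxRAMPercentage=75.0.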
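The quota arithmetic can be sketched in a few lines. This assumes the default 100000 µs period and uses awk for the fractional multiplication; the helper name cpus_to_cpu_max is hypothetical.

```shell
# Hypothetical helper: convert a --cpus value into a cgroup v2 cpu.max line
# ("<quota_us> <period_us>"), assuming the default 100000 us period.
cpus_to_cpu_max() {
  awk -v cpus="$1" -v period=100000 \
    'BEGIN { printf "%d %d\n", int(cpus * period + 0.5), period }'
}

cpus_to_cpu_max 1.5    # prints: 150000 100000
cpus_to_cpu_max 0.5    # prints: 50000 100000
```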

Union filesystems

Image layers are read-only diffs stacked like git commits. The container adds a thin writable layer on top. OverlayFS merges them into one coherent filesystem view—the container believes it has a normal root filesystem.

OverlayFS (default storage driver)

OverlayFS has been in the mainline kernel since Linux 3.18, and Docker's overlay2 driver (which needs kernel 4.0 or newer) is the default storage driver on Linux. Four directories matter:

  • lowerdir: read-only image layers (in the mount option, colon-separated with the top-most layer first)
  • upperdir: container's read-write layer (all modifications land here)
  • workdir: OverlayFS internal scratch space (not visible to container)
  • merged: unified view mounted as container rootfs
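The four directories come together in the option string of an overlay mount. A minimal sketch that only builds that string, so it can be inspected without root; the helper name overlay_opts and the /tmp/ovl paths are hypothetical.

```shell
# Hypothetical helper: build the -o option string for an overlay mount.
#   $1 = upperdir, $2 = workdir, remaining args = lower layers, top-most first
overlay_opts() {
  upper=$1 work=$2
  shift 2
  lower=$(IFS=:; echo "$*")      # join the remaining args with ':'
  echo "lowerdir=$lower,upperdir=$upper,workdir=$work"
}

overlay_opts /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/layer2 /tmp/ovl/layer1
# prints: lowerdir=/tmp/ovl/layer2:/tmp/ovl/layer1,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work

# Mounting needs root, and upperdir and workdir must live on the same filesystem:
#   sudo mount -t overlay overlay \
#     -o "$(overlay_opts /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/layer2 /tmp/ovl/layer1)" \
#     /tmp/ovl/merged
```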

Copy-on-write (CoW)

Reads traverse the stack: upperdir first, then the lower layers from top to bottom until the file is found. The first write to a file that exists only in a lower layer copies the entire file into upperdir (copy-up), however small the change, and only then does the write proceed. This first-write penalty is why databases should use volumes, not the container layer.
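The lookup order can be imitated with plain directories. This is a simulation under simplifying assumptions, not kernel behavior: the helper name overlay_lookup is hypothetical, copy-up is emulated with cp, and any character device at the path is treated as a whiteout (deletion marker).

```shell
# Simulation only: report which layer would serve a path, checking layers in
# the order given (upperdir first, then lower layers from top to bottom).
#   $1 = path relative to the layer root, remaining args = layer directories
overlay_lookup() {
  path=$1
  shift
  for layer in "$@"; do
    if [ -c "$layer/$path" ]; then
      echo "hidden by whiteout in $layer"
      return 1
    elif [ -e "$layer/$path" ]; then
      echo "$layer"
      return 0
    fi
  done
  echo "not found"
  return 1
}

root=$(mktemp -d)
mkdir -p "$root/upper/app" "$root/lower/app"
echo "port: 80" > "$root/lower/app/config.yml"

overlay_lookup app/config.yml "$root/upper" "$root/lower"   # served from lower

# Emulate copy-up: the whole file lands in upperdir before it is modified.
cp "$root/lower/app/config.yml" "$root/upper/app/config.yml"
echo "port: 8080" > "$root/upper/app/config.yml"

overlay_lookup app/config.yml "$root/upper" "$root/lower"   # now served from upper
cat "$root/lower/app/config.yml"                            # lower copy is unchanged
rm -rf "$root"
```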

Layer sharing

Ten containers from the same image share identical lower layers on disk—only their upperdirs differ. This is why pulling nginx:alpine once benefits every subsequent container using that image.

bash
# Find the merged mount point for a running container
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' mycontainer

# See overlay mount on the host
mount | grep overlay

# Storage driver in use
docker info --format '{{.Driver}}'

Other storage drivers

Driver | Status | Notes
overlay2 | Default (recommended) | OverlayFS on an xfs (ftype=1) or ext4 backing filesystem; mature and fast for reads
devicemapper | Legacy (RHEL 7) | Thin provisioning; deprecated and removed from current Docker Engine releases, avoid for new deployments
btrfs / zfs | Niche | Native CoW filesystems; operational complexity limits adoption
vfs | Testing only | No CoW: full copy per layer; extremely slow, no layer sharing

⚖️ Trade-off

Container layer vs volume: container layer is convenient but ephemeral and CoW-expensive for heavy writes. Volumes bypass OverlayFS upperdir overhead and survive docker rm. Bind mounts add host path coupling but enable dev hot-reload.

OverlayFS walkthrough

Follow one file, /app/config.yml, through the layer stack to see where OverlayFS serves data from and when copy-on-write kicks in.

  • lowerdir (image layer 2): read-only · app binaries
  • lowerdir (image layer 1): read-only · base OS · holds /app/config.yml
  • upperdir (container layer): read-write · empty until first write
  • workdir: OverlayFS internal · not visible in container
  • merged (container view): unified filesystem mounted at / · /app/config.yml is visible here

What each operation does:

  • Read: file found in lowerdir, served without copy
  • Write: file copied to upperdir (CoW), then modified
  • Delete: whiteout marker in upperdir hides lower file

runc & the OCI runtime

The Open Container Initiative (OCI) defines portable specs for images and runtimes. runc is the reference implementation of the runtime spec: a small Go binary that reads a bundle and spawns an isolated process. Docker, containerd, Kubernetes, and Podman all converge on an OCI runtime like this one.

Container lifecycle

  1. runc create — set up namespaces, cgroups, rootfs; create container but don't start process
  2. runc start — execute the configured process (your ENTRYPOINT/CMD)
  3. Process runs — PID 1 inside container executes application logic
  4. runc delete — tear down cgroups, unmount, cleanup after exit
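The lifecycle steps above can be driven by hand. A minimal sketch, assuming root, runc on PATH, and a bundle directory that already holds rootfs/ and a config.json with "terminal": false (the default file written by runc spec sets it to true, which a detached create rejects without a console socket). The helper name runc_lifecycle and its arguments are hypothetical.

```shell
# Hypothetical helper: walk one container through the OCI lifecycle with runc.
#   $1 = bundle directory (contains rootfs/ and config.json), $2 = container id
runc_lifecycle() {
  bundle=$1
  id=$2
  runc create --bundle "$bundle" "$id" || return 1   # namespaces, cgroups, rootfs are set up
  runc state "$id"                                   # JSON status: "created"
  runc start "$id" || return 1                       # runs process.args from config.json
  runc state "$id"                                   # "running", or "stopped" if it exited already
  runc delete --force "$id"                          # kill if still running, then clean up
}

# Example (not run here):
#   runc_lifecycle /tmp/mybundle demo1
```

Docker never asks you to do this directly; containerd and its shim issue the equivalent calls for every container.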

The OCI bundle: config.json

An OCI bundle is a directory containing rootfs/ and config.json. The config specifies the process (args, env, cwd), mounts, Linux namespaces, cgroups path, capabilities, and seccomp profile.

json
{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": ["nginx", "-g", "daemon off;"],
    "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
    "cwd": "/"
  },
  "root": { "path": "rootfs", "readonly": false },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" },
      { "type": "ipc" },
      { "type": "uts" }
    ],
    "cgroupsPath": "/docker/abc123"
  }
}

Docker architecture layers

flowchart TB
  CLI["docker CLI"]
  D["dockerd\nREST API · images · networks"]
  C["containerd\npull · snapshot · shim"]
  S["containerd-shim\nstdio · exit status"]
  R["runc\nclone · cgroups · exec"]
  K["Linux kernel\nnamespaces · OverlayFS"]
  CLI --> D
  D --> C
  C --> S
  S --> R
  R --> K

containerd sits above runc—it manages image content, creates snapshots (OverlayFS mounts), and supervises shims. containerd-shim keeps the container alive after dockerd disconnects and reports exit codes. This separation is why Kubernetes can use containerd directly without dockerd.

🔬 Under the Hood

On a running host you can find OCI bundles under /run/containerd/io.containerd.runtime.v2.task/ (path varies by version). The bundle's config.json is exactly what runc consumed—inspect it to see real namespace and seccomp settings, not just Dockerfile intent.
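Real config.json files are large and usually minified, so a JSON-aware tool is the practical way to read them. A minimal sketch, assuming jq is installed; the helper name oci_namespaces is hypothetical, and the sample file is a small stand-in rather than a real bundle.

```shell
# Hypothetical helper: list the namespace types an OCI config.json requests.
oci_namespaces() {
  jq -r '.linux.namespaces[].type' "$1"
}

# Example against a stand-in file; skipped when jq is not available.
if command -v jq >/dev/null 2>&1; then
  cfg=$(mktemp)
  printf '%s\n' '{"linux":{"namespaces":[{"type":"pid"},{"type":"network"},{"type":"mount"}]}}' > "$cfg"
  oci_namespaces "$cfg"    # one type per line: pid, network, mount
  rm -f "$cfg"
fi
```

The same approach works for other fields, for example .process.capabilities or .linux.seccomp.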

📦 Real World

Docker created containerd and donated it to the CNCF in 2017. Kubernetes 1.24 removed the built-in dockershim, so nodes now use containerd or CRI-O directly. Images built from your Dockerfile run unchanged; only the node runtime differs.

🎯 Interview Tip

"Docker vs containerd vs runc?" Short answer: runc creates and runs a single container from an OCI bundle. containerd manages container lifecycle and images. dockerd adds developer UX (build, compose, bridge networks). Kubernetes needs only a CRI-compatible runtime, such as containerd with its built-in CRI plugin.

Rootless Docker

When the Docker daemon runs as root, a container escape or daemon compromise can yield root on the host. Rootless mode runs both the daemon and its containers inside a user namespace owned by an unprivileged user, which dramatically shrinks the blast radius.

Why rootless matters

  • Docker daemon compromise ≠ immediate host root access
  • Developers can run containers without sudo on shared machines
  • Aligns with least-privilege and CIS hardening guidance

How it works

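A user namespace maps UIDs inside the container to different UIDs on the host. In rootless mode, container root (UID 0) maps to the unprivileged user who started the daemon, and the remaining container UIDs map into that user's subordinate range (e.g. starting at 100000). Processes inside the container believe they are root; the host kernel sees them as a normal user. The subordinate ranges are defined in /etc/subuid and /etc/subgid.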
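The kernel exposes the active mapping in /proc/<pid>/uid_map as lines of the form first-id-inside, first-id-outside, count. A minimal sketch of applying such a map; the helper name map_uid is hypothetical, and both sample maps (host UID 1000, subordinate range at 100000) are illustrative values rather than output from a real host.

```shell
# Hypothetical helper: translate a container UID to a host UID using lines in
# /proc/<pid>/uid_map format: <first-id-inside> <first-id-outside> <count>.
# Reads the map on stdin; prints "unmapped" if no range covers the UID.
map_uid() {
  awk -v uid="$1" '
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
    END { if (!found) print "unmapped" }
  '
}

# Single-range map: container root lands at the start of a subordinate range.
remap_map='0 100000 65536'
printf '%s\n' "$remap_map" | map_uid 0        # prints 100000

# Two-range rootless layout: container root is the invoking user (UID 1000 here),
# container UIDs 1..65536 come from the subordinate range.
rootless_map='0 1000 1
1 100000 65536'
printf '%s\n' "$rootless_map" | map_uid 0     # prints 1000
printf '%s\n' "$rootless_map" | map_uid 33    # prints 100032
```

This is why files written by a container user to a bind mount can show up on the host under an unfamiliar high UID.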

Limitations

Limitation | Reason | Workaround
Cannot bind ports < 1024 | Unprivileged users cannot bind privileged ports | Use ports ≥ 1024, or lower the threshold with sysctl net.ipv4.ip_unprivileged_port_start=80
Some storage drivers unsupported | Requires root for device setup | Use overlay2 (kernel 5.11+) or fuse-overlayfs
AppArmor/SELinux edge cases | Profile loading may need root | Use default profiles; test before prod
Performance overhead | Additional namespace mapping layers | Usually negligible for app workloads

Podman: rootless-first alternative

Podman runs daemonless: each podman run starts a conmon monitor process that launches an OCI runtime (crun or runc) under your user session, with no containerd or central daemon involved. Rootless is the default. The podman-docker package provides a docker CLI alias. RHEL 8+ ships Podman in place of Docker Engine, and it is the standard tool in OpenShift developer environments.

Aspect | Docker (rootful) | Podman (rootless default)
Daemon | Central dockerd (root) | None; fork/exec per container
Rootless | Opt-in setup | Default
Compose | docker compose | podman compose (compatible)
Kubernetes | dockershim removed in 1.24; dockerd not needed on nodes | Kubernetes YAML via podman generate kube
OCI images | Fully compatible | Fully compatible; same registries

🔒 Security

Even in rootless mode, never mount sensitive host paths (/etc, /var/run/docker.sock) into containers. A malicious container can still read anything the mapped host user can access.

⚖️ Trade-off

Docker vs Podman vs containerd: Docker wins on developer ergonomics and ecosystem tooling. Podman wins on rootless security and daemonless ops. containerd wins as the shared production runtime under Kubernetes. Architects often standardize on OCI images + one builder, with runtime choice per environment.