Container Internals & Linux Primitives

Linux namespaces

Namespaces isolate a process's view of system resources. The process still shares the host kernel, but believes it has its own PID tree, network stack, mount table, and hostname. Think of namespaces as passports—each container carries credentials that say "you live in world X," even though many worlds share one physical machine.

What namespaces do

Without namespaces, ps aux inside a container would list every process on the host. With PID namespaces, the container sees only its own process tree. NET namespaces give each container its own eth0, routing table, and iptables rules. MNT namespaces control which filesystem paths are visible—this is how the container's root filesystem differs from the host's.

The eight namespace types (Linux 5.6+)

Namespace	Isolates	Docker relevance	Pitfall
PID	Process tree; PID 1 inside container	Container init = PID 1; orphan reaping responsibility	Host PIDs visible with --pid=host
NET	Interfaces, routes, iptables, sockets	Bridge networking, port publishing via NAT	--net=host shares host stack—no isolation
MNT	Filesystem mount points	Container rootfs, volume/bind mounts	Bind mounts can leak host paths into container view
UTS	Hostname and NIS domain name	--hostname flag	Cosmetic only—does not affect DNS resolution by itself
IPC	Shared memory, semaphores, message queues	Isolates /dev/shm between containers	--ipc=host needed for some legacy SHM apps
USER	UID/GID mapping (rootless containers)	Container root maps to unprivileged host user	File ownership on bind mounts must match mapped IDs
CGROUP	View of cgroup hierarchy	Container sees its own cgroup subtree (v2)	Misconfigured delegation breaks resource limits
TIME (5.6+)	CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets	Niche: checkpoint/restore and migration, uptime-sensitive tests	Wall-clock time (CLOCK_REALTIME) is not virtualized; setting offsets requires CAP_SYS_TIME

How Docker creates namespaces

runc calls the clone() syscall (or unshare() for namespace-only changes) with CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC flags. The new process enters fresh namespaces before executing your container entrypoint.
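
To make the flag list concrete, here is a small sketch of the bitmask those five flags add up to. The constants are the values from <linux/sched.h>; clone_mask is a hypothetical helper name used only for illustration, and the sketch does the arithmetic without creating any namespaces.

```shell
# Flag values as defined in <linux/sched.h>.
CLONE_NEWNS=0x00020000    # mount namespace
CLONE_NEWUTS=0x04000000   # hostname / NIS domain name
CLONE_NEWIPC=0x08000000   # System V IPC, POSIX message queues
CLONE_NEWPID=0x20000000   # process IDs
CLONE_NEWNET=0x40000000   # network stack

# Print the combined bitmask for the five namespaces listed above.
clone_mask() {
  printf '0x%08x\n' "$(( $CLONE_NEWPID | $CLONE_NEWNET | $CLONE_NEWNS | $CLONE_NEWUTS | $CLONE_NEWIPC ))"
}

clone_mask   # prints 0x6c020000
```

The same set of namespaces can be requested from a shell with unshare(1) from util-linux, for example sudo unshare --pid --net --mount --uts --ipc --fork sh. That command needs root and is separate from the sketch above.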

# Run a container and capture its PID on the host
CID=$(docker run -d nginx:alpine)
PID=$(docker inspect --format '{{.State.Pid}}' "$CID")

# List namespace links for the container process
# (root is needed to read the ns entries of another user's process)
sudo ls -la "/proc/$PID/ns/"

# Compare host init namespace vs container (different inode numbers = isolated)
sudo readlink /proc/1/ns/pid
sudo readlink "/proc/$PID/ns/pid"

🔬 Under the Hood

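Each entry in /proc/<pid>/ns/ is a symlink whose target, such as net:[4026531840], names the namespace type and its inode number. Two processes whose net links show the same inode number are in the same network namespace, which is useful when debugging sidecar patterns or docker run --network=container:<name>. An open file descriptor or bind mount on one of these entries keeps the namespace alive after its last process exits.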
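
The inode comparison can be wrapped in a small helper. This is a minimal sketch: same_ns is a hypothetical function name, and it assumes a Linux host with /proc mounted and permission to read both processes' ns entries.

```shell
# Succeed (exit 0) when both PIDs are in the same namespace of the given type.
# Usage: same_ns <pid-a> <pid-b> <type>   (type: pid, net, mnt, uts, ipc, user, cgroup, time)
same_ns() {
  a=$(readlink "/proc/$1/ns/$3") || return 2   # target looks like net:[<inode>]
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$a" = "$b" ]
}

# Example with hypothetical container names (needs root for other users' processes):
#   same_ns "$(docker inspect -f '{{.State.Pid}}' app)" \
#           "$(docker inspect -f '{{.State.Pid}}' sidecar)" net && echo shared
```

A return status of 2 means one of the links could not be read, which keeps "different namespace" distinguishable from "no permission or no such process".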

🎯 Interview Tip

"What's the difference between a container and a VM?" — VMs virtualize hardware and run a guest kernel. Containers virtualize the process view via namespaces on a shared kernel. Container escape = breaking out of namespace/cgroup/seccomp confinement to affect the host.

cgroups (control groups)

If namespaces are passports, cgroups are resource budgets. They limit, account for, and isolate CPU, memory, I/O, and process count. Without cgroups, a container could fork-bomb the host or consume all RAM.

cgroups v1 vs cgroups v2

Aspect	cgroups v1 (legacy)	cgroups v2 (unified hierarchy)
Hierarchy	Multiple trees per controller (cpu, memory, …)	Single unified tree at /sys/fs/cgroup
Default on	RHEL 7/8, Ubuntu 20.04 and older	Fedora 31+, Ubuntu 21.10+, Debian 11+, RHEL 9 (supported by Docker since 20.10)
Resource delegation	Complex edge cases with systemd	Cleaner delegation for rootless containers
Memory + swap	memory.limit_in_bytes plus memory.memsw.limit_in_bytes (RAM + swap combined)	memory.max plus memory.swap.max (swap only)

Resource controllers

Controller	Purpose	Docker flag	Kernel file (v2)
cpu	CPU bandwidth limit	--cpus=1.5	cpu.max (quota/period)
cpu.weight	Relative CPU share when contended	--cpu-shares=512	cpu.weight
memory	Hard memory limit	--memory=512m	memory.max
memory.swap	Swap limit	--memory-swap (RAM + swap total; the runtime subtracts --memory)	memory.swap.max (swap only)
io	Block I/O weight and limits	--device-read-bps	io.max
pids	Max number of processes	--pids-limit=100	pids.max

What happens when memory limit is exceeded

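When a cgroup exceeds memory.max and the kernel cannot reclaim enough pages, the OOM killer selects a process in that cgroup and sends SIGKILL. Docker reports this as OOMKilled in docker inspect. The victim is not necessarily PID 1, but if PID 1 is killed the whole container exits. JVMs that predate container support (before JDK 10 and 8u191, where -XX:+UseContainerSupport became the default) size their heap from host memory, ignore the cgroup limit, and get OOM-killed unexpectedly.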

$ docker run --name oom-demo --memory 32m progrium/stress --vm 1 --vm-bytes 64M
stress: FAIL: cannot allocate 67108864 bytes
Killed

$ docker inspect --format '{{.State.OOMKilled}}' oom-demo
true

$ cat /sys/fs/cgroup/system.slice/docker-*.scope/memory.events
low 0
high 0
max 1
oom 1
oom_kill 1
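
A minimal sketch for pulling the kill counter out of memory.events, assuming cgroups v2 and the key/value layout shown above. oom_kills is a hypothetical helper name, and the cgroup path on a given host may differ from the one in the transcript.

```shell
# Print the oom_kill counter from a cgroup v2 memory.events file (0 if the key is absent).
# Usage: oom_kills <path-to-memory.events>
oom_kills() {
  awk '$1 == "oom_kill" { print $2; found = 1 }
       END { if (!found) print 0 }' "$1"
}
```

A non-zero value means at least one process in that cgroup was killed for exceeding its limit, which is handy for alerting when the container itself survived because PID 1 was not the victim.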

⚠️ Pitfall

Unbounded container resources in a shared environment is a production incident waiting to happen. One memory leak takes down the node. Always set --memory and --cpus—even in Docker Compose via the deploy.resources block.

💡 Pro Tip

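--cpus=1.5 means 150% of one core per scheduling period. On cgroups v1 this is the CFS quota: cpu.cfs_quota_us = 150000 with cpu.cfs_period_us = 100000. On cgroups v2 the same pair is written to cpu.max as "150000 100000". For JVM containers, pair memory limits with -XX:MaxRAMPercentage=75.0.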
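
The arithmetic behind --cpus can be sketched in a few lines. The helper below is hypothetical (cpus_to_cpu_max is not a Docker or kernel tool) and assumes the default period of 100000 microseconds. It prints the quota and period in the order cgroups v2 uses for cpu.max.

```shell
# Convert a --cpus value to "quota period" in microseconds.
# Usage: cpus_to_cpu_max <cpus> [period_us]
cpus_to_cpu_max() {
  awk -v cpus="$1" -v period="${2:-100000}" \
    'BEGIN { printf "%d %d\n", cpus * period + 0.5, period }'   # +0.5 rounds to nearest
}

cpus_to_cpu_max 1.5   # prints: 150000 100000
```

The quota is the CPU time the cgroup may consume in each period, so a quota larger than the period only helps workloads that can run on more than one core at once.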

Union filesystems

Image layers are read-only diffs stacked like git commits. The container adds a thin writable layer on top. OverlayFS merges them into one coherent filesystem view—the container believes it has a normal root filesystem.

OverlayFS (default storage driver)

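OverlayFS was merged into the mainline kernel in Linux 3.18. Docker's overlay2 driver, which needs kernel 4.0 or later for multiple lower layers, is the default on current Linux installations. Four directories matter: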

lowerdir — read-only image layers (ordered bottom to top)
upperdir — container's read-write layer (all modifications land here)
workdir — OverlayFS internal scratch space (not visible to container)
merged — unified view mounted as container rootfs
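
These four directories correspond to the options of an overlay mount. The sketch below only builds the option string; overlay_opts is a hypothetical helper and the paths in the usage note are placeholders. One detail worth noting: in the lowerdir option the layers are colon-separated with the topmost layer first.

```shell
# Build the -o option string for an overlay mount.
# Usage: overlay_opts <upperdir> <workdir> <top-lower> [<next-lower> ...]
overlay_opts() {
  upper=$1 work=$2
  shift 2
  lowers=$(IFS=:; printf '%s' "$*")   # join the remaining arguments with ':'
  printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lowers" "$upper" "$work"
}
```

With root privileges the result could be passed to mount, for example sudo mount -t overlay overlay -o "$(overlay_opts /c/diff /c/work /img/layer2 /img/layer1)" /c/merged. The upperdir and workdir must live on the same filesystem.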

Copy-on-write (CoW)

Reads traverse the stack—upperdir first, then lower layers until the file is found. Writes to a file that exists only in a lower layer trigger a copy into upperdir first (CoW), then the write proceeds. This first-write penalty is why databases should use volumes, not the container layer.
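
The lookup order and the copy-up step can be imitated with ordinary directories. This is a toy model under stated assumptions, not how the kernel implements OverlayFS: overlay_find and overlay_append are hypothetical helpers, no overlay mount is involved, and whiteouts are left out.

```shell
# Toy model of OverlayFS lookups using plain directories.
# Layers are passed upper first, then lower layers from top to bottom.

# Print the path of the first layer that holds <file>; fail if none does.
overlay_find() {
  file=$1; shift
  for layer in "$@"; do
    if [ -e "$layer/$file" ]; then
      printf '%s\n' "$layer/$file"
      return 0
    fi
  done
  return 1
}

# Append <text> to <file>. If the file exists only in a lower layer, copy it
# into <upper> first (copy-up). Lower layers are never modified.
# Usage: overlay_append <file> <text> <upper> <lower>...
overlay_append() {
  file=$1 text=$2 upper=$3; shift 3
  if [ ! -e "$upper/$file" ]; then
    src=$(overlay_find "$file" "$@") || src=
    mkdir -p "$(dirname "$upper/$file")"
    if [ -n "$src" ]; then cp -p "$src" "$upper/$file"; fi   # whole-file copy
  fi
  printf '%s\n' "$text" >> "$upper/$file"
}
```

The copy is of the whole file even when only one line changes, which is where the first-write penalty for large database files comes from.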

Layer sharing

Ten containers from the same image share identical lower layers on disk—only their upperdirs differ. This is why pulling nginx:alpine once benefits every subsequent container using that image.
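
One way to check the sharing claim, assuming the overlay2 driver, where docker inspect exposes a colon-separated GraphDriver.Data.LowerDir alongside MergedDir. shared_lowers is a hypothetical helper that prints the entries two such strings have in common; web1 and web2 are placeholder container names.

```shell
# Print the lower-layer paths that appear in both colon-separated lists.
# Usage: shared_lowers <lowerdir-list-a> <lowerdir-list-b>
shared_lowers() {
  printf '%s\n%s\n' "$1" "$2" | awk -F: '
    NR == 1 { for (i = 1; i <= NF; i++) seen[$i] = 1 }
    NR == 2 { for (i = 1; i <= NF; i++) if ($i in seen) print $i }'
}

# Example (requires two running containers created from the same image):
#   a=$(docker inspect --format '{{.GraphDriver.Data.LowerDir}}' web1)
#   b=$(docker inspect --format '{{.GraphDriver.Data.LowerDir}}' web2)
#   shared_lowers "$a" "$b"
```

Each container typically has its own init layer at the top of the list, so that entry is expected to differ while the image layers beneath it match.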

# Find the merged mount point for a running container
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' mycontainer

# See overlay mount on the host
mount | grep overlay

# Storage driver in use
docker info --format '{{.Driver}}'

Other storage drivers

Driver	Status	Notes
overlay2	Default (recommended)	OverlayFS, xfs/ext4 backing, mature and fast for reads
devicemapper	Legacy (RHEL 7)	Thin provisioning; deprecated in Docker 20.10 and removed in Docker Engine 25.0
btrfs / zfs	Niche	Native CoW filesystems; operational complexity limits adoption
vfs	Testing only	No CoW—full copy per layer; extremely slow, no layer sharing

⚖️ Trade-off

Container layer vs volume: container layer is convenient but ephemeral and CoW-expensive for heavy writes. Volumes bypass OverlayFS upperdir overhead and survive docker rm. Bind mounts add host path coupling but enable dev hot-reload.

OverlayFS walkthrough

Example: /app/config.yml exists only in image layer 1, and the container has not written anything yet.

lowerdir (image layer 2): read-only, app binaries
lowerdir (image layer 1): read-only, base OS, holds /app/config.yml
upperdir (container layer): read-write, empty until first write
workdir: OverlayFS internal, not visible in container
merged (container view): unified filesystem mounted at /, where /app/config.yml is visible

Read: file found in lowerdir, served without copy
Write: file copied to upperdir (CoW), then modified
Delete: whiteout marker in upperdir hides lower file

runc & the OCI runtime

The Open Container Initiative (OCI) defines portable specs for container runtimes, images, and distribution. runc is the reference implementation of the runtime spec: a small Go binary that reads a bundle and spawns an isolated process. Docker, containerd, Kubernetes, and Podman all converge on these specs.

Container lifecycle

runc create — set up namespaces, cgroups, rootfs; create container but don't start process
runc start — execute the configured process (your ENTRYPOINT/CMD)
Process runs — PID 1 inside container executes application logic
runc delete — tear down cgroups, unmount, cleanup after exit
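
The lifecycle above can be read as a small state machine. The sketch below is a toy model, not runc itself: the state names created, running, and stopped follow the OCI runtime spec, while next_state, the none state, and the exit event are labels invented for illustration.

```shell
# Print the state that follows <state> after <event>; fail on an invalid step.
# Events: create, start, exit (the container process ends), delete.
next_state() {
  case "$1:$2" in
    none:create)    echo created ;;   # runc create
    created:start)  echo running ;;   # runc start
    running:exit)   echo stopped ;;   # PID 1 inside the container exits
    stopped:delete) echo none ;;      # runc delete
    *) return 1 ;;
  esac
}
```

In this model, deleting a running container is rejected, which mirrors runc delete refusing to remove a container that is not stopped unless it is forced.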

The OCI bundle: config.json

An OCI bundle is a directory containing rootfs/ and config.json. The config specifies the process (args, env, cwd), mounts, Linux namespaces, cgroups path, capabilities, and seccomp profile.

{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": ["nginx", "-g", "daemon off;"],
    "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
    "cwd": "/"
  },
  "root": { "path": "rootfs", "readonly": false },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" },
      { "type": "ipc" },
      { "type": "uts" }
    ],
    "cgroupsPath": "/docker/abc123"
  }
}
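
To read the namespace list back out of a bundle, a jq one-liner is usually enough. This sketch assumes jq is installed and that the file follows the layout shown above; list_namespaces is a hypothetical helper name.

```shell
# List the namespace types requested in an OCI config.json (requires jq).
# Usage: list_namespaces <path-to-config.json>
list_namespaces() {
  jq -r '.linux.namespaces[].type' "$1"
}
```

Run against the example above, it would print pid, network, mount, ipc, and uts on separate lines. A namespace type missing from that list means the container shares that namespace with the process that launched the runtime.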

Docker architecture layers

flowchart TB
  CLI["docker CLI"]
  D["dockerd\nREST API · images · networks"]
  C["containerd\npull · snapshot · shim"]
  S["containerd-shim\nstdio · exit status"]
  R["runc\nclone · cgroups · exec"]
  K["Linux kernel\nnamespaces · OverlayFS"]
  CLI --> D
  D --> C
  C --> S
  S --> R
  R --> K

containerd sits above runc—it manages image content, creates snapshots (OverlayFS mounts), and supervises shims. containerd-shim keeps the container alive after dockerd disconnects and reports exit codes. This separation is why Kubernetes can use containerd directly without dockerd.

🔬 Under the Hood

On a running host you can find OCI bundles under /run/containerd/io.containerd.runtime.v2.task/ (path varies by version). The bundle's config.json is exactly what runc consumed—inspect it to see real namespace and seccomp settings, not just Dockerfile intent.

📦 Real World

Docker created containerd and donated it to the CNCF in 2017. Kubernetes 1.24+ removed the built-in dockershim, so nodes use containerd or CRI-O directly. Your Dockerfile is unchanged; only the node runtime differs.

🎯 Interview Tip

"Docker vs containerd vs runc?" runc spawns one container. containerd manages container lifecycle and images. dockerd adds developer UX (build, compose, bridge networks). Kubernetes needs only a CRI-compatible runtime such as containerd or CRI-O.

Rootless Docker

Running the Docker daemon as root means a container escape can become root on the host. Rootless mode runs the daemon and its containers inside a user namespace, so container UID 0 corresponds to an unprivileged host user, which dramatically shrinks the blast radius.

Why rootless matters

Docker daemon compromise ≠ immediate host root access
Developers can run containers without sudo on shared machines
Aligns with least-privilege and CIS hardening guidance

How it works

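A user namespace maps UIDs inside the container to unprivileged UIDs on the host. In rootless mode, container root (UID 0) maps to the user running the daemon, and the remaining container UIDs map to a block of high host UIDs (e.g. starting at 100000) taken from that user's subordinate range. With the daemon's userns-remap option, container root itself maps to the start of such a range. Processes inside the container believe they are root; the host kernel sees them as a normal user. The subordinate ranges are defined in /etc/subuid and /etc/subgid.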
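
The kernel exposes the active mapping in /proc/<pid>/uid_map, one range per line: first UID inside, first UID outside, and length. The sketch below translates a container UID using such a file. map_uid is a hypothetical helper, and the sample values in the comments (host user 1000, subordinate range starting at 100000) are assumptions for illustration.

```shell
# Translate a container UID to a host UID using uid_map-style lines:
#   <first-uid-inside> <first-uid-outside> <count>
# Usage: map_uid <container-uid> <uid_map-file>
map_uid() {
  awk -v uid="$1" '
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
    END { if (!found) exit 1 }' "$2"
}

# With a map containing the two lines
#   0 1000 1
#   1 100000 65536
# map_uid 0 prints 1000 and map_uid 33 prints 100032.
```

A UID that falls outside every range has no host identity, so the helper exits with status 1 and prints nothing. This is the same reason files owned by unmapped IDs on bind mounts look wrong from inside the container.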

Limitations

Limitation	Reason	Workaround
Cannot bind ports < 1024	Unprivileged users cannot bind privileged ports	Use ports ≥ 1024, or sysctl net.ipv4.ip_unprivileged_port_start=80
Some storage drivers unsupported	Requires root for device setup	Use overlay2 (kernel 5.11+) or fuse-overlayfs
AppArmor/SELinux edge cases	Profile loading may need root	Use default profiles; test before prod
Performance overhead	Additional namespace mapping layers	Usually negligible for app workloads

Podman: rootless-first alternative

Podman is daemonless: each podman run forks and execs an OCI runtime (crun or runc) directly under your user session, with a small conmon process monitoring each container. containerd is not involved. Rootless is the default. The podman-docker package provides a docker CLI alias. RHEL 8 and later ship Podman instead of Docker Engine, and it is common in OpenShift developer environments.

Aspect	Docker (rootful)	Podman (rootless default)
Daemon	Central dockerd (root)	None—fork/exec per container
Rootless	Opt-in setup	Default
Compose	docker compose	podman compose (compatible)
Kubernetes	Not a node runtime since dockershim was removed (1.24)	Pod YAML via podman kube generate (formerly podman generate kube)
OCI images	Fully compatible	Fully compatible—same registries

🔒 Security

Even in rootless mode, never mount sensitive host paths (/etc, /var/run/docker.sock) into containers. A malicious container can still read anything the mapped host user can access.

⚖️ Trade-off

Docker vs Podman vs containerd: Docker wins on developer ergonomics and ecosystem tooling. Podman wins on rootless security and daemonless ops. containerd wins as the shared production runtime under Kubernetes. Architects often standardize on OCI images + one builder, with runtime choice per environment.