Container Security

Container security threat model

Before hardening flags, name what you are defending against. Containers are process isolation on a shared kernel: attackers target escape paths, privilege expansion, poisoned images, and lateral movement across flat networks.

Attack surface map

flowchart TB
  subgraph external [External threats]
    REG[Compromised registry image]
    API[Exposed container API]
  end
  subgraph container [Container boundary]
    APP[Application vulnerability]
    ESC[Kernel escape / breakout]
    PRIV[Privilege escalation inside container]
  end
  subgraph host [Host impact]
    SOCK[docker.sock access]
    HOST[Host filesystem / processes]
    LAT[Lateral movement to other containers]
  end
  REG --> APP
  API --> APP
  APP --> PRIV
  PRIV --> ESC
  ESC --> HOST
  SOCK --> HOST
  HOST --> LAT

Six threat categories

Threat	What happens	Example vector	Primary control
Container escape	Process breaks namespace/cgroup/seccomp confinement and acts on the host	Kernel CVE + --privileged container	Drop capabilities, default seccomp, patch kernel, no privileged mode
Privilege escalation	UID 0 or excessive capabilities inside container enable host-adjacent actions	Writable /etc/passwd, setuid binaries, CAP_SYS_ADMIN	Non-root USER, --cap-drop ALL, read-only rootfs
Supply chain	Malicious or vulnerable code enters via base image, dependency, or build pipeline	Typosquatted image on Docker Hub, compromised maintainer	Digest pinning, signing, SBOM, CI scanning, private registry
Secrets exposure	Credentials persist in image layers, env vars, logs, or inspect output	ENV DB_PASSWORD=... in Dockerfile	Runtime secret mounts, Vault/SM, BuildKit secrets, never ARG/ENV
Resource abuse	Container consumes all CPU/RAM/PIDs and destabilizes co-located workloads	Fork bomb, memory leak without limits	--memory, --cpus, --pids-limit
Lateral movement	Compromised container reaches other services or the Docker API	Flat bridge network + mounted docker.sock	Network segmentation, ICC disabled, no socket mounts, mTLS between services

Trust boundaries

Image → runtime — treat every pulled image as untrusted until scanned and signed
Container → host — assume kernel bugs exist; minimize capabilities that amplify exploits
Container → container — default bridge allows east-west traffic; segment explicitly
Build → registry — CI runners are high-value targets; isolate builders, rotate credentials

🔒 Security

Mounting /var/run/docker.sock equals giving the container root on the host. Any code inside can spawn privileged siblings, read all containers, and exfiltrate secrets from volumes. Ban this pattern in production—use dedicated APIs or operators instead.
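A socket-mount ban is easier to enforce when it is checked mechanically. The sketch below scans `docker inspect` JSON for bind mounts whose source ends in docker.sock. The sample data and the function name are hypothetical; only the `Name` and `Mounts[].Source` fields are taken from real inspect output.

```python
import json


def docker_sock_mounts(inspect_json: str) -> list[str]:
    """Return names of containers that mount the Docker socket."""
    flagged = []
    for container in json.loads(inspect_json):
        for mount in container.get("Mounts") or []:
            if mount.get("Source", "").endswith("docker.sock"):
                flagged.append(container.get("Name", "<unnamed>"))
                break  # one finding per container is enough
    return flagged


# Hypothetical, trimmed sample shaped like `docker inspect` output.
SAMPLE = json.dumps([
    {"Name": "/ci-agent", "Mounts": [
        {"Type": "bind", "Source": "/var/run/docker.sock",
         "Destination": "/var/run/docker.sock"}]},
    {"Name": "/api", "Mounts": [
        {"Type": "volume", "Source": "/var/lib/docker/volumes/data/_data",
         "Destination": "/data"}]},
])

print(docker_sock_mounts(SAMPLE))
```

In practice the input would come from something like `docker inspect $(docker ps -q)`.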

🎯 Interview Tip

"How is container security different from VM security?" — VMs have a hypervisor boundary and separate kernels. Containers rely on kernel primitives (namespaces, cgroups, capabilities, seccomp, LSM). Defense in depth means assuming any single layer can fail.

Linux capabilities

Traditional Unix root is all-or-nothing. Linux capabilities split root powers into granular privileges. Docker drops most capabilities by default—but the remaining set is still dangerous if an app is compromised.

Default: what Docker drops vs keeps

Docker starts from the full capability set, then drops everything not in the allowlist. The default container retains only these capabilities (14 on most kernels):

Kept by default	Risk if abused
CAP_CHOWN, CAP_FOWNER	Change file ownership inside container—useful for volume permission fixes
CAP_DAC_OVERRIDE	Bypass file permission checks—dangerous on writable mounts
CAP_SETUID, CAP_SETGID, CAP_SETPCAP	Enable privilege escalation via setuid binaries
CAP_NET_BIND_SERVICE	Bind ports < 1024 without root—needed for nginx on port 80
CAP_NET_RAW	Raw sockets—enables packet sniffing and ARP spoofing inside network namespace
CAP_KILL	Send signals to any process in the same PID namespace
CAP_SYS_CHROOT	Change root directory—needed by some legacy installers
CAP_AUDIT_WRITE, CAP_MKNOD, CAP_FSETID, CAP_SETFCAP	Lower risk in typical app containers; still unnecessary for many workloads

Dropped by default (never available unless explicitly added):

CAP_SYS_ADMIN — mount filesystems, many namespace operations, and a long tail of other administrative calls; primary enabler of container escapes
CAP_NET_ADMIN — configure interfaces, iptables, routing tables
CAP_SYS_PTRACE — debug/trace any process—read memory of other containers on shared PID namespace
CAP_SYS_MODULE, CAP_BPF, CAP_PERFMON — kernel introspection and modification

Production pattern: drop all, add minimum

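# Web server binding port 80 as non-root
# (the official nginx image also writes a PID file and cache, so those paths need tmpfs)
docker run -d \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --user 1000:1000 \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --tmpfs /var/run:rw,noexec,nosuid,size=1m \
  --tmpfs /var/cache/nginx:rw,noexec,nosuid,size=64m \
  -p 8080:80 \
  nginx:alpine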

# Inspect the configured drop/add lists (not the live effective set)
docker inspect --format '{{.HostConfig.CapDrop}} {{.HostConfig.CapAdd}}' mycontainer
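The CapDrop and CapAdd lists only make sense relative to the default set. As a rough model, the sketch below computes the capability set a container ends up with. It assumes the 14-capability default listed earlier and that an explicit add wins over a drop; `resulting_caps` is a hypothetical helper, and the `--cap-add ALL` case is deliberately left out.

```python
DEFAULT_CAPS = frozenset({
    "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL", "SETGID", "SETUID",
    "SETPCAP", "NET_BIND_SERVICE", "NET_RAW", "SYS_CHROOT", "MKNOD",
    "AUDIT_WRITE", "SETFCAP",
})


def _norm(name: str) -> str:
    """Accept both NET_RAW and CAP_NET_RAW spellings."""
    name = name.upper()
    return name[4:] if name.startswith("CAP_") else name


def resulting_caps(cap_add, cap_drop) -> list[str]:
    add = {_norm(c) for c in cap_add or []}
    drop = {_norm(c) for c in cap_drop or []}
    if "ALL" in add:
        raise ValueError("--cap-add ALL is not modelled in this sketch")
    # Dropping ALL starts from an empty set; otherwise start from the default.
    base = set() if "ALL" in drop else set(DEFAULT_CAPS) - drop
    return sorted(base | add)


print(resulting_caps(["NET_BIND_SERVICE"], ["ALL"]))
print(len(resulting_caps(None, ["NET_RAW"])))
```

A CI check could compare this result against a per-service allowlist and fail when anything extra appears.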

The --privileged flag

--privileged grants all capabilities, disables seccomp and AppArmor confinement, and gives access to all host devices. It exists for Docker-in-Docker and hardware debugging, not production apps. The CIS Docker Benchmark recommends against running privileged containers.

Inspect capabilities with capsh

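# Assumes an image that ships capsh (part of the libcap tools); the stock alpine image does not include it
$ docker run --rm alpine capsh --print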
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
         cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
         cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
Bounding set =cap_chown,cap_dac_override,...

$ docker run --rm --cap-drop ALL --cap-add NET_BIND_SERVICE alpine capsh --print
Current: cap_net_bind_service=ep
Bounding set =cap_net_bind_service
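When capsh is not available in an image, the same information can be read from the CapEff line of /proc/<pid>/status, which is a hexadecimal bitmask. The sketch below decodes such a mask using the kernel's capability numbering (bit 0 is cap_chown, bit 40 is cap_checkpoint_restore). `decode_caps` is a hypothetical helper, and bits beyond the table are ignored.

```python
# Capability names in kernel bit order (bit 0 first).
CAP_NAMES = """chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock
ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin sys_boot
sys_nice sys_resource sys_time sys_tty_config mknod lease audit_write audit_control
setfcap mac_override mac_admin syslog wake_alarm block_suspend audit_read perfmon
bpf checkpoint_restore""".split()


def decode_caps(mask_hex: str) -> list[str]:
    """Translate a CapEff/CapBnd hex mask into capability names."""
    mask = int(mask_hex, 16)
    return [f"cap_{name}" for bit, name in enumerate(CAP_NAMES)
            if (mask >> bit) & 1]


# Mask corresponding to the 14 default capabilities listed above.
print(decode_caps("00000000a80425fb"))
# Mask for a container started with --cap-drop ALL --cap-add NET_BIND_SERVICE.
print(decode_caps("0000000000000400"))
```

The mask itself can be obtained with `grep CapEff /proc/1/status` inside the container.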

🔒 Security

CAP_NET_RAW is rarely needed in production. Drop it unless your app genuinely requires raw sockets (e.g. ping utilities). Combined with a bridge network, it enables traffic interception between co-located containers.

⚠️ Pitfall

Adding --cap-add SYS_ADMIN to "fix" mount issues is an anti-pattern. It is equivalent to handing an attacker the keys to the host kernel. Fix permissions with volumes, init containers, or proper USER/chown instead.

💡 Pro Tip

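In Kubernetes, mirror Docker's cap-drop pattern with securityContext.capabilities.drop: [ALL] and an explicit add list. The restricted Pod Security Standard requires drop: [ALL] and permits adding back only NET_BIND_SERVICE; apply it per namespace through Pod Security Admission labels.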

Seccomp profiles

Seccomp (secure computing mode) filters syscalls at the kernel boundary. Even with dropped capabilities, dangerous syscalls like mount, ptrace, or keyctl can enable escapes. Docker's default profile blocks ~44 syscalls.

Default Docker seccomp profile

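Unless overridden, Docker applies its built-in default seccomp profile, which is compiled into dockerd (the source lives in the Moby repository as profiles/seccomp/default.json). The profile is an allowlist: its defaultAction is SCMP_ACT_ERRNO and only the syscalls it lists are allowed, so dangerous calls are denied while normal application behavior keeps working.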

Blocked syscall (sample)	Why it is blocked
mount, umount2, pivot_root	Filesystem manipulation—escape via mounting host paths
ptrace	Debug other processes—read secrets from co-located workloads
reboot, kexec_load	Host lifecycle control
swapon, swapoff	Modify host swap configuration
init_module, finit_module	Load kernel modules
acct, settimeofday	System accounting and clock changes
clone with unsafe flags	Prevent namespace creation that bypasses isolation

The full default profile blocks approximately 44 syscalls (plus flag-conditioned rules for clone, socket, and arch_prctl). Architectures may vary slightly.

Custom seccomp JSON

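When an application legitimately needs a blocked syscall (rare), copy the default profile and add an explicit allow rule to it. Docker profiles have no inheritance mechanism, and an allowlist that starts from empty blocks the calls a container needs just to start. The fragment below shows only the added rule; in a real profile it sits alongside the full default syscalls list.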

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": ["personality"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

# Apply custom profile
docker run --rm --security-opt seccomp=/path/to/custom-seccomp.json myapp

# Show the security options configured for a container
docker inspect --format '{{.HostConfig.SecurityOpt}}' mycontainer
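One way to produce a custom profile without hand-editing a large JSON file is to load the default profile and append a single allow rule. The sketch below does that on a tiny stand-in dictionary. `BASE` is a hypothetical placeholder for the real default profile, which is much larger, and `allow_syscalls` is an illustrative helper, not a Docker API.

```python
import copy
import json


def allow_syscalls(base_profile: dict, names: list[str]) -> dict:
    """Return a copy of base_profile with one extra SCMP_ACT_ALLOW rule."""
    profile = copy.deepcopy(base_profile)  # never mutate the loaded default
    profile.setdefault("syscalls", []).append(
        {"names": sorted(names), "action": "SCMP_ACT_ALLOW"}
    )
    return profile


# Stand-in for the default profile; a real one lists hundreds of syscalls.
BASE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"},
    ],
}

custom = allow_syscalls(BASE, ["personality"])
print(json.dumps(custom, indent=2))
```

Writing `custom` to a file gives a profile that can be passed to `--security-opt seccomp=...`, and a diff against the default shows exactly which calls were opened up.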

seccomp=unconfined — the danger

--security-opt seccomp=unconfined disables syscall filtering entirely. Combined with any kernel vulnerability, the container has maximum syscall reach. --privileged implies unconfined seccomp. Some legacy apps request unconfined during migration—treat it as technical debt with a sunset date.

🔒 Security

Never run seccomp=unconfined in production. If a container fails with "operation not permitted," identify the specific blocked syscall with strace, then add a surgical allow rule—do not disable the entire filter.
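To find the specific syscall to allow, a strace log can be filtered for calls that returned EPERM. The sketch below assumes strace's usual line format (`name(args) = -1 EPERM (...)`, optionally prefixed by a PID); the sample log is invented for illustration. EPERM can also come from a missing capability or an LSM denial, so each hit is a candidate, not proof that seccomp blocked it.

```python
import re

# Optional "[pid N] " or "N " prefix, then the syscall name, then an EPERM result.
EPERM_LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*=\s+-1\s+EPERM\b"
)


def blocked_syscalls(strace_output: str) -> list[str]:
    """Return the distinct syscalls that failed with EPERM, sorted by name."""
    names = set()
    for line in strace_output.splitlines():
        match = EPERM_LINE.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


# Invented sample in strace's output format.
SAMPLE_LOG = """\
openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3
[pid    42] mount("none", "/mnt", "tmpfs", 0, NULL) = -1 EPERM (Operation not permitted)
keyctl(KEYCTL_GET_KEYRING_ID, KEY_SPEC_SESSION_KEYRING, 0) = -1 EPERM (Operation not permitted)
stat("/missing", 0x7ffd) = -1 ENOENT (No such file or directory)
"""

print(blocked_syscalls(SAMPLE_LOG))
```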

🔬 Under the Hood

Seccomp profiles live in the OCI config.json under linux.seccomp. runc installs the filter before exec. You can diff intended vs actual policy by inspecting the bundle at /run/containerd/io.containerd.runtime.v2.task/.

AppArmor & SELinux

Capabilities and seccomp limit what a process can do. Linux Security Modules (LSMs) such as AppArmor and SELinux limit what a process can access: which files, sockets, and capabilities its policy profile or label permits.

AppArmor (Ubuntu, Debian, SUSE)

Docker automatically applies the docker-default AppArmor profile on supported hosts. It denies mount operations, writes to sensitive /proc and /sys paths, and ptrace of processes outside the container's profile.

# Check AppArmor profile on a running container
docker inspect --format '{{.AppArmorProfile}}' mycontainer
# Expected: docker-default (or custom profile name)

# Explicitly set profile
docker run --security-opt apparmor=docker-default nginx:alpine

# Disable AppArmor (DANGEROUS — debugging only)
docker run --security-opt apparmor=unconfined alpine

SELinux (RHEL, Fedora, CentOS)

On SELinux-enforcing hosts, Docker labels container processes with container_t and container content with container_file_t. Bind mounts keep their existing host label, which container_t usually cannot access, unless you relabel them with the :z / :Z volume flags.

Volume suffix	Behavior	Use when
:z	Shared content label—multiple containers can read/write	Shared data directories across containers on same host
:Z	Private content label—exclusive to one container	Dedicated volume per container (preferred default)

# SELinux: private volume label
docker run -v /data/app:/app:Z myapp

# Custom SELinux level (multi-tenant MLS environments)
docker run --security-opt label=level:s0:c100,c200 myapp

# Check process context
docker exec mycontainer cat /proc/1/attr/current
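The process context printed by the last command is a colon-separated string of user, role, type, and level. The sketch below splits it so a script can assert on the type and the MCS categories. It assumes an MLS/MCS-style context with four fields; the sample value is illustrative and `parse_context` is a hypothetical helper.

```python
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_context(raw: str) -> SELinuxContext:
    """Split 'user:role:type:level'; the level itself may contain colons."""
    parts = raw.strip("\x00\n ").split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected SELinux context: {raw!r}")
    return SELinuxContext(*parts)


# Illustrative value in the format read from /proc/1/attr/current.
ctx = parse_context("system_u:system_r:container_t:s0:c100,c200\n")
print(ctx.type, ctx.level)
```

A check such as `ctx.type == "container_t"` catches containers that were started with labeling disabled.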

--security-opt reference

Option	LSM	Production guidance
apparmor=docker-default	AppArmor	Default on Ubuntu—keep it; do not set unconfined
label=type:container_t	SELinux	Default container type on RHEL—verify with ps -eZ
seccomp=builtin	Seccomp	Explicit built-in default on recent Docker releases (same as omitting the flag)
no-new-privileges:true	Kernel (no_new_privs)	Prevents setuid escalation; pair with non-root USER

🔒 Security

apparmor=unconfined and label=disable remove mandatory access control. Only use during local debugging on disposable VMs. Production containers should run with enforcing LSM and no-new-privileges:true.
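These rules lend themselves to an automated check against the SecurityOpt list that `docker inspect` reports. The sketch below flags unconfined seccomp or AppArmor, disabled SELinux labeling, and a missing no-new-privileges option. The finding strings and the function name are hypothetical, and both `key=value` and `key:value` spellings are accepted as an assumption about how options may be recorded.

```python
import re


def audit_security_opts(security_opt) -> list[str]:
    """Return human-readable findings for a HostConfig.SecurityOpt list."""
    opts: dict[str, list[str]] = {}
    for raw in security_opt or []:
        parts = re.split(r"[=:]", raw, maxsplit=1)
        value = parts[1] if len(parts) == 2 else "true"  # bare flag means true
        opts.setdefault(parts[0], []).append(value)

    findings = []
    if "unconfined" in opts.get("seccomp", []):
        findings.append("seccomp is unconfined")
    if "unconfined" in opts.get("apparmor", []):
        findings.append("AppArmor is unconfined")
    if "disable" in opts.get("label", []):
        findings.append("SELinux labeling is disabled")
    if "true" not in opts.get("no-new-privileges", []):
        findings.append("no-new-privileges is not set")
    return findings


print(audit_security_opts(["seccomp=unconfined", "label=disable"]))
print(audit_security_opts(["no-new-privileges:true", "label=type:container_t"]))
```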

📦 Real World

OpenShift runs containers with restricted SCCs that enforce SELinux container_t, dropped capabilities, and non-root UIDs by default. If your image works on plain Docker but fails on OpenShift, check USER and volume :Z labels first.

Running as non-root

Container UID 0 maps to host root in rootful Docker. Even with namespaces, a kernel escape while running as root inside the container gives the attacker maximum leverage. Run as an unprivileged numeric UID everywhere.

Dockerfile: USER instruction

FROM eclipse-temurin:21-jre-alpine

# Create dedicated group and user (UID/GID >= 10000)
RUN addgroup -g 10001 -S appgroup && \
    adduser -u 10001 -S appuser -G appgroup

WORKDIR /app
COPY --chown=appuser:appgroup target/app.jar app.jar

USER 10001:10001
ENTRYPOINT ["java", "-jar", "app.jar"]

Why numeric UID matters

Use numeric UIDs (USER 10001:10001) instead of usernames. The username may not exist in the runtime image (multi-stage builds drop /etc/passwd entries), but the numeric UID always resolves. Kubernetes runAsUser: 10001 matches exactly.

Runtime override: --user

# Override image USER at runtime
docker run --user 10001:10001 myapp

# Verify effective user inside container
docker exec mycontainer id
# uid=10001 gid=10001

chown at build time

Files created by COPY or RUN as root are owned by root. Use COPY --chown=appuser:appgroup or a RUN chown -R before switching USER. Writable paths (logs, caches, temp) must be owned by the runtime user—or mounted via volumes/tmpfs.

Read-only root filesystem + tmpfs

docker run -d \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=128m \
  --tmpfs /var/run:rw,noexec,nosuid,size=16m \
  --user 10001:10001 \
  --cap-drop ALL \
  myapp

--read-only makes the container's root filesystem immutable, so attackers cannot drop binaries or modify configs. Mount tmpfs for paths the app must write, and use the noexec,nosuid mount options so payloads written to temp directories cannot be executed.

Why root is still dangerous

Kernel escape amplification — root + CVE = host root in rootful mode
Writable application dirs — root can modify binaries, cron, and startup scripts inside the container
Package managers — apt install in production images adds attack surface at runtime
Bind mount ownership — root in container may chown host-mounted files if CAP_CHOWN is present
Compliance — the CIS Docker Benchmark recommends a non-root container user, and PCI-DSS and SOC 2 assessments commonly expect it as evidence of least privilege

🔒 Security

Distroless and scratch images have no shell. Combine non-root USER with distroless bases so attackers cannot docker exec into a shell even if they exploit the app. Debugging uses ephemeral debug containers (Kubernetes kubectl debug).

⚠️ Pitfall

Switching to USER app too early breaks later RUN steps that need root, such as package installs. Pattern: do all root operations in early layers, chown, then set USER as the final instruction before ENTRYPOINT.
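A lint step can confirm that the final stage really ends up with a numeric, non-zero USER. The sketch below is a deliberately simple line scanner: it ignores line continuations and cannot see a USER inherited from the base image, so a final stage without its own USER is treated as failing. Both function names are hypothetical.

```python
def final_stage_user(dockerfile: str):
    """Return the last USER value set in the final build stage, or None."""
    user = None
    for line in dockerfile.splitlines():
        tokens = line.strip().split()
        if not tokens:
            continue
        instruction = tokens[0].upper()
        if instruction == "FROM":
            user = None  # each stage starts over
        elif instruction == "USER" and len(tokens) > 1:
            user = tokens[1]
    return user


def runs_as_numeric_non_root(dockerfile: str) -> bool:
    user = final_stage_user(dockerfile)
    if user is None:
        return False
    uid = user.split(":")[0]
    return uid.isdigit() and int(uid) != 0


GOOD = """\
FROM eclipse-temurin:21-jre-alpine
RUN addgroup -g 10001 -S appgroup && adduser -u 10001 -S appuser -G appgroup
USER 10001:10001
ENTRYPOINT ["java", "-jar", "app.jar"]
"""

print(final_stage_user(GOOD), runs_as_numeric_non_root(GOOD))
```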

Secrets management

Image layers are immutable and inspectable. Any secret baked into a layer survives docker history, registry exports, and backup tapes. Inject credentials at runtime, never at build time via ARG or ENV.

Never use ARG or ENV for secrets

# ❌ ANTI-PATTERN — secret persists in image history forever
ARG DB_PASSWORD
ENV DB_PASSWORD=${DB_PASSWORD}
RUN curl -u admin:${DB_PASSWORD} https://internal/api

# Anyone can recover it:
# docker history --no-trunc myapp:latest

🔒 Security

ENV values appear in docker inspect, process listings (/proc/1/environ), and crash dumps. CI systems that pass --build-arg for npm tokens or API keys have leaked credentials in public Docker Hub images—scan your image history in every PR.
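A cheap PR-time guard is to flag ARG and ENV declarations whose names look like credentials. The sketch below is a heuristic, not a secret scanner: the name patterns are an assumption, it checks only the first variable on each line, and it will miss secrets with innocuous names.

```python
import re

SECRET_NAME = re.compile(r"PASSWORD|PASSWD|SECRET|TOKEN|API_?KEY|PRIVATE_KEY", re.I)
DECLARATION = re.compile(r"^\s*(ARG|ENV)\s+([A-Za-z_][A-Za-z0-9_]*)", re.I)


def secret_like_declarations(dockerfile: str) -> list[tuple[int, str, str]]:
    """Return (line number, instruction, variable) for suspicious ARG/ENV lines."""
    hits = []
    for lineno, line in enumerate(dockerfile.splitlines(), start=1):
        match = DECLARATION.match(line)
        if match and SECRET_NAME.search(match.group(2)):
            hits.append((lineno, match.group(1).upper(), match.group(2)))
    return hits


SAMPLE_DOCKERFILE = """\
FROM alpine
ARG DB_PASSWORD
ENV DB_PASSWORD=${DB_PASSWORD}
ENV APP_PORT=8080
"""

for hit in secret_like_declarations(SAMPLE_DOCKERFILE):
    print(hit)
```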

BuildKit secret mount (build-time only)

When a build must reach a private registry or API, use BuildKit secret mounts. Secrets are available only during the RUN line—they never commit to a layer.

# syntax=docker/dockerfile:1
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci --omit=dev
COPY . .
RUN npm run build

# Pass secret from file at build time (not stored in image)
DOCKER_BUILDKIT=1 docker build \
  --secret id=npmrc,src=$HOME/.npmrc \
  -t myapp .

Runtime secret delivery

Mechanism	How it works	Best for
Docker Swarm secrets	Encrypted at rest, mounted as files in /run/secrets/	Swarm deployments; file-based app config
HashiCorp Vault	Dynamic credentials with TTL; AppRole/K8s auth	Short-lived DB passwords, PKI certificates
AWS Secrets Manager	Central store with rotation Lambda; IAM-scoped access	ECS/EKS workloads on AWS; automatic rotation
Azure Key Vault / GCP SM	Cloud-native secret stores with audit logging	Multi-service cloud estates
Environment injection (orchestrator)	K8s Secrets / ECS task defs inject at start—not in image	12-factor apps; pair with encryption at rest

# Docker Compose — Swarm secrets (deployed stack)
services:
  api:
    image: myapp:latest
    secrets:
      - db_password
    environment:
      DB_PASSWORD_FILE: /run/secrets/db_password

secrets:
  db_password:
    external: true
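On the application side, the DB_PASSWORD_FILE convention used above needs a few lines of glue. The sketch below prefers the file named by `<NAME>_FILE` and falls back to the plain variable. `read_secret` is a hypothetical helper, and the `_FILE` suffix is a common convention that the application has to implement itself.

```python
import os
from pathlib import Path


def read_secret(name: str, env=os.environ) -> str:
    """Read a secret from the file in <NAME>_FILE, else from <NAME>."""
    file_path = env.get(f"{name}_FILE")
    if file_path:
        # Secret files usually end with a newline; strip it.
        return Path(file_path).read_text().strip()
    value = env.get(name)
    if value is None:
        raise KeyError(f"neither {name}_FILE nor {name} is set")
    return value


if __name__ == "__main__":
    try:
        password = read_secret("DB_PASSWORD")
        print("secret loaded, length", len(password))
    except (KeyError, OSError) as exc:
        print("secret unavailable:", exc)
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.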

Rotation

Prefer short-lived credentials — Vault dynamic secrets or AWS SM rotation over static files
Reload without rebuild — app reads secret file on interval or receives SIGHUP; no image rebuild for rotation
Dual-write window — during rotation, accept both old and new credentials for a bounded period
Audit access — log every secret read with principal identity; alert on anomalous patterns
Revoke on incident — runbook to invalidate tokens, roll pods/containers, and scan registry for leaked layers
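The reload-without-rebuild item can be as small as a wrapper that re-reads the secret file once a cached value is older than some interval. The sketch below shows that idea; `FileSecret`, the 30-second default, and the injectable clock are illustrative choices, not part of any library.

```python
import os
import tempfile
import time
from pathlib import Path


class FileSecret:
    """Cache a file-based secret and re-read it after max_age_seconds."""

    def __init__(self, path, max_age_seconds: float = 30.0, clock=time.monotonic):
        self._path = Path(path)
        self._max_age = max_age_seconds
        self._clock = clock
        self._value = None
        self._loaded_at = None

    def get(self) -> str:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self._value = self._path.read_text().strip()
            self._loaded_at = now
        return self._value


# Demo with a temporary file standing in for /run/secrets/db_password.
demo_path = os.path.join(tempfile.mkdtemp(), "db_password")
Path(demo_path).write_text("initial\n")
secret = FileSecret(demo_path, max_age_seconds=0)  # 0 forces a re-read each call
print(secret.get())
Path(demo_path).write_text("rotated\n")
print(secret.get())
```

With a non-zero interval the application picks up a rotated value within that window, which is why the dual-write period on the server side should be at least as long.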

🔒 Security

Never commit .env files to git. Add them to .dockerignore and .gitignore. Use git-secrets or trufflehog in CI to catch accidental credential commits before they reach the build context.

💡 Pro Tip

Read secrets from files, not environment variables, when possible: file-based injection (/run/secrets/db_password) avoids exposure in ps e output and core dumps. Spring Boot can load such a directory with spring.config.import=optional:configtree:/run/secrets/.

Network security

Default bridge networking allows containers to reach each other by IP. Without segmentation, one compromised container can scan and attack every co-located service. Treat container networks as zero-trust east-west paths.

Inter-container communication (ICC)

By default, containers on the same bridge can ping each other. Disable ICC on the docker0 bridge when containers do not need direct peer access (legacy standalone Docker hosts):

// /etc/docker/daemon.json
{
  "icc": false
}

In Compose and Swarm, use multiple networks to segment tiers—frontend on web, API on backend, database only attached to backend.

services:
  web:
    networks: [frontend]
  api:
    networks: [frontend, backend]
  db:
    networks: [backend]   # not reachable from web tier

networks:
  frontend:
  backend:
    internal: true       # no external routing
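Segmentation like this can be verified before deployment by computing which services share at least one network. The sketch below takes a service-to-networks mapping, written by hand here to mirror the Compose file above, and returns every pair that can talk directly. The function name is hypothetical.

```python
from itertools import combinations


def shared_network_pairs(services: dict[str, set[str]]) -> set[frozenset[str]]:
    """Return each pair of services attached to at least one common network."""
    return {
        frozenset((a, b))
        for a, b in combinations(services, 2)
        if services[a] & services[b]
    }


# Mirrors the Compose example: web -> frontend, api -> both, db -> backend.
SERVICES = {
    "web": {"frontend"},
    "api": {"frontend", "backend"},
    "db": {"backend"},
}

pairs = shared_network_pairs(SERVICES)
print(sorted(sorted(pair) for pair in pairs))
```

A policy test can then assert that sensitive pairs such as web and db never appear in the result.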

Network segmentation patterns

Pattern	Isolation level	When to use
Multi-bridge (Compose networks)	Logical tier separation	Default starting point for multi-service stacks
internal: true	No outbound internet on network	Databases, message brokers—no external egress needed
Overlay + encrypted	Cross-host with IPsec mesh (Swarm)	Multi-host Swarm clusters
Macvlan / IPvlan	Container on physical network segment	Legacy apps needing LAN IPs; strict firewall rules required
Service mesh (K8s)	mTLS + L7 policy per workload	Production Kubernetes—NetworkPolicy + Istio/Linkerd

Bind to loopback, not 0.0.0.0

Applications listening on 0.0.0.0 accept connections from any container on the shared network. Configure apps to bind 127.0.0.1 for admin/metrics endpoints, and expose only through a reverse proxy with authentication. Use -p 127.0.0.1:8080:8080 to limit host-side exposure during local dev.
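At the socket level the difference is just the address passed to bind. The sketch below opens a listener on 127.0.0.1 with an ephemeral port and prints where it is actually bound. `listen` is a hypothetical helper. Note that 127.0.0.1 inside a container is the container's own loopback, so an endpoint bound this way is reachable only from processes sharing that network namespace, not through a published port.

```python
import socket


def listen(host: str, port: int = 0) -> socket.socket:
    """Open a TCP listener; port 0 asks the OS for a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


# Admin/metrics endpoint: loopback only.
admin = listen("127.0.0.1")
print("admin listener bound to", admin.getsockname()[0])
admin.close()
```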

No privileged or host networking in production

Flag	Risk	Production stance
--net=host	Container shares host network stack—no namespace isolation; can bind host ports, sniff traffic	Ban in prod; use port publishing or ingress controller
--privileged	Full capabilities + device access + unconfined seccomp	Ban in prod; use specific cap-add if absolutely required
--pid=host	See and signal all host processes	Monitoring agents only; isolate with strict RBAC
--ipc=host	Shared memory with host and other containers	Legacy SHM apps only; document exception
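The four flags in this table map to fields in the HostConfig section of `docker inspect`, so a ban can be checked automatically. The sketch below reads Privileged, NetworkMode, PidMode, and IpcMode from a HostConfig dictionary; the sample values and the function name are illustrative.

```python
def host_namespace_findings(host_config: dict) -> list[str]:
    """List the risky flags a container was started with."""
    findings = []
    if host_config.get("Privileged"):
        findings.append("--privileged")
    for field, flag in (
        ("NetworkMode", "--net=host"),
        ("PidMode", "--pid=host"),
        ("IpcMode", "--ipc=host"),
    ):
        if host_config.get(field) == "host":
            findings.append(flag)
    return findings


# Illustrative HostConfig fragment.
SAMPLE_HOST_CONFIG = {
    "Privileged": False,
    "NetworkMode": "host",
    "PidMode": "",
    "IpcMode": "private",
}

print(host_namespace_findings(SAMPLE_HOST_CONFIG))
```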

🔒 Security

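Published ports (-p 8080:80) listen on all host interfaces by default, and Docker's own iptables rules can bypass host firewalls such as ufw. On cloud VMs, security groups must restrict source IPs. Publish on 127.0.0.1 (-p 127.0.0.1:8080:80) if only local access is needed. Never expose the Docker daemon's unencrypted TCP port 2375; if remote API access is required, protect it with mutual TLS or use SSH.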

🔒 Security

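Docker's embedded DNS resolves container and service names on user-defined networks (the default bridge offers no name resolution, but peers are still reachable by IP). A compromised container can enumerate peers via DNS and port scanning. Segment networks so databases are unreachable from front-end containers that face users.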

⚖️ Trade-off

--net=host reduces NAT overhead for high-throughput networking (video, HFT). If you must use it, run on dedicated bare-metal nodes with no other tenants, strict SELinux, and automated compliance scanning.

Image hardening checklist

Gate every image before production. This checklist consolidates build-time, runtime, and operational controls; use it in PR reviews, CI pipelines, and release sign-off.

Control	Requirement	Status
✅ Non-root user	USER with numeric UID ≥ 10000; no runtime root	Pass / Fail
✅ Capabilities dropped	--cap-drop ALL; only explicit --cap-add for required caps	Pass / Fail
✅ No privileged mode	--privileged absent from prod manifests	Pass / Fail
✅ Seccomp enforced	Default profile or custom JSON—never unconfined	Pass / Fail
✅ LSM active	AppArmor docker-default or SELinux container_t	Pass / Fail
✅ Read-only rootfs	--read-only with tmpfs/volumes for writable paths	Pass / Fail
✅ No new privileges	--security-opt no-new-privileges:true	Pass / Fail
✅ Minimal base image	distroless, alpine, slim, or UBI micro, not a full general-purpose OS image	Pass / Fail
✅ Multi-stage build	Build tools, compilers, and test deps excluded from final stage	Pass / Fail
✅ Base image pinned	Tag + digest (@sha256:…)—not bare :latest	Pass / Fail
✅ No secrets in layers	No ARG/ENV credentials; BuildKit secrets for build-time needs	Pass / Fail
✅ .dockerignore present	Excludes .git, .env, keys, and build artifacts	Pass / Fail
✅ Resource limits set	--memory, --cpus, --pids-limit configured	Pass / Fail
✅ HEALTHCHECK defined	Dockerfile or orchestrator probe on real readiness path	Pass / Fail
✅ Image scanned in CI	Trivy, Grype, or Docker Scout—no unresolved critical CVEs	Pass / Fail
✅ Image signed	Cosign/Sigstore signature verified at deploy	Pass / Fail
✅ SBOM generated	SPDX or CycloneDX attached for supply chain audit	Pass / Fail
✅ No docker.sock mount	Socket not mounted into application containers	Pass / Fail
✅ No sensitive host binds	/etc, /root, /var/run not mounted	Pass / Fail
✅ Network segmented	DB on internal network; ICC restricted; no --net=host	Pass / Fail
✅ Loopback bind for admin	Metrics/debug on 127.0.0.1—not 0.0.0.0	Pass / Fail
✅ Secrets at runtime	Vault, AWS SM, or Swarm/K8s secrets—rotatable without rebuild	Pass / Fail
✅ Logging without secrets	App and sidecar logs redact tokens, passwords, PII	Pass / Fail
✅ Incident runbook	Documented steps for CVE response, key rotation, compromised image	Pass / Fail
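Several checklist rows can be tested directly against the Dockerfile. As one example, the sketch below checks the "Base image pinned" row by listing FROM images that lack an @sha256 digest, skipping scratch and references to earlier build stages. It does not resolve ARG-based image names, the function name is hypothetical, and the all-zero digest in the sample is a placeholder.

```python
import re

FROM_LINE = re.compile(r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)", re.I)
STAGE_ALIAS = re.compile(r"\s+AS\s+(\S+)", re.I)


def unpinned_base_images(dockerfile: str) -> list[str]:
    """Return FROM images that are not pinned by digest."""
    stages, unpinned = set(), []
    for line in dockerfile.splitlines():
        match = FROM_LINE.match(line)
        if not match:
            continue
        image = match.group(1)
        is_stage = image.lower() in stages  # FROM <earlier stage>
        if not is_stage and image != "scratch" and "@sha256:" not in image:
            unpinned.append(image)
        alias = STAGE_ALIAS.search(line)
        if alias:
            stages.add(alias.group(1).lower())
    return unpinned


PLACEHOLDER_DIGEST = "0" * 64  # not a real digest
SAMPLE_BUILD = f"""\
FROM node:20-alpine AS build
FROM build AS test
FROM gcr.io/distroless/nodejs20@sha256:{PLACEHOLDER_DIGEST}
"""

print(unpinned_base_images(SAMPLE_BUILD))
```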

Quick validation commands

# Scan image for CVEs
trivy image --severity HIGH,CRITICAL myapp:latest

# Check image history for leaked secrets
docker history --no-trunc myapp:latest | grep -iE 'password|secret|key|token'

# Verify non-root and security opts at runtime
docker inspect mycontainer --format 'User={{.Config.User}} Privileged={{.HostConfig.Privileged}} ReadonlyRootfs={{.HostConfig.ReadonlyRootfs}} SecurityOpt={{.HostConfig.SecurityOpt}}'

# Lint Dockerfile before build
hadolint Dockerfile
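The resource-limit row of the checklist can be validated from the same `docker inspect` data. The sketch below treats a zero or missing Memory, NanoCpus/CpuQuota, or PidsLimit value in HostConfig as "no limit set". The function name and the sample values are illustrative.

```python
def missing_limits(host_config: dict) -> list[str]:
    """Return the resource-limit flags that appear to be unset."""
    missing = []
    if not host_config.get("Memory"):
        missing.append("--memory")
    if not host_config.get("NanoCpus") and not host_config.get("CpuQuota"):
        missing.append("--cpus")
    pids = host_config.get("PidsLimit")
    if pids is None or pids <= 0:
        missing.append("--pids-limit")
    return missing


# Illustrative values: 256 MiB, half a CPU, 100 PIDs.
LIMITED = {"Memory": 268435456, "NanoCpus": 500000000, "PidsLimit": 100}
UNLIMITED = {"Memory": 0, "NanoCpus": 0, "PidsLimit": None}

print(missing_limits(LIMITED))
print(missing_limits(UNLIMITED))
```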

🔒 Security

Make the checklist a CI gate, not a wiki page. Fail builds on hadolint warnings, Trivy critical CVEs, and root USER in the final stage. Architect approval required for any checklist exception—with expiry date and compensating controls documented.

🎯 Interview Tip

"How do you secure containers?" — Walk the layers: supply chain (pin, scan, sign), image (non-root, minimal), runtime (cap-drop, seccomp, read-only), secrets (runtime injection), network (segmentation). Mention that containers are not VMs and kernel sharing means defense in depth is mandatory.