Container Security
Containers share the host kernel—security is layered confinement, not perimeter magic. Every production workload needs a threat model, dropped capabilities, seccomp enforcement, non-root execution, secret hygiene, and network segmentation. Skip any layer and a single CVE becomes a host compromise.
Container security threat model
Before hardening flags, name what you are defending against. Containers are process isolation on a shared kernel, so attackers target escape paths, privilege expansion, poisoned images, and lateral movement across flat networks.
Attack surface map
flowchart TB
subgraph external [External threats]
REG[Compromised registry image]
API[Exposed container API]
end
subgraph container [Container boundary]
APP[Application vulnerability]
ESC[Kernel escape / breakout]
PRIV[Privilege escalation inside container]
end
subgraph host [Host impact]
SOCK[docker.sock access]
HOST[Host filesystem / processes]
LAT[Lateral movement to other containers]
end
REG --> APP
API --> APP
APP --> PRIV
PRIV --> ESC
ESC --> HOST
SOCK --> HOST
HOST --> LAT
Six threat categories
| Threat | What happens | Example vector | Primary control |
|---|---|---|---|
| Container escape | Process breaks namespace/cgroup/seccomp confinement and acts on the host | Kernel CVE + --privileged container | Drop capabilities, default seccomp, patch kernel, no privileged mode |
| Privilege escalation | UID 0 or excessive capabilities inside container enable host-adjacent actions | Writable /etc/passwd, setuid binaries, CAP_SYS_ADMIN | Non-root USER, --cap-drop ALL, read-only rootfs |
| Supply chain | Malicious or vulnerable code enters via base image, dependency, or build pipeline | Typosquatted image on Docker Hub, compromised maintainer | Digest pinning, signing, SBOM, CI scanning, private registry |
| Secrets exposure | Credentials persist in image layers, env vars, logs, or inspect output | ENV DB_PASSWORD=... in Dockerfile | Runtime secret mounts, Vault/SM, BuildKit secrets, never ARG/ENV |
| Resource abuse | Container consumes all CPU/RAM/PIDs and destabilizes co-located workloads | Fork bomb, memory leak without limits | --memory, --cpus, --pids-limit |
| Lateral movement | Compromised container reaches other services or the Docker API | Flat bridge network + mounted docker.sock | Network segmentation, ICC disabled, no socket mounts, mTLS between services |
Trust boundaries
- Image → runtime — treat every pulled image as untrusted until scanned and signed
- Container → host — assume kernel bugs exist; minimize capabilities that amplify exploits
- Container → container — default bridge allows east-west traffic; segment explicitly
- Build → registry — CI runners are high-value targets; isolate builders, rotate credentials
Mounting /var/run/docker.sock equals giving the container root on the host. Any code inside can spawn privileged siblings, read all containers, and exfiltrate secrets from volumes. Ban this pattern in production—use dedicated APIs or operators instead.
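One way to enforce the socket ban is to check container configuration for socket mounts. The sketch below assumes you have already parsed one element of `docker inspect` output into a dict; the function name `socket_mounts` and the sample data are hypothetical, and only the `Mounts` and `HostConfig.Binds` fields are examined.

```python
SOCKET_PATHS = ("/var/run/docker.sock", "/run/docker.sock")

def socket_mounts(inspect_entry: dict) -> list[str]:
    """Return host paths of Docker sockets mounted into the container."""
    found = []
    # "Mounts" holds resolved mounts; each item has "Source" and "Destination".
    for mount in inspect_entry.get("Mounts") or []:
        source = mount.get("Source", "")
        if source in SOCKET_PATHS and source not in found:
            found.append(source)
    # "HostConfig.Binds" holds raw -v strings shaped like "src:dst[:opts]".
    for bind in (inspect_entry.get("HostConfig") or {}).get("Binds") or []:
        source = bind.split(":", 1)[0]
        if source in SOCKET_PATHS and source not in found:
            found.append(source)
    return found

# Hypothetical sample shaped like one element of `docker inspect` output.
sample = {
    "Mounts": [{"Type": "bind", "Source": "/var/run/docker.sock",
                "Destination": "/var/run/docker.sock"}],
    "HostConfig": {"Binds": ["/var/run/docker.sock:/var/run/docker.sock"]},
}
print(socket_mounts(sample))
```

A non-empty result is a reason to fail the deployment check rather than a warning to log.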
"How is container security different from VM security?" — VMs have a hypervisor boundary and separate kernels. Containers rely on kernel primitives (namespaces, cgroups, capabilities, seccomp, LSM). Defense in depth means assuming any single layer can fail.
Linux capabilities
Traditional Unix root is all-or-nothing. Linux capabilities split root powers into granular privileges. Docker drops most capabilities by default—but the remaining set is still dangerous if an app is compromised.
Default: what Docker drops vs keeps
Docker starts from the full capability set, then drops everything not in the allowlist. The default container retains only these capabilities (14 on most kernels):
| Kept by default | Risk if abused |
|---|---|
| CAP_CHOWN, CAP_FOWNER | Change file ownership inside container—useful for volume permission fixes |
| CAP_DAC_OVERRIDE | Bypass file permission checks—dangerous on writable mounts |
| CAP_SETUID, CAP_SETGID, CAP_SETPCAP | Change process UID/GID and adjust capability sets; a compromised root process can assume any identity |
| CAP_NET_BIND_SERVICE | Bind ports < 1024 without root—needed for nginx on port 80 |
| CAP_NET_RAW | Raw sockets—enables packet sniffing and ARP spoofing inside network namespace |
| CAP_KILL | Send signals to any process in the same PID namespace |
| CAP_SYS_CHROOT | Change root directory—needed by some legacy installers |
| CAP_AUDIT_WRITE, CAP_MKNOD, CAP_FSETID, CAP_SETFCAP | Lower risk in typical app containers; still unnecessary for many workloads |
Dropped by default (never available unless explicitly added):
- CAP_SYS_ADMIN — mount filesystems, many namespace operations, and a long tail of other privileged actions; primary enabler of container escapes
- CAP_NET_ADMIN — configure interfaces, iptables, routing tables
- CAP_SYS_PTRACE — debug/trace any process—read memory of other containers on shared PID namespace
- CAP_SYS_MODULE, CAP_BPF, CAP_PERFMON — kernel introspection and modification
Production pattern: drop all, add minimum
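# Web server listening on port 80 as non-root
# nginx needs writable cache and pid paths when the root filesystem is read-only
docker run -d \
--cap-drop ALL \
--cap-add NET_BIND_SERVICE \
--user 1000:1000 \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid,size=64m \
--tmpfs /var/cache/nginx:rw,noexec,nosuid,size=64m \
--tmpfs /var/run:rw,noexec,nosuid,size=1m \
-p 8080:80 \
nginx:alpine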
# Inspect the configured capability drops and adds (not the effective set)
docker inspect --format '{{.HostConfig.CapDrop}} {{.HostConfig.CapAdd}}' mycontainer
The --privileged flag
--privileged grants all capabilities, disables seccomp filtering and AppArmor/SELinux confinement, and gives access to all host devices. It exists for Docker-in-Docker and hardware debugging, not production apps. The CIS Docker Benchmark recommends that privileged containers not be used.
Inspect capabilities with capsh
$ docker run --rm alpine capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
Bounding set =cap_chown,cap_dac_override,...
$ docker run --rm --cap-drop ALL --cap-add NET_BIND_SERVICE alpine capsh --print
Current: cap_net_bind_service=ep
Bounding set =cap_net_bind_service
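When capsh is not available in an image, the same information can be read from the `CapEff` line of `/proc/<pid>/status`, which is a hexadecimal bitmask. The following is a minimal sketch of decoding that mask, assuming the bit order from `linux/capability.h`; the helper name `decode_caps` is hypothetical.

```python
# Capability names indexed by bit number, following linux/capability.h.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]

def decode_caps(hex_mask: str) -> list[str]:
    """Translate a CapEff/CapBnd hex mask into capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]

# 00000000a80425fb is the mask formed by the 14 default capabilities listed above.
default_caps = decode_caps("00000000a80425fb")
print(len(default_caps), "CAP_SYS_ADMIN" in default_caps)

# Mask with only bit 10 set, as after --cap-drop ALL --cap-add NET_BIND_SERVICE.
print(decode_caps("0000000000000400"))
```

To obtain the mask from a running container, something like `docker exec mycontainer grep CapEff /proc/1/status` should work as long as the image ships grep.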
CAP_NET_RAW is rarely needed in production. Drop it unless your app genuinely requires raw sockets (e.g. ping utilities). Combined with a bridge network, it enables traffic interception between co-located containers.
Adding --cap-add SYS_ADMIN to "fix" mount issues is an anti-pattern. It is equivalent to handing an attacker the keys to the host kernel. Fix permissions with volumes, init containers, or proper USER/chown instead.
In Kubernetes, mirror Docker's cap-drop pattern with securityContext.capabilities.drop: [ALL] and explicit add. The restricted Pod Security Standard enforces this cluster-wide.
Seccomp profiles
Seccomp (secure computing mode) filters syscalls at the kernel boundary. Even with dropped capabilities, dangerous syscalls like mount, ptrace, or keyctl can enable escapes. Docker's default profile blocks ~44 syscalls.
Default Docker seccomp profile
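Unless overridden, Docker applies its default seccomp profile, which is compiled into dockerd (the source lives at profiles/seccomp/default.json in the moby repository). The profile is an allowlist: its defaultAction is SCMP_ACT_ERRNO, and it explicitly allows the several hundred syscalls that normal applications need, so anything unlisted fails with an error.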
| Blocked syscall (sample) | Why it is blocked |
|---|---|
| mount, umount2, pivot_root | Filesystem manipulation—escape via mounting host paths |
| ptrace | Trace other processes and read their memory; blocked by the profile on kernels older than 4.8, and on newer kernels still limited by the dropped CAP_SYS_PTRACE |
| reboot, kexec_load | Host lifecycle control |
| swapon, swapoff | Modify host swap configuration |
| init_module, finit_module | Load kernel modules |
| acct, settimeofday | System accounting and clock changes |
| clone with unsafe flags | Prevent namespace creation that bypasses isolation |
The full default profile blocks approximately 44 syscalls (plus flag-conditioned rules for clone, socket, and arch_prctl). Architectures may vary slightly.
Custom seccomp JSON
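When an application legitimately needs a blocked syscall (rare), copy Docker's default profile and add an explicit allow rule to it. Seccomp JSON has no inheritance mechanism, so the custom file must contain the complete allowlist. The fragment below shows only the shape of an added rule: used on its own it would deny every other syscall and the container would fail to start, so merge the rule into the syscalls array of the copied default.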
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
"syscalls": [
{
"names": ["personality"],
"action": "SCMP_ACT_ALLOW"
}
]
}
# Apply custom profile
docker run --rm --security-opt seccomp=/path/to/custom-seccomp.json myapp
# Check which security options the container was started with
docker inspect --format '{{.HostConfig.SecurityOpt}}' mycontainer
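Adding an allow rule to a copy of the default profile can be scripted so the change stays reviewable. This is a sketch under stated assumptions: `add_allowed_syscalls` is a hypothetical helper, and `base` is a tiny stand-in for the real default profile, which you would load from a JSON file instead.

```python
import copy
import json

def add_allowed_syscalls(base_profile: dict, names: list[str]) -> dict:
    """Return a copy of base_profile with one extra unconditional allow rule."""
    profile = copy.deepcopy(base_profile)  # never mutate the loaded default
    profile.setdefault("syscalls", []).append(
        {"names": sorted(names), "action": "SCMP_ACT_ALLOW"}
    )
    return profile

# Stand-in for Docker's default profile (the real one lists hundreds of syscalls).
base = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"}
    ],
}

custom = add_allowed_syscalls(base, ["personality"])
print(json.dumps(custom, indent=2))
```

Writing `custom` to a file with `json.dump` gives a profile suitable for `--security-opt seccomp=<path>`, and a diff against the default shows exactly the one rule that was added.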
seccomp=unconfined — the danger
--security-opt seccomp=unconfined disables syscall filtering entirely. Combined with any kernel vulnerability, the container has maximum syscall reach. --privileged implies unconfined seccomp. Some legacy apps request unconfined during migration—treat it as technical debt with a sunset date.
Never run seccomp=unconfined in production. If a container fails with "operation not permitted," identify the specific blocked syscall with strace, then add a surgical allow rule—do not disable the entire filter.
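To find the syscall behind an "operation not permitted" failure, the strace log can be filtered for calls that returned EPERM. The sketch below assumes strace's usual line layout (optionally prefixed with a pid); `denied_syscalls` and the sample log are hypothetical. EPERM can also come from a missing capability or an LSM denial, so treat the result as a list of candidates rather than proof of a seccomp block.

```python
import re
from collections import Counter

# Matches lines such as:
#   mount("none", "/mnt", "tmpfs", 0, NULL) = -1 EPERM (Operation not permitted)
#   [pid    42] keyctl(...) = -1 EPERM (Operation not permitted)
EPERM_LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*\)\s+=\s+-1\s+EPERM\b"
)

def denied_syscalls(strace_output: str) -> Counter:
    """Count syscalls that failed with EPERM in strace output."""
    counts = Counter()
    for line in strace_output.splitlines():
        match = EPERM_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical strace excerpt for illustration only.
sample_log = """
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
mount("none", "/mnt", "tmpfs", 0, NULL) = -1 EPERM (Operation not permitted)
[pid    42] keyctl(KEYCTL_GET_KEYRING_ID, KEY_SPEC_SESSION_KEYRING, 0) = -1 EPERM (Operation not permitted)
mount("proc", "/proc", "proc", 0, NULL) = -1 EPERM (Operation not permitted)
"""

denied = denied_syscalls(sample_log)
print(denied.most_common())
```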
Seccomp profiles live in the OCI config.json under linux.seccomp. runc installs the filter before exec. You can diff intended vs actual policy by inspecting the bundle at /run/containerd/io.containerd.runtime.v2.task/.
AppArmor & SELinux
Capabilities and seccomp limit what a process can do. Linux Security Modules (LSMs) such as AppArmor and SELinux limit which files, sockets, and capabilities a process may use, based on policy profiles or labels.
AppArmor (Ubuntu, Debian, SUSE)
Docker automatically applies the docker-default AppArmor profile on supported hosts. It restricts mount operations, raw network access, /proc writes, and ptrace across containers.
# Check AppArmor profile on a running container
docker inspect --format '{{.AppArmorProfile}}' mycontainer
# Expected: docker-default (or custom profile name)
# Explicitly set profile
docker run --security-opt apparmor=docker-default nginx:alpine
# Disable AppArmor (DANGEROUS — debugging only)
docker run --security-opt apparmor=unconfined alpine
SELinux (RHEL, Fedora, CentOS)
On SELinux-enforcing hosts, Docker labels container processes with container_t and container content with container_file_t. Bind mounts keep their host label, which container_t usually cannot access, unless you relabel them with the :z or :Z volume flags.
| Volume suffix | Behavior | Use when |
|---|---|---|
| :z | Shared content label—multiple containers can read/write | Shared data directories across containers on same host |
| :Z | Private content label—exclusive to one container | Dedicated volume per container (preferred default) |
# SELinux: private volume label
docker run -v /data/app:/app:Z myapp
# Custom SELinux level (multi-tenant MLS environments)
docker run --security-opt label=level:s0:c100,c200 myapp
# Check process context
docker exec mycontainer cat /proc/1/attr/current
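The context string read from `/proc/1/attr/current` has the form `user:role:type:level`. As a small illustration, the sketch below splits it into fields so a script can assert on the type and MCS categories. `parse_context` is a hypothetical helper, the sample string is made up, and the code assumes an MLS/MCS policy where the level field is present.

```python
from typing import NamedTuple

class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str

def parse_context(raw: str) -> SELinuxContext:
    """Split an SELinux context of the form user:role:type:level."""
    # The kernel may NUL-terminate the attribute; strip that and whitespace.
    text = raw.strip("\x00\n ")
    # The level can itself contain ":" (s0:c100,c200), so split at most 3 times.
    user, role, type_, level = text.split(":", 3)
    return SELinuxContext(user, role, type_, level)

# Hypothetical context, matching the level used in the command above.
ctx = parse_context("system_u:system_r:container_t:s0:c100,c200\x00")
print(ctx.type, ctx.level)
```

A context without a level field raises `ValueError` here, which is a reasonable signal that the host is not running the policy type this check expects.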
--security-opt reference
| Option | LSM | Production guidance |
|---|---|---|
| apparmor=docker-default | AppArmor | Default on Ubuntu—keep it; do not set unconfined |
| label=type:container_t | SELinux | Default container type on RHEL—verify with ps -eZ |
| seccomp=builtin | Seccomp | Explicit default on recent Docker releases (same as omitting the flag) |
| no-new-privileges:true | Any (kernel flag, not an LSM) | Prevents privilege gain through setuid binaries or file capabilities; pair with non-root USER |
apparmor=unconfined and label=disable remove mandatory access control. Only use during local debugging on disposable VMs. Production containers should run with enforcing LSM and no-new-privileges:true.
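These rules can be checked mechanically against a container's configuration. The sketch below is one possible shape, assuming a dict parsed from one element of `docker inspect` output; `lsm_findings`, `split_opt`, and the sample entries are hypothetical, and only the `AppArmorProfile` and `HostConfig.SecurityOpt` fields are read.

```python
import re

def split_opt(opt: str) -> tuple[str, str]:
    """Split 'key=value' or the older 'key:value' spelling at the first separator."""
    parts = re.split(r"[=:]", opt, maxsplit=1)
    return parts[0], parts[1] if len(parts) == 2 else ""

def lsm_findings(inspect_entry: dict) -> list[str]:
    """List weakened security options in one `docker inspect` entry."""
    host = inspect_entry.get("HostConfig") or {}
    findings = []
    no_new_privs = False
    for opt in host.get("SecurityOpt") or []:
        key, value = split_opt(opt)
        if key in ("apparmor", "seccomp") and value == "unconfined":
            findings.append(f"{key} is unconfined")
        elif key == "label" and value == "disable":
            findings.append("SELinux labeling is disabled")
        elif key == "no-new-privileges" and value in ("", "true"):
            no_new_privs = True
    unconfined_profile = inspect_entry.get("AppArmorProfile") == "unconfined"
    if unconfined_profile and "apparmor is unconfined" not in findings:
        findings.append("apparmor is unconfined")
    if not no_new_privs:
        findings.append("no-new-privileges is not set")
    return findings

# Hypothetical entries for illustration.
weak = {"AppArmorProfile": "unconfined",
        "HostConfig": {"SecurityOpt": ["apparmor=unconfined", "label:disable"]}}
hardened = {"AppArmorProfile": "docker-default",
            "HostConfig": {"SecurityOpt": ["no-new-privileges:true"]}}
print(lsm_findings(weak))
print(lsm_findings(hardened))
```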
OpenShift runs containers with restricted SCCs that enforce SELinux container_t, dropped capabilities, and non-root UIDs by default. If your image works on plain Docker but fails on OpenShift, check USER and volume :Z labels first.
Running as non-root
Container UID 0 maps to host root in rootful Docker. Even with namespaces, a kernel escape while running as root inside the container gives the attacker maximum leverage. Run as an unprivileged numeric UID everywhere.
Dockerfile: USER instruction
FROM eclipse-temurin:21-jre-alpine
# Create dedicated group and user (UID/GID >= 10000)
RUN addgroup -g 10001 -S appgroup && \
adduser -u 10001 -S appuser -G appgroup
WORKDIR /app
COPY --chown=appuser:appgroup target/app.jar app.jar
USER 10001:10001
ENTRYPOINT ["java", "-jar", "app.jar"]
Why numeric UID matters
Use numeric UIDs (USER 10001:10001) instead of usernames. The username may not exist in the runtime image (multi-stage builds drop /etc/passwd entries), but the numeric UID always resolves. Kubernetes runAsUser: 10001 matches exactly.
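A CI check can enforce the numeric-UID rule before an image is ever built. The following is a rough sketch, not a full Dockerfile parser: it ignores line continuations and any USER inherited from the base image, and the names `final_stage_user` and `is_numeric_nonroot` are hypothetical.

```python
from typing import Optional

def final_stage_user(dockerfile: str) -> Optional[str]:
    """Return the last USER value in the final build stage, or None if unset."""
    user = None
    for raw in dockerfile.splitlines():
        parts = raw.strip().split(None, 1)
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        instruction = parts[0].upper()
        if instruction == "FROM":
            user = None  # a new stage starts; earlier USER lines no longer apply
        elif instruction == "USER":
            user = parts[1].strip()
    return user

def is_numeric_nonroot(user: Optional[str]) -> bool:
    """True only for a numeric UID other than 0 (an optional :GID is ignored)."""
    if not user:
        return False
    uid = user.split(":", 1)[0]
    return uid.isdigit() and int(uid) != 0

sample = "\n".join([
    "FROM eclipse-temurin:21-jre-alpine",
    "RUN addgroup -g 10001 -S appgroup && adduser -u 10001 -S appuser -G appgroup",
    "USER 10001:10001",
    'ENTRYPOINT ["java", "-jar", "app.jar"]',
])
print(final_stage_user(sample), is_numeric_nonroot(final_stage_user(sample)))
```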
Runtime override: --user
# Override image USER at runtime
docker run --user 10001:10001 myapp
# Verify effective user inside container
docker exec mycontainer id
# uid=10001 gid=10001
chown at build time
Files created by COPY or RUN as root are owned by root. Use COPY --chown=appuser:appgroup or a RUN chown -R before switching USER. Writable paths (logs, caches, temp) must be owned by the runtime user—or mounted via volumes/tmpfs.
Read-only root filesystem + tmpfs
docker run -d \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid,size=128m \
--tmpfs /var/run:rw,noexec,nosuid,size=16m \
--user 10001:10001 \
--cap-drop ALL \
myapp
--read-only makes the container's root filesystem immutable, so attackers cannot drop binaries or modify configs. Mount tmpfs for paths the app must write. Use the noexec,nosuid mount options to block execution of payloads written to temp directories.
Why root is still dangerous
- Kernel escape amplification — root + CVE = host root in rootful mode
- Writable application dirs — root can modify binaries, cron, and startup scripts inside the container
- Package managers — apt install in production images adds attack surface at runtime
- Bind mount ownership — root in container may chown host-mounted files if CAP_CHOWN is present
- Compliance — CIS benchmarks recommend non-root containers, and PCI-DSS or SOC 2 auditors commonly expect them as a least-privilege control
Distroless and scratch images have no shell. Combine non-root USER with distroless bases so that an attacker who exploits the app finds no shell or package manager to work with. For debugging, use ephemeral debug containers (Kubernetes kubectl debug).
Switching USER too early breaks later build steps that need root, such as package installs. Pattern: do all root operations in early layers, chown, then set USER as the final instruction before ENTRYPOINT.
Secrets management
Image layers are immutable and inspectable. Any secret baked into a layer can be recovered with docker history or from a registry export, and it lives on in every backup of the image. Inject credentials at runtime, never at build time via ARG or ENV.
Never use ARG or ENV for secrets
# ❌ ANTI-PATTERN — secret persists in image history forever
ARG DB_PASSWORD
ENV DB_PASSWORD=${DB_PASSWORD}
RUN curl -u admin:${DB_PASSWORD} https://internal/api
# Anyone can recover it:
# docker history --no-trunc myapp:latest
ENV values appear in docker inspect, process listings (/proc/1/environ), and crash dumps. CI systems that pass --build-arg for npm tokens or API keys have leaked credentials in public Docker Hub images—scan your image history in every PR.
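Scanning Dockerfiles for secret-looking ARG and ENV names is easy to automate in a PR check. This is a heuristic sketch, assuming a simple name pattern: `secret_like_instructions` and the `SECRET_NAME` pattern are hypothetical, only the first variable on each line is inspected, and a dedicated scanner will catch far more.

```python
import re

SECRET_NAME = re.compile(
    r"(PASSWORD|PASSWD|SECRET|TOKEN|API_?KEY|PRIVATE_?KEY)", re.IGNORECASE
)

def secret_like_instructions(dockerfile: str) -> list[tuple[int, str]]:
    """Return (line number, variable name) for ARG/ENV lines that look like secrets."""
    hits = []
    for number, raw in enumerate(dockerfile.splitlines(), start=1):
        parts = raw.strip().split(None, 1)
        if len(parts) == 2 and parts[0].upper() in ("ARG", "ENV"):
            # The variable name is the text before "=" or the first whitespace.
            name = re.split(r"[=\s]", parts[1], maxsplit=1)[0]
            if SECRET_NAME.search(name):
                hits.append((number, name))
    return hits

sample = "\n".join([
    "ARG DB_PASSWORD",
    "ENV DB_PASSWORD=${DB_PASSWORD}",
    "ENV APP_PORT=8080",
    "RUN curl -u admin:${DB_PASSWORD} https://internal/api",
])
print(secret_like_instructions(sample))
```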
BuildKit secret mount (build-time only)
When a build must reach a private registry or API, use BuildKit secret mounts. Secrets are available only during the RUN line—they never commit to a layer.
# syntax=docker/dockerfile:1
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
npm ci --omit=dev
COPY . .
RUN npm run build
# Pass secret from file at build time (not stored in image)
DOCKER_BUILDKIT=1 docker build \
--secret id=npmrc,src=$HOME/.npmrc \
-t myapp .
Runtime secret delivery
| Mechanism | How it works | Best for |
|---|---|---|
| Docker Swarm secrets | Encrypted at rest, mounted as files in /run/secrets/ | Swarm deployments; file-based app config |
| HashiCorp Vault | Dynamic credentials with TTL; AppRole/K8s auth | Short-lived DB passwords, PKI certificates |
| AWS Secrets Manager | Central store with rotation Lambda; IAM-scoped access | ECS/EKS workloads on AWS; automatic rotation |
| Azure Key Vault / GCP SM | Cloud-native secret stores with audit logging | Multi-service cloud estates |
| Environment injection (orchestrator) | K8s Secrets / ECS task defs inject at start—not in image | 12-factor apps; pair with encryption at rest |
# Docker Compose — Swarm secrets (deployed stack)
services:
api:
image: myapp:latest
secrets:
- db_password
environment:
DB_PASSWORD_FILE: /run/secrets/db_password
secrets:
db_password:
external: true
Rotation
- Prefer short-lived credentials — Vault dynamic secrets or AWS SM rotation over static files
- Reload without rebuild — app reads secret file on interval or receives SIGHUP; no image rebuild for rotation
- Dual-write window — during rotation, accept both old and new credentials for a bounded period
- Audit access — log every secret read with principal identity; alert on anomalous patterns
- Revoke on incident — runbook to invalidate tokens, roll pods/containers, and scan registry for leaked layers
Never commit .env files to git. Add them to .dockerignore and .gitignore. Use git-secrets or trufflehog in CI to catch accidental credential commits before they reach the build context.
Read secrets from files rather than environment variables when possible. File-based injection (/run/secrets/db_password) avoids exposure in ps e and core dumps. Spring Boot can import a directory of secret files with spring.config.import=optional:configtree:/run/secrets/.
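For applications without framework support, file-based secret loading takes only a few lines. The sketch below follows the `<NAME>_FILE` convention used in the Compose example (DB_PASSWORD_FILE); `read_secret` is a hypothetical helper, and the demo at the bottom uses a temporary file as a stand-in for `/run/secrets`.

```python
import os
import tempfile
from pathlib import Path

def read_secret(name: str, default_dir: str = "/run/secrets") -> str:
    """Load a secret from the file named by <NAME>_FILE, else default_dir/<name>."""
    path = os.environ.get(f"{name.upper()}_FILE") or os.path.join(
        default_dir, name.lower()
    )
    # Strip only the trailing newline that editors and `echo` tend to add.
    return Path(path).read_text(encoding="utf-8").rstrip("\n")

# Demo with a temporary file standing in for /run/secrets/db_password.
with tempfile.TemporaryDirectory() as tmp:
    secret_file = Path(tmp) / "db_password"
    secret_file.write_text("example-value\n", encoding="utf-8")
    os.environ["DB_PASSWORD_FILE"] = str(secret_file)
    demo_value = read_secret("db_password")
    os.environ.pop("DB_PASSWORD_FILE")

print(len(demo_value))  # print the length, never the secret itself
```

Calling `read_secret` each time a connection is opened, rather than once at startup, lets a rotated file take effect without a restart.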
Network security
Default bridge networking allows containers to reach each other by IP. Without segmentation, one compromised container can scan and attack every co-located service. Treat container networks as zero-trust east-west paths.
Inter-container communication (ICC)
By default, containers on the same bridge can ping each other. Disable ICC on the docker0 bridge when containers do not need direct peer access (legacy standalone Docker hosts):
// /etc/docker/daemon.json
{
"icc": false
}
In Compose and Swarm, use multiple networks to segment tiers—frontend on web, API on backend, database only attached to backend.
services:
web:
networks: [frontend]
api:
networks: [frontend, backend]
db:
networks: [backend] # not reachable from web tier
networks:
frontend:
backend:
internal: true # no external routing
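The effect of this layout can be verified by computing which services share a network. The sketch below hard-codes the service-to-network mapping from the Compose file above instead of parsing YAML; `reachable_pairs` is a hypothetical helper.

```python
from itertools import combinations

def reachable_pairs(service_networks: dict[str, set[str]]) -> set[frozenset[str]]:
    """Pairs of services that share at least one network and can talk directly."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(service_networks), 2)
        if service_networks[a] & service_networks[b]
    }

# Mapping taken from the Compose example: web, api, db.
stack = {
    "web": {"frontend"},
    "api": {"frontend", "backend"},
    "db": {"backend"},
}

pairs = reachable_pairs(stack)
print(sorted(sorted(pair) for pair in pairs))
print(frozenset(("web", "db")) in pairs)
```

A check like this can run in CI to confirm that the database never ends up on a network shared with the web tier.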
Network segmentation patterns
| Pattern | Isolation level | When to use |
|---|---|---|
| Multi-bridge (Compose networks) | Logical tier separation | Default starting point for multi-service stacks |
| internal: true | No outbound internet on network | Databases, message brokers—no external egress needed |
| Overlay + encrypted | Cross-host with IPsec mesh (Swarm) | Multi-host Swarm clusters |
| Macvlan / IPvlan | Container on physical network segment | Legacy apps needing LAN IPs; strict firewall rules required |
| Service mesh (K8s) | mTLS + L7 policy per workload | Production Kubernetes—NetworkPolicy + Istio/Linkerd |
Bind to loopback, not 0.0.0.0
Applications listening on 0.0.0.0 accept connections from any container on the shared network. Configure apps to bind admin/metrics endpoints to 127.0.0.1, and expose them only through an authenticating reverse proxy that shares the container's network namespace (for example, a sidecar). Use -p 127.0.0.1:8080:8080 to limit host-side exposure during local dev.
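Publish specs can be linted for a missing host IP. The sketch below handles the common `[hostIP:][hostPort:]containerPort[/proto]` forms, including bracketed IPv6; `publish_host_ip` and `is_loopback_only` are hypothetical helpers, and port ranges or long-form Compose syntax are not covered.

```python
import ipaddress

def publish_host_ip(spec: str) -> str:
    """Return the host IP a -p spec binds to ('' means all interfaces)."""
    spec = spec.split("/", 1)[0]  # drop a trailing "/tcp" or "/udp"
    if spec.startswith("["):      # bracketed IPv6, e.g. [::1]:8080:80
        return spec[1:spec.index("]")]
    parts = spec.split(":")
    # Three fields means hostIP:hostPort:containerPort; fewer means no host IP.
    return parts[0] if len(parts) == 3 else ""

def is_loopback_only(spec: str) -> bool:
    """True when the published port is reachable only from the host itself."""
    host_ip = publish_host_ip(spec)
    return bool(host_ip) and ipaddress.ip_address(host_ip).is_loopback

for spec in ("127.0.0.1:8080:8080", "8080:80", "0.0.0.0:8080:80", "[::1]:8080:80"):
    print(spec, is_loopback_only(spec))
```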
No privileged or host networking in production
| Flag | Risk | Production stance |
|---|---|---|
| --net=host | Container shares host network stack—no namespace isolation; can bind host ports, sniff traffic | Ban in prod; use port publishing or ingress controller |
| --privileged | Full capabilities + device access + unconfined seccomp | Ban in prod; use specific cap-add if absolutely required |
| --pid=host | See and signal all host processes | Monitoring agents only; isolate with strict RBAC |
| --ipc=host | Shared memory with host and other containers | Legacy SHM apps only; document exception |
Published ports (-p 8080:80) bind to all host interfaces by default. On cloud VMs, security groups must restrict source IPs. Publish on 127.0.0.1 if only local access is needed. Never expose the Docker daemon's TCP port 2375, which is unencrypted and unauthenticated; if remote API access is required, use TLS with client certificates.
Docker's embedded DNS resolves container names on user-defined networks (the default bridge offers IP reachability but no name resolution). A compromised container can enumerate peers via DNS and port scanning. Segment networks so databases are unreachable from front-end containers that face users.
--net=host reduces NAT overhead for high-throughput networking (video, HFT). If you must use it, run on dedicated bare-metal nodes with no other tenants, strict SELinux, and automated compliance scanning.
Image hardening checklist
Gate every image before production. This checklist consolidates build-time, runtime, and operational controls. Use it in PR reviews, CI pipelines, and release sign-off.
| Control | Requirement | Status |
|---|---|---|
| ✅ Non-root user | USER with numeric UID ≥ 10000; no runtime root | Pass / Fail |
| ✅ Capabilities dropped | --cap-drop ALL; only explicit --cap-add for required caps | Pass / Fail |
| ✅ No privileged mode | --privileged absent from prod manifests | Pass / Fail |
| ✅ Seccomp enforced | Default profile or custom JSON—never unconfined | Pass / Fail |
| ✅ LSM active | AppArmor docker-default or SELinux container_t | Pass / Fail |
| ✅ Read-only rootfs | --read-only with tmpfs/volumes for writable paths | Pass / Fail |
| ✅ No new privileges | --security-opt no-new-privileges:true | Pass / Fail |
| ✅ Minimal base image | distroless, alpine, slim, or UBI micro; no full OS distribution | Pass / Fail |
| ✅ Multi-stage build | Build tools, compilers, and test deps excluded from final stage | Pass / Fail |
| ✅ Base image pinned | Tag + digest (@sha256:…)—not bare :latest | Pass / Fail |
| ✅ No secrets in layers | No ARG/ENV credentials; BuildKit secrets for build-time needs | Pass / Fail |
| ✅ .dockerignore present | Excludes .git, .env, keys, and build artifacts | Pass / Fail |
| ✅ Resource limits set | --memory, --cpus, --pids-limit configured | Pass / Fail |
| ✅ HEALTHCHECK defined | Dockerfile or orchestrator probe on real readiness path | Pass / Fail |
| ✅ Image scanned in CI | Trivy, Grype, or Docker Scout—no unresolved critical CVEs | Pass / Fail |
| ✅ Image signed | Cosign/Sigstore signature verified at deploy | Pass / Fail |
| ✅ SBOM generated | SPDX or CycloneDX attached for supply chain audit | Pass / Fail |
| ✅ No docker.sock mount | Socket not mounted into application containers | Pass / Fail |
| ✅ No sensitive host binds | /etc, /root, /var/run not mounted | Pass / Fail |
| ✅ Network segmented | DB on internal network; ICC restricted; no --net=host | Pass / Fail |
| ✅ Loopback bind for admin | Metrics/debug on 127.0.0.1—not 0.0.0.0 | Pass / Fail |
| ✅ Secrets at runtime | Vault, AWS SM, or Swarm/K8s secrets—rotatable without rebuild | Pass / Fail |
| ✅ Logging without secrets | App and sidecar logs redact tokens, passwords, PII | Pass / Fail |
| ✅ Incident runbook | Documented steps for CVE response, key rotation, compromised image | Pass / Fail |
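Some checklist rows lend themselves to simple static checks. As one example, the sketch below flags base images that are not pinned by digest. `unpinned_base_images` is a hypothetical helper, the digest in the sample is a placeholder rather than a real image digest, and FROM lines that use build arguments are not resolved.

```python
import re

DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")

def unpinned_base_images(dockerfile: str) -> list[str]:
    """Return FROM images that lack an @sha256 digest."""
    stage_names: set[str] = set()
    unpinned: list[str] = []
    for raw in dockerfile.splitlines():
        tokens = raw.split()
        if not tokens or tokens[0].upper() != "FROM":
            continue
        args = [t for t in tokens[1:] if not t.startswith("--")]  # skip --platform=...
        if not args:
            continue
        image = args[0]
        is_stage_ref = image.lower() in stage_names  # FROM <earlier stage>
        if image != "scratch" and not is_stage_ref and not DIGEST_SUFFIX.search(image):
            unpinned.append(image)
        if len(args) >= 3 and args[1].upper() == "AS":
            stage_names.add(args[2].lower())
    return unpinned

PLACEHOLDER_DIGEST = "sha256:" + "0" * 64  # not a real digest
sample = "\n".join([
    "FROM node:20-alpine AS build",
    "FROM build AS test",
    f"FROM gcr.io/distroless/nodejs20-debian12@{PLACEHOLDER_DIGEST}",
])
print(unpinned_base_images(sample))
```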
Quick validation commands
# Scan image for CVEs
trivy image --severity HIGH,CRITICAL myapp:latest
# Check image history for leaked secrets
docker history --no-trunc myapp:latest | grep -iE 'password|secret|key|token'
# Verify non-root and security opts at runtime
docker inspect mycontainer --format 'User={{.Config.User}} Privileged={{.HostConfig.Privileged}} ReadonlyRootfs={{.HostConfig.ReadonlyRootfs}} SecurityOpt={{.HostConfig.SecurityOpt}}'
# Lint Dockerfile before build
hadolint Dockerfile
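The inspect command above prints the relevant fields; turning them into pass/fail findings is a small step further. The sketch below assumes a dict parsed from one element of `docker inspect` JSON output. `runtime_findings` and both sample entries are hypothetical, and the thresholds (any memory limit, any PID limit) are deliberately minimal.

```python
def runtime_findings(inspect_entry: dict) -> list[str]:
    """Check one `docker inspect` entry against a few runtime checklist rows."""
    config = inspect_entry.get("Config") or {}
    host = inspect_entry.get("HostConfig") or {}
    findings = []
    user = (config.get("User") or "").split(":", 1)[0]
    if user in ("", "0", "root"):
        findings.append("runs as root")
    if host.get("Privileged"):
        findings.append("privileged mode enabled")
    if not host.get("ReadonlyRootfs"):
        findings.append("root filesystem is writable")
    if "ALL" not in [cap.upper() for cap in host.get("CapDrop") or []]:
        findings.append("capabilities not dropped with ALL")
    if not host.get("Memory"):            # 0 means unlimited
        findings.append("no memory limit")
    if (host.get("PidsLimit") or 0) <= 0:  # null, 0, or -1 mean unlimited
        findings.append("no PID limit")
    return findings

# Hypothetical entries for illustration.
hardened = {
    "Config": {"User": "10001:10001"},
    "HostConfig": {"Privileged": False, "ReadonlyRootfs": True,
                   "CapDrop": ["ALL"], "Memory": 268435456, "PidsLimit": 100},
}
defaults = {
    "Config": {"User": ""},
    "HostConfig": {"Privileged": False, "ReadonlyRootfs": False,
                   "CapDrop": None, "Memory": 0, "PidsLimit": None},
}
print(runtime_findings(hardened))
print(runtime_findings(defaults))
```

Feeding it the output of `docker inspect mycontainer` (parsed with `json.loads` and indexed at `[0]`) and exiting non-zero on any finding would make this usable as a deployment gate.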
Make the checklist a CI gate, not a wiki page. Fail builds on hadolint warnings, Trivy critical CVEs, and root USER in the final stage. Architect approval required for any checklist exception—with expiry date and compensating controls documented.
"How do you secure containers?" — Walk the layers: supply chain (pin, scan, sign), image (non-root, minimal), runtime (cap-drop, seccomp, read-only), secrets (runtime injection), network (segmentation). Mention that containers are not VMs and kernel sharing means defense in depth is mandatory.