Container Security

Containers share the host kernel—security is layered confinement, not perimeter magic. Every production workload needs a threat model, dropped capabilities, seccomp enforcement, non-root execution, secret hygiene, and network segmentation. Skip any layer and a single CVE becomes a host compromise.

Container security threat model

Before hardening flags, name what you are defending against. Containers are process isolation on a shared kernel: attackers target escape paths, privilege expansion, poisoned images, and lateral movement across flat networks.

Attack surface map

flowchart TB
  subgraph external [External threats]
    REG[Compromised registry image]
    API[Exposed container API]
  end
  subgraph container [Container boundary]
    APP[Application vulnerability]
    ESC[Kernel escape / breakout]
    PRIV[Privilege escalation inside container]
  end
  subgraph host [Host impact]
    SOCK[docker.sock access]
    HOST[Host filesystem / processes]
    LAT[Lateral movement to other containers]
  end
  REG --> APP
  API --> APP
  APP --> PRIV
  PRIV --> ESC
  ESC --> HOST
  SOCK --> HOST
  HOST --> LAT

Six threat categories

Threat | What happens | Example vector | Primary control
Container escape | Process breaks namespace/cgroup/seccomp confinement and acts on the host | Kernel CVE + --privileged container | Drop capabilities, default seccomp, patch kernel, no privileged mode
Privilege escalation | UID 0 or excessive capabilities inside container enable host-adjacent actions | Writable /etc/passwd, setuid binaries, CAP_SYS_ADMIN | Non-root USER, --cap-drop ALL, read-only rootfs
Supply chain | Malicious or vulnerable code enters via base image, dependency, or build pipeline | Typosquatted image on Docker Hub, compromised maintainer | Digest pinning, signing, SBOM, CI scanning, private registry
Secrets exposure | Credentials persist in image layers, env vars, logs, or inspect output | ENV DB_PASSWORD=... in Dockerfile | Runtime secret mounts, Vault/SM, BuildKit secrets, never ARG/ENV
Resource abuse | Container consumes all CPU/RAM/PIDs and destabilizes co-located workloads | Fork bomb, memory leak without limits | --memory, --cpus, --pids-limit
Lateral movement | Compromised container reaches other services or the Docker API | Flat bridge network + mounted docker.sock | Network segmentation, ICC disabled, no socket mounts, mTLS between services

Trust boundaries

  • Image → runtime — treat every pulled image as untrusted until scanned and signed
  • Container → host — assume kernel bugs exist; minimize capabilities that amplify exploits
  • Container → container — default bridge allows east-west traffic; segment explicitly
  • Build → registry — CI runners are high-value targets; isolate builders, rotate credentials
🔒 Security

Mounting /var/run/docker.sock equals giving the container root on the host. Any code inside can spawn privileged siblings, read all containers, and exfiltrate secrets from volumes. Ban this pattern in production—use dedicated APIs or operators instead.
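
One way to enforce the ban is to scan `docker inspect` output for socket mounts. The sketch below is illustrative only: the sample JSON is hypothetical and trimmed, and it assumes the socket lives at one of the two conventional paths.

```python
import json

# Conventional socket locations; adjust if the daemon uses another path.
SOCKET_PATHS = ("/var/run/docker.sock", "/run/docker.sock")

def socket_mounts(inspect_entry: dict) -> list[str]:
    """Return host paths of Docker socket mounts in one `docker inspect` entry."""
    found = []
    for mount in inspect_entry.get("Mounts") or []:
        source = mount.get("Source", "")
        if source in SOCKET_PATHS:
            found.append(source)
    return found

# Hypothetical, trimmed `docker inspect` output for illustration.
sample = json.loads("""
[{"Name": "/ci-agent",
  "Mounts": [{"Type": "bind", "Source": "/var/run/docker.sock",
              "Destination": "/var/run/docker.sock", "RW": true}]},
 {"Name": "/api", "Mounts": []}]
""")

offenders = [c["Name"] for c in sample if socket_mounts(c)]
print(offenders)  # ['/ci-agent']
```

In practice the input would come from `docker inspect $(docker ps -q)` rather than an inline string.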

🎯 Interview Tip

"How is container security different from VM security?" — VMs have a hypervisor boundary and separate kernels. Containers rely on kernel primitives (namespaces, cgroups, capabilities, seccomp, LSM). Defense in depth means assuming any single layer can fail.

Linux capabilities

Traditional Unix root is all-or-nothing. Linux capabilities split root powers into granular privileges. Docker drops most capabilities by default—but the remaining set is still dangerous if an app is compromised.

Default: what Docker drops vs keeps

Docker starts from the full capability set, then drops everything not in the allowlist. The default container retains only these capabilities (14 on most kernels):

Kept by default | Risk if abused
CAP_CHOWN, CAP_FOWNER | Change file ownership inside container; useful for volume permission fixes
CAP_DAC_OVERRIDE | Bypass file permission checks; dangerous on writable mounts
CAP_SETUID, CAP_SETGID, CAP_SETPCAP | Enable privilege escalation via setuid binaries
CAP_NET_BIND_SERVICE | Bind ports < 1024 without root; needed for nginx on port 80
CAP_NET_RAW | Raw sockets; enables packet sniffing and ARP spoofing inside network namespace
CAP_KILL | Send signals to any process in the same PID namespace
CAP_SYS_CHROOT | Change root directory; needed by some legacy installers
CAP_AUDIT_WRITE, CAP_MKNOD, CAP_FSETID, CAP_SETFCAP | Lower risk in typical app containers; still unnecessary for many workloads

Dropped by default (never available unless explicitly added):

  • CAP_SYS_ADMIN: mount filesystems and perform many namespace operations; primary enabler of container escapes
  • CAP_NET_ADMIN: configure interfaces, iptables, routing tables
  • CAP_SYS_PTRACE: debug/trace any process, including reading memory of other containers on a shared PID namespace
  • CAP_SYS_MODULE, CAP_BPF, CAP_PERFMON: kernel module loading, introspection, and modification
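
To check which capabilities a running process actually holds, the `CapEff` hex mask from `/proc/<pid>/status` can be decoded by bit position. The sketch below is a minimal decoder under the assumption that the bit numbers match `linux/capability.h`; only the capabilities discussed here are named, and anything else is reported as `bit_N`.

```python
# Capability bit numbers from linux/capability.h (subset relevant to this section).
CAP_BITS = {
    0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 3: "CAP_FOWNER", 4: "CAP_FSETID",
    5: "CAP_KILL", 6: "CAP_SETGID", 7: "CAP_SETUID", 8: "CAP_SETPCAP",
    10: "CAP_NET_BIND_SERVICE", 12: "CAP_NET_ADMIN", 13: "CAP_NET_RAW",
    16: "CAP_SYS_MODULE", 18: "CAP_SYS_CHROOT", 19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN", 27: "CAP_MKNOD", 29: "CAP_AUDIT_WRITE",
    31: "CAP_SETFCAP", 38: "CAP_PERFMON", 39: "CAP_BPF",
}

def decode_caps(hex_mask: str) -> list[str]:
    """Translate a CapEff/CapBnd hex string into capability names."""
    mask = int(hex_mask, 16)
    names = []
    for bit in range(mask.bit_length()):
        if mask & (1 << bit):
            names.append(CAP_BITS.get(bit, f"bit_{bit}"))
    return names

# a80425fb is the mask that corresponds to the 14 defaults listed above.
default_caps = decode_caps("00000000a80425fb")
print(len(default_caps), "CAP_SYS_ADMIN" in default_caps)  # 14 False
```

Feeding it the value from `grep CapEff /proc/1/status` inside a container gives a quick view of the effective set, which `docker inspect` alone does not show.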

Production pattern: drop all, add minimum

bash
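# Web server binding port 80 as non-root on a read-only root filesystem.
# The stock nginx image also needs writable /var/cache/nginx and /var/run
# (PID file), so both are mounted as tmpfs here.
docker run -d \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --user 1000:1000 \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --tmpfs /var/cache/nginx:rw,noexec,nosuid,size=64m \
  --tmpfs /var/run:rw,noexec,nosuid,size=1m \
  -p 8080:80 \
  nginx:alpine

# Show the configured capability changes (not the effective set)
docker inspect --format '{{.HostConfig.CapDrop}} {{.HostConfig.CapAdd}}' mycontainer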

The --privileged flag

--privileged grants all capabilities, disables seccomp and AppArmor confinement, and gives access to all host devices. It exists for Docker-in-Docker and hardware debugging, not production apps. The CIS Docker Benchmark recommends that privileged containers not be used.

Inspect capabilities with capsh

terminal: capsh (assumes an image that ships capsh; the stock alpine image does not include it)
$ docker run --rm alpine capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
         cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
         cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
Bounding set =cap_chown,cap_dac_override,...

$ docker run --rm --cap-drop ALL --cap-add NET_BIND_SERVICE alpine capsh --print
Current: cap_net_bind_service=ep
Bounding set =cap_net_bind_service
🔒 Security

CAP_NET_RAW is rarely needed in production. Drop it unless your app genuinely requires raw sockets (e.g. ping utilities). Combined with a bridge network, it enables traffic interception between co-located containers.

⚠️ Pitfall

Adding --cap-add SYS_ADMIN to "fix" mount issues is an anti-pattern. It is equivalent to handing an attacker the keys to the host kernel. Fix permissions with volumes, init containers, or proper USER/chown instead.

💡 Pro Tip

In Kubernetes, mirror Docker's cap-drop pattern with securityContext.capabilities.drop: [ALL] and an explicit add list. The restricted Pod Security Standard requires dropping ALL (permitting only NET_BIND_SERVICE to be added back) in every namespace where it is enforced.

Seccomp profiles

Seccomp (secure computing mode) filters syscalls at the kernel boundary. Even with dropped capabilities, dangerous syscalls like mount, ptrace, or keyctl can enable escapes. Docker's default profile blocks ~44 syscalls.

Default Docker seccomp profile

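Unless overridden, Docker applies its default seccomp profile, which is built into dockerd (the reference copy is profiles/seccomp/default.json in the moby source tree). The profile is an allowlist: its defaultAction is SCMP_ACT_ERRNO, so any syscall without an explicit allow rule fails with an error while normal application syscalls pass.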

Blocked syscall (sample) | Why it is blocked
mount, umount2, pivot_root | Filesystem manipulation; escape via mounting host paths
ptrace (on kernels older than 4.8) | Debug other processes; read secrets from co-located workloads
reboot, kexec_load | Host lifecycle control
swapon, swapoff | Modify host swap configuration
init_module, finit_module | Load kernel modules
acct, settimeofday | System accounting and clock changes
clone with unsafe flags | Prevent namespace creation that bypasses isolation

The full default profile blocks approximately 44 syscalls (plus flag-conditioned rules for clone, socket, and arch_prctl). Architectures may vary slightly.

Custom seccomp JSON

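When an application legitimately needs a blocked syscall (rare), copy the default profile and add an explicit allow rule for that syscall. Docker profiles have no inheritance mechanism, and with defaultAction set to SCMP_ACT_ERRNO every syscall not listed is denied, so never start from an empty allowlist. The fragment below shows only the added rule; a usable profile keeps the default profile's existing allow rules alongside it.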

json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": ["personality"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
bash
# Apply custom profile
docker run --rm --security-opt seccomp=/path/to/custom-seccomp.json myapp

# Show the security options configured for a container (custom seccomp, AppArmor, labels)
docker inspect --format '{{.HostConfig.SecurityOpt}}' mycontainer
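
The copy-then-extend approach can be scripted so the custom profile never drifts into an empty allowlist. The following is a rough sketch: `with_extra_allow` is a hypothetical helper name, and `base` is a heavily trimmed stand-in for the real default profile, which would normally be loaded from a saved copy of the default JSON.

```python
import copy
import json

def with_extra_allow(base_profile: dict, syscall_names: list[str]) -> dict:
    """Return a copy of base_profile with one extra SCMP_ACT_ALLOW rule appended."""
    profile = copy.deepcopy(base_profile)  # leave the original untouched
    profile.setdefault("syscalls", []).append(
        {"names": sorted(syscall_names), "action": "SCMP_ACT_ALLOW"}
    )
    return profile

# Stand-in for the real default profile (illustration only).
base = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"}
    ],
}

custom = with_extra_allow(base, ["personality"])
print(json.dumps(custom, indent=2))
```

Writing `custom` to a file gives a profile suitable for `--security-opt seccomp=...` that still carries every rule of the base it was derived from.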

seccomp=unconfined — the danger

--security-opt seccomp=unconfined disables syscall filtering entirely. Combined with any kernel vulnerability, the container has maximum syscall reach. --privileged implies unconfined seccomp. Some legacy apps request unconfined during migration—treat it as technical debt with a sunset date.

🔒 Security

Never run seccomp=unconfined in production. If a container fails with "operation not permitted," identify the specific blocked syscall with strace, then add a surgical allow rule—do not disable the entire filter.
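
Finding the blocked syscall in a long strace log is easy to automate. The sketch below assumes strace's usual line format (`name(args) = -1 EPERM (...)`, optionally prefixed by a PID when `-f` is used); the sample log is made up for illustration.

```python
import re
from collections import Counter

# Matches e.g.:  mount("none", "/mnt", "tmpfs", 0, NULL) = -1 EPERM (Operation not permitted)
# An optional "[pid N] " or bare PID prefix (strace -f) is tolerated.
EPERM_LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*\)\s+=\s+-1\s+EPERM\b"
)

def denied_syscalls(strace_output: str) -> Counter:
    """Count syscalls that returned EPERM in strace output."""
    counts = Counter()
    for line in strace_output.splitlines():
        match = EPERM_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical strace excerpt.
sample = """\
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
mount("none", "/mnt", "tmpfs", 0, NULL) = -1 EPERM (Operation not permitted)
[pid 42] keyctl(KEYCTL_GET_KEYRING_ID, KEY_SPEC_SESSION_KEYRING, 0) = -1 EPERM (Operation not permitted)
"""
denied = dict(denied_syscalls(sample))
print(denied)  # {'mount': 1, 'keyctl': 1}
```

EPERM can also come from missing capabilities or LSM policy, so each hit is a candidate to investigate, not proof that seccomp was the cause.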

🔬 Under the Hood

Seccomp profiles live in the OCI config.json under linux.seccomp. runc installs the filter before exec. You can diff intended vs actual policy by inspecting the bundle at /run/containerd/io.containerd.runtime.v2.task/.

AppArmor & SELinux

Capabilities and seccomp limit what a process can do. Linux Security Modules (LSMs) such as AppArmor and SELinux limit what a process can access: policy decides which files, sockets, and capabilities each confined process may use.

AppArmor (Ubuntu, Debian, SUSE)

Docker automatically applies the docker-default AppArmor profile on supported hosts. It restricts mount operations, raw network access, /proc writes, and ptrace across containers.

bash
# Check AppArmor profile on a running container
docker inspect --format '{{.AppArmorProfile}}' mycontainer
# Expected: docker-default (or custom profile name)

# Explicitly set profile
docker run --security-opt apparmor=docker-default nginx:alpine

# Disable AppArmor (DANGEROUS — debugging only)
docker run --security-opt apparmor=unconfined alpine

SELinux (RHEL, Fedora, CentOS)

On SELinux-enforcing hosts, Docker labels container processes with container_t and container content with container_file_t. Bind mounts keep their host label unless you relabel them with the :z or :Z volume flags.

Volume suffix | Behavior | Use when
:z | Shared content label; multiple containers can read/write | Shared data directories across containers on same host
:Z | Private content label; exclusive to one container | Dedicated volume per container (preferred default)
bash
# SELinux: private volume label
docker run -v /data/app:/app:Z myapp

# Custom SELinux level (multi-tenant MLS environments)
docker run --security-opt label=level:s0:c100,c200 myapp

# Check process context
docker exec mycontainer cat /proc/1/attr/current
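
The context string printed by the last command has the form user:role:type:level, and the level may itself contain a colon. A small parser, sketched here with a hypothetical `parse_context` helper, makes it easy to assert on the type and MCS level in a compliance check.

```python
from typing import NamedTuple

class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str

def parse_context(raw: str) -> SELinuxContext:
    """Split an SELinux context into its four fields."""
    # /proc/<pid>/attr/current may end with a NUL byte or a newline.
    cleaned = raw.strip("\x00\n ")
    # The level can contain ':' (s0:c100,c200), so split at most three times.
    user, role, type_, level = cleaned.split(":", 3)
    return SELinuxContext(user, role, type_, level)

ctx = parse_context("system_u:system_r:container_t:s0:c100,c200\x00")
print(ctx.type, ctx.level)  # container_t s0:c100,c200
```

On hosts without SELinux the file holds something else (for example an AppArmor profile name), and the split raises ValueError; callers should treat that as "not SELinux-confined".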

--security-opt reference

Option | Mechanism | Production guidance
apparmor=docker-default | AppArmor | Default on Ubuntu; keep it and do not set unconfined
label=type:container_t | SELinux | Default container type on RHEL; verify with ps -eZ
seccomp=/path/to/profile.json | Seccomp | Custom profile; omitting the flag applies the built-in default
no-new-privileges:true | Kernel no_new_privs flag (independent of the LSM) | Prevents setuid escalation; pair with non-root USER
🔒 Security

apparmor=unconfined and label=disable remove mandatory access control. Only use during local debugging on disposable VMs. Production containers should run with enforcing LSM and no-new-privileges:true.

📦 Real World

OpenShift runs containers with restricted SCCs that enforce SELinux container_t, dropped capabilities, and non-root UIDs by default. If your image works on plain Docker but fails on OpenShift, check USER and volume :Z labels first.

Running as non-root

Container UID 0 maps to host root in rootful Docker. Even with namespaces, a kernel escape while running as root inside the container gives the attacker maximum leverage. Run as an unprivileged numeric UID everywhere.

Dockerfile: USER instruction

dockerfile
FROM eclipse-temurin:21-jre-alpine

# Create dedicated group and user (UID/GID >= 10000)
RUN addgroup -g 10001 -S appgroup && \
    adduser -u 10001 -S appuser -G appgroup

WORKDIR /app
COPY --chown=appuser:appgroup target/app.jar app.jar

USER 10001:10001
ENTRYPOINT ["java", "-jar", "app.jar"]

Why numeric UID matters

Use numeric UIDs (USER 10001:10001) instead of usernames. The username may not exist in the runtime image (multi-stage builds drop /etc/passwd entries), but the numeric UID always resolves. Kubernetes runAsUser: 10001 matches exactly.
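
A CI check for this rule can be approximated by reading the Dockerfile and looking at the last USER instruction of the final stage. This is a simplified sketch (it ignores line continuations and ARG substitution), and `final_user` and `is_numeric_non_root` are hypothetical helper names.

```python
import re

def final_user(dockerfile_text: str):
    """Return the USER value set in the last build stage, or None if unset."""
    user = None
    for raw_line in dockerfile_text.splitlines():
        line = raw_line.strip()
        if re.match(r"FROM\s", line, re.IGNORECASE):
            user = None  # new stage: its user comes from the base image (unknown here)
            continue
        match = re.match(r"USER\s+(\S+)", line, re.IGNORECASE)
        if match:
            user = match.group(1)
    return user

def is_numeric_non_root(user) -> bool:
    """True only for a numeric UID other than 0 (optionally followed by :gid)."""
    if not user:
        return False
    uid = user.split(":", 1)[0]
    return uid.isdigit() and int(uid) != 0

dockerfile = """\
FROM eclipse-temurin:21-jdk-alpine AS build
USER root
FROM eclipse-temurin:21-jre-alpine
USER 10001:10001
"""
print(final_user(dockerfile), is_numeric_non_root(final_user(dockerfile)))
# 10001:10001 True
```

Treating an unset USER as a failure is a deliberate choice: the base image's user is not visible from the Dockerfile alone.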

Runtime override: --user

bash
# Override image USER at runtime
docker run --user 10001:10001 myapp

# Verify effective user inside container
docker exec mycontainer id
# uid=10001 gid=10001

chown at build time

Files created by COPY or RUN as root are owned by root. Use COPY --chown=appuser:appgroup or a RUN chown -R before switching USER. Writable paths (logs, caches, temp) must be owned by the runtime user—or mounted via volumes/tmpfs.

Read-only root filesystem + tmpfs

bash
docker run -d \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=128m \
  --tmpfs /var/run:rw,noexec,nosuid,size=16m \
  --user 10001:10001 \
  --cap-drop ALL \
  myapp

--read-only makes the container's root filesystem immutable, so attackers cannot drop binaries or modify configs there. Mount tmpfs for paths the app must write, and use the noexec,nosuid mount options so payloads written to temp directories cannot be executed.

Why root is still dangerous

  • Kernel escape amplification: root + CVE = host root in rootful mode
  • Writable application dirs: root can modify binaries, cron, and startup scripts inside the container
  • Package managers: running apt install in production images adds attack surface at runtime
  • Bind mount ownership: root in container may chown host-mounted files if CAP_CHOWN is present
  • Compliance: CIS benchmarks recommend non-root containers, and PCI-DSS and SOC 2 assessments commonly expect the same as evidence of least privilege
🔒 Security

Distroless and scratch images have no shell. Combine non-root USER with distroless bases so that an attacker who exploits the app has no shell or package manager to pivot with. Debugging uses ephemeral debug containers (Kubernetes kubectl debug).

⚠️ Pitfall

Switching to USER app before RUN steps that install packages breaks the build, because those steps need root. Pattern: do all root operations in early layers, chown, then make USER the final instruction before ENTRYPOINT.

Secrets management

Image layers are immutable and inspectable. Any secret baked into a layer remains recoverable through docker history, registry exports, and backups, even if a later layer deletes it. Inject credentials at runtime, never at build time via ARG or ENV.

Never use ARG or ENV for secrets

dockerfile
# ❌ ANTI-PATTERN — secret persists in image history forever
ARG DB_PASSWORD
ENV DB_PASSWORD=${DB_PASSWORD}
RUN curl -u admin:${DB_PASSWORD} https://internal/api

# Anyone can recover it:
# docker history --no-trunc myapp:latest
🔒 Security

ENV values appear in docker inspect, process listings (/proc/1/environ), and crash dumps. CI systems that pass --build-arg for npm tokens or API keys have leaked credentials in public Docker Hub images—scan your image history in every PR.
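
A cheap first-pass check is to look at the names of the variables in an image's `Config.Env` list (as printed by `docker inspect`). The sketch below flags credential-looking names with non-empty values; the name pattern is an assumption and will miss secrets stored under innocuous names.

```python
import re

# Heuristic name pattern; extend it for your own naming conventions.
SECRET_NAME = re.compile(r"(password|passwd|secret|token|api_?key|private_?key)",
                         re.IGNORECASE)

def suspicious_env(env_entries: list[str]) -> list[str]:
    """Return keys of KEY=value entries that look like embedded credentials."""
    flagged = []
    for entry in env_entries:
        key, _, value = entry.partition("=")
        # *_FILE variables point at a mounted secret file and hold no secret themselves.
        if key.upper().endswith("_FILE"):
            continue
        if value and SECRET_NAME.search(key):
            flagged.append(key)
    return flagged

# Hypothetical Config.Env content.
env = [
    "PATH=/usr/local/bin:/usr/bin",
    "DB_PASSWORD=hunter2",
    "DB_PASSWORD_FILE=/run/secrets/db_password",
    "NPM_TOKEN=abc123",
]
flagged = suspicious_env(env)
print(flagged)  # ['DB_PASSWORD', 'NPM_TOKEN']
```

A dedicated scanner remains the better gate; this only catches the obvious cases early.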

BuildKit secret mount (build-time only)

When a build must reach a private registry or API, use BuildKit secret mounts. Secrets are available only during the RUN line—they never commit to a layer.

dockerfile
# syntax=docker/dockerfile:1
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci --omit=dev
COPY . .
RUN npm run build
bash
# Pass secret from file at build time (not stored in image)
DOCKER_BUILDKIT=1 docker build \
  --secret id=npmrc,src=$HOME/.npmrc \
  -t myapp .

Runtime secret delivery

Mechanism | How it works | Best for
Docker Swarm secrets | Encrypted at rest, mounted as files in /run/secrets/ | Swarm deployments; file-based app config
HashiCorp Vault | Dynamic credentials with TTL; AppRole/K8s auth | Short-lived DB passwords, PKI certificates
AWS Secrets Manager | Central store with rotation Lambda; IAM-scoped access | ECS/EKS workloads on AWS; automatic rotation
Azure Key Vault / GCP SM | Cloud-native secret stores with audit logging | Multi-service cloud estates
Environment injection (orchestrator) | K8s Secrets / ECS task defs inject at start, not in image | 12-factor apps; pair with encryption at rest
yaml
# Docker Compose — Swarm secrets (deployed stack)
services:
  api:
    image: myapp:latest
    secrets:
      - db_password
    environment:
      DB_PASSWORD_FILE: /run/secrets/db_password

secrets:
  db_password:
    external: true

Rotation

  1. Prefer short-lived credentials — Vault dynamic secrets or AWS SM rotation over static files
  2. Reload without rebuild — app reads secret file on interval or receives SIGHUP; no image rebuild for rotation
  3. Dual-write window — during rotation, accept both old and new credentials for a bounded period
  4. Audit access — log every secret read with principal identity; alert on anomalous patterns
  5. Revoke on incident — runbook to invalidate tokens, roll pods/containers, and scan registry for leaked layers
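
The dual-write window in step 3 can be sketched as a verifier that always accepts the current secret and accepts the previous one only until a deadline. `RotatingCredential` and the 300-second default are illustrative assumptions, not a prescribed design.

```python
import hmac
import time
from typing import Optional

class RotatingCredential:
    """Accepts the current secret always, the previous one until a deadline."""

    def __init__(self, current: str, grace_seconds: float = 300.0):
        self.current = current
        self.previous: Optional[str] = None
        self.previous_expires_at = 0.0
        self.grace_seconds = grace_seconds

    def rotate(self, new_secret: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.previous, self.current = self.current, new_secret
        self.previous_expires_at = now + self.grace_seconds

    def verify(self, candidate: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # compare_digest avoids leaking match position through timing.
        if hmac.compare_digest(candidate.encode(), self.current.encode()):
            return True
        in_window = self.previous is not None and now < self.previous_expires_at
        return in_window and hmac.compare_digest(
            candidate.encode(), self.previous.encode()
        )

cred = RotatingCredential("old", grace_seconds=60)
cred.rotate("new", now=1000.0)
print(cred.verify("old", now=1030.0), cred.verify("old", now=1061.0))  # True False
```

Passing `now` explicitly keeps the window logic testable without sleeping.
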
🔒 Security

Never commit .env files to git. Add them to .dockerignore and .gitignore. Use git-secrets or trufflehog in CI to catch accidental credential commits before they reach the build context.

💡 Pro Tip

Read secrets from files rather than environment variables when possible: file-based injection (/run/secrets/db_password) avoids exposure in ps e and core dumps. Spring Boot (2.4 and later) can load such a directory with spring.config.import=optional:configtree:/run/secrets/.
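
For applications without framework support, the `*_FILE` convention is simple to implement by hand. This is a minimal sketch, with `read_secret` as a hypothetical helper: it prefers `NAME_FILE` and falls back to `NAME` only if no file path is configured.

```python
import os
import tempfile
from pathlib import Path

def read_secret(name: str, env=os.environ) -> str:
    """Resolve a secret from NAME_FILE (preferred) or NAME, in that order."""
    file_path = env.get(f"{name}_FILE")
    if file_path:
        # Secret files often end with a trailing newline; strip it.
        return Path(file_path).read_text(encoding="utf-8").rstrip("\r\n")
    value = env.get(name)
    if value is None:
        raise KeyError(f"neither {name}_FILE nor {name} is set")
    return value

# Demonstration with a temporary file standing in for /run/secrets/db_password.
with tempfile.TemporaryDirectory() as tmp:
    secret_file = Path(tmp) / "db_password"
    secret_file.write_text("s3cret\n", encoding="utf-8")
    demo = read_secret("DB_PASSWORD", env={"DB_PASSWORD_FILE": str(secret_file)})

print(demo == "s3cret")  # True
```

Re-reading the file on each use (or on SIGHUP) lets rotation take effect without a restart.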

Network security

Default bridge networking allows containers to reach each other by IP. Without segmentation, one compromised container can scan and attack every co-located service. Treat container networks as zero-trust east-west paths.

Inter-container communication (ICC)

By default, containers on the same bridge can ping each other. On legacy standalone Docker hosts, disable ICC on the docker0 bridge in /etc/docker/daemon.json when containers do not need direct peer access (JSON does not allow comments, so the file path is given here rather than inside the file):

json
{
  "icc": false
}

In Compose and Swarm, use multiple networks to segment tiers—frontend on web, API on backend, database only attached to backend.

yaml
services:
  web:
    networks: [frontend]
  api:
    networks: [frontend, backend]
  db:
    networks: [backend]   # not reachable from web tier

networks:
  frontend:
  backend:
    internal: true       # no external routing
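
Whether a topology like this actually isolates the database can be checked mechanically: two services can talk directly only if they share a network. The sketch below hard-codes the service-to-network mapping from the example; a real check would read it from the Compose file.

```python
from itertools import combinations

def reachable_pairs(service_networks: dict[str, set[str]]) -> set[frozenset[str]]:
    """Return the pairs of services that share at least one network."""
    pairs = set()
    for a, b in combinations(sorted(service_networks), 2):
        if service_networks[a] & service_networks[b]:
            pairs.add(frozenset((a, b)))
    return pairs

# Mapping taken from the Compose example above.
topology = {
    "web": {"frontend"},
    "api": {"frontend", "backend"},
    "db": {"backend"},
}
pairs = reachable_pairs(topology)
print(frozenset(("web", "db")) in pairs)  # False
print(frozenset(("api", "db")) in pairs)  # True
```

This models direct layer-3 reachability only; a compromised api container can still relay traffic, which is why per-service authentication remains necessary.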

Network segmentation patterns

Pattern | Isolation level | When to use
Multi-bridge (Compose networks) | Logical tier separation | Default starting point for multi-service stacks
internal: true | No outbound internet on network | Databases, message brokers; no external egress needed
Overlay + encrypted | Cross-host with IPsec mesh (Swarm) | Multi-host Swarm clusters
Macvlan / IPvlan | Container on physical network segment | Legacy apps needing LAN IPs; strict firewall rules required
Service mesh (K8s) | mTLS + L7 policy per workload | Production Kubernetes; NetworkPolicy + Istio/Linkerd

Bind to loopback, not 0.0.0.0

Applications listening on 0.0.0.0 accept connections from any container on the shared network. Configure apps to bind 127.0.0.1 for admin/metrics endpoints, and expose them only through an authenticating reverse proxy (which must share the container's network namespace, for example as a sidecar, to reach a loopback listener). Use -p 127.0.0.1:8080:8080 to limit host-side exposure during local dev.
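
Host-side exposure can be audited from the `HostConfig.PortBindings` map in `docker inspect` output, where an empty `HostIp` means all interfaces. The following sketch uses a hypothetical `publicly_bound` helper and made-up sample bindings.

```python
import ipaddress

def publicly_bound(port_bindings: dict) -> list[str]:
    """List published ports whose host address is not a loopback address."""
    exposed = []
    for container_port, bindings in (port_bindings or {}).items():
        for binding in bindings or []:
            host_ip = binding.get("HostIp", "")
            # An empty HostIp means the port is published on every interface.
            if not host_ip or not ipaddress.ip_address(host_ip).is_loopback:
                exposed.append(
                    f"{container_port} -> {host_ip or '0.0.0.0'}:{binding.get('HostPort', '')}"
                )
    return exposed

# Hypothetical HostConfig.PortBindings value.
bindings = {
    "8080/tcp": [{"HostIp": "127.0.0.1", "HostPort": "8080"}],
    "80/tcp": [{"HostIp": "", "HostPort": "8080"}],
}
exposed = publicly_bound(bindings)
print(exposed)  # ['80/tcp -> 0.0.0.0:8080']
```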

No privileged or host networking in production

Flag | Risk | Production stance
--net=host | Container shares host network stack with no namespace isolation; can bind host ports, sniff traffic | Ban in prod; use port publishing or ingress controller
--privileged | Full capabilities + device access + unconfined seccomp | Ban in prod; use specific cap-add if absolutely required
--pid=host | See and signal all host processes | Monitoring agents only; isolate with strict RBAC
--ipc=host | Shared memory with host and other containers | Legacy SHM apps only; document exception
🔒 Security

Published ports (-p 8080:80) bind to all host interfaces by default. On cloud VMs, security groups must restrict source IPs. Bind to 127.0.0.1 if only local access is needed. Never expose the Docker daemon's unencrypted TCP port (2375); if remote access is required, use TLS with client certificates (conventionally port 2376) or SSH.

🔒 Security

Docker's embedded DNS resolves container names on user-defined networks (the default bridge does not offer name resolution). A compromised container can enumerate peers via DNS and port scanning. Segment networks so databases are unreachable from front-end containers that face users.

⚖️ Trade-off

--net=host reduces NAT overhead for high-throughput networking (video, HFT). If you must use it, run on dedicated bare-metal nodes with no other tenants, strict SELinux, and automated compliance scanning.

Image hardening checklist

Gate every image before production. This checklist consolidates build-time, runtime, and operational controls; use it in PR reviews, CI pipelines, and release sign-off.

Control | Requirement | Status
✅ Non-root user | USER with numeric UID ≥ 10000; no runtime root | Pass / Fail
✅ Capabilities dropped | --cap-drop ALL; only explicit --cap-add for required caps | Pass / Fail
✅ No privileged mode | --privileged absent from prod manifests | Pass / Fail
✅ Seccomp enforced | Default profile or custom JSON, never unconfined | Pass / Fail
✅ LSM active | AppArmor docker-default or SELinux container_t | Pass / Fail
✅ Read-only rootfs | --read-only with tmpfs/volumes for writable paths | Pass / Fail
✅ No new privileges | --security-opt no-new-privileges:true | Pass / Fail
✅ Minimal base image | distroless, alpine, slim, or UBI micro; no full OS desktop | Pass / Fail
✅ Multi-stage build | Build tools, compilers, and test deps excluded from final stage | Pass / Fail
✅ Base image pinned | Tag + digest (@sha256:…), not bare :latest | Pass / Fail
✅ No secrets in layers | No ARG/ENV credentials; BuildKit secrets for build-time needs | Pass / Fail
✅ .dockerignore present | Excludes .git, .env, keys, and build artifacts | Pass / Fail
✅ Resource limits set | --memory, --cpus, --pids-limit configured | Pass / Fail
✅ HEALTHCHECK defined | Dockerfile or orchestrator probe on real readiness path | Pass / Fail
✅ Image scanned in CI | Trivy, Grype, or Docker Scout; no unresolved critical CVEs | Pass / Fail
✅ Image signed | Cosign/Sigstore signature verified at deploy | Pass / Fail
✅ SBOM generated | SPDX or CycloneDX attached for supply chain audit | Pass / Fail
✅ No docker.sock mount | Socket not mounted into application containers | Pass / Fail
✅ No sensitive host binds | /etc, /root, /var/run not mounted | Pass / Fail
✅ Network segmented | DB on internal network; ICC restricted; no --net=host | Pass / Fail
✅ Loopback bind for admin | Metrics/debug on 127.0.0.1, not 0.0.0.0 | Pass / Fail
✅ Secrets at runtime | Vault, AWS SM, or Swarm/K8s secrets; rotatable without rebuild | Pass / Fail
✅ Logging without secrets | App and sidecar logs redact tokens, passwords, PII | Pass / Fail
✅ Incident runbook | Documented steps for CVE response, key rotation, compromised image | Pass / Fail
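
The "Base image pinned" row lends itself to an automated check on FROM lines. The sketch below is a simplified parser (no ARG expansion or line continuations), `unpinned_bases` is a hypothetical helper, and the digest in the sample is a placeholder, not a real image digest.

```python
import re

DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")

def unpinned_bases(dockerfile_text: str) -> list[str]:
    """Return FROM image references that lack a sha256 digest."""
    stage_names = set()
    unpinned = []
    for raw_line in dockerfile_text.splitlines():
        parts = raw_line.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Skip flags such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        is_stage_ref = image in stage_names  # FROM <earlier stage>
        if len(args) >= 3 and args[1].upper() == "AS":
            stage_names.add(args[2])
        if image == "scratch" or is_stage_ref:
            continue
        if not DIGEST.search(image):
            unpinned.append(image)
    return unpinned

fake_digest = "0" * 64  # placeholder digest for illustration
dockerfile = f"""\
FROM node:20-alpine AS build
FROM build AS test
FROM gcr.io/distroless/nodejs20-debian12@sha256:{fake_digest}
"""
result = unpinned_bases(dockerfile)
print(result)  # ['node:20-alpine']
```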

Quick validation commands

bash
# Scan image for CVEs
trivy image --severity HIGH,CRITICAL myapp:latest

# Check image history for leaked secrets
docker history --no-trunc myapp:latest | grep -iE 'password|secret|key|token'

# Verify non-root and security opts at runtime
docker inspect mycontainer --format 'User={{.Config.User}} Privileged={{.HostConfig.Privileged}} ReadonlyRootfs={{.HostConfig.ReadonlyRootfs}} SecurityOpt={{.HostConfig.SecurityOpt}}'

# Lint Dockerfile before build
hadolint Dockerfile
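
Several runtime rows of the checklist can be evaluated from a single `docker inspect` entry. This is a rough sketch rather than a complete policy engine: `audit_container` is a hypothetical helper, and the two sample entries are trimmed, made-up inspect results.

```python
def audit_container(entry: dict) -> list[str]:
    """Return findings for one `docker inspect` entry (empty list means pass)."""
    config = entry.get("Config") or {}
    host = entry.get("HostConfig") or {}
    findings = []

    uid = (config.get("User") or "").split(":", 1)[0]
    if uid in ("", "0", "root"):
        findings.append("runs as root (Config.User unset or 0)")
    if host.get("Privileged"):
        findings.append("privileged mode enabled")
    if not host.get("ReadonlyRootfs"):
        findings.append("root filesystem is writable")
    if "ALL" not in {c.upper() for c in host.get("CapDrop") or []}:
        findings.append("capabilities not dropped with ALL")

    # Docker accepts both ':' and '=' as separators in security options.
    sec_opts = [o.replace("=", ":") for o in host.get("SecurityOpt") or []]
    if not any(o in ("no-new-privileges", "no-new-privileges:true") for o in sec_opts):
        findings.append("no-new-privileges not set")
    if any(o in ("seccomp:unconfined", "apparmor:unconfined") for o in sec_opts):
        findings.append("seccomp or AppArmor unconfined")

    # 0 or null means unlimited for Memory; null, 0, or -1 for PidsLimit.
    if not host.get("Memory") or (host.get("PidsLimit") or 0) <= 0:
        findings.append("memory or PID limit missing")
    return findings

hardened = {"Config": {"User": "10001:10001"},
            "HostConfig": {"Privileged": False, "ReadonlyRootfs": True,
                           "CapDrop": ["ALL"],
                           "SecurityOpt": ["no-new-privileges:true"],
                           "Memory": 268435456, "PidsLimit": 128}}
defaults = {"Config": {"User": ""},
            "HostConfig": {"Privileged": False, "ReadonlyRootfs": False,
                           "CapDrop": None, "SecurityOpt": None,
                           "Memory": 0, "PidsLimit": None}}
print(audit_container(hardened), len(audit_container(defaults)))  # [] 5
```
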
🔒 Security

Make the checklist a CI gate, not a wiki page. Fail builds on hadolint warnings, Trivy critical CVEs, and root USER in the final stage. Architect approval required for any checklist exception—with expiry date and compensating controls documented.

🎯 Interview Tip

"How do you secure containers?" — Walk the layers: supply chain (pin, scan, sign), image (non-root, minimal), runtime (cap-drop, seccomp, read-only), secrets (runtime injection), network (segmentation). Mention that containers are not VMs and kernel sharing means defense in depth is mandatory.