Docker Images & Layers
An image is not a single file—it is an ordered stack of read-only layers, each a filesystem diff, content-addressed by SHA256 digest. Every Dockerfile instruction is a potential layer and a potential cache boundary. Master layer ordering and you master fast CI builds, small production images, and reproducible deploys.
Image anatomy
Think of an image as a git repository frozen in time: each layer is a commit (a filesystem diff), the tag is the branch pointer that can move, the manifest digest is the commit hash that cannot, and the config is metadata about how to run the result.
What an image contains
- Ordered read-only layers — each layer is a tar archive of files added, modified, or deleted (whiteout markers for deletes)
- Image manifest — JSON listing layer digests + config digest; may be a manifest list for multi-arch
- Image config — env vars, entrypoint, cmd, exposed ports, labels, build history metadata
- Content-addressed storage — layers identified by SHA256 digest; same bytes = same digest everywhere
┌────────────────────────────────────────────┐ ← image config (JSON)
│ CMD, ENV, ENTRYPOINT, ExposedPorts         │
├────────────────────────────────────────────┤
│ Layer 4 (RUN npm run build)         diff   │ ← read-only
├────────────────────────────────────────────┤
│ Layer 3 (COPY src/ ./src/)          diff   │
├────────────────────────────────────────────┤
│ Layer 2 (RUN npm ci)                diff   │
├────────────────────────────────────────────┤
│ Layer 1 (COPY package*.json)        diff   │
├────────────────────────────────────────────┤
│ Layer 0 (FROM node:20-alpine)       base   │ ← parent image layers
└────────────────────────────────────────────┘
merged view → container rootfs (+ writable upperdir at runtime)
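To make the stacking concrete, here is a toy Python model of how ordered layer diffs could be merged into one root filesystem view. It is a sketch, not Docker's overlay2 implementation: layers are plain dicts, only file-level whiteouts are handled, and the `.wh.` filename prefix follows the OCI whiteout convention. The function name `merge_layers` and the sample paths are hypothetical.

```python
import posixpath


def merge_layers(layers):
    """Apply layer diffs bottom-to-top and return the merged {path: content} view."""
    rootfs = {}
    for diff in layers:
        for path, content in diff.items():
            directory, name = posixpath.split(path)
            if name.startswith(".wh."):
                # Whiteout marker: hide the same-named entry from lower layers.
                target = posixpath.join(directory, name[len(".wh."):])
                rootfs.pop(target, None)
            else:
                # Added or modified file: the upper layer wins.
                rootfs[path] = content
    return rootfs


base = {"/etc/os-release": "alpine", "/tmp/cache.bin": "junk"}
deps = {"/app/node_modules/left-pad/index.js": "module.exports = ..."}
cleanup = {"/tmp/.wh.cache.bin": ""}  # deletes /tmp/cache.bin from the merged view

view = merge_layers([base, deps, cleanup])
print(sorted(view))
```

The deleted file disappears from the merged view, yet it is still present in the `base` dict, which mirrors why a delete in a later layer does not shrink the image.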
Manifest and config
When you docker pull nginx:1.25, the client fetches the manifest for that tag, then pulls each layer blob by digest. The tag is a mutable pointer; the digest (nginx@sha256:abc…) is immutable. Production deploys should pin digests.
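The tag-versus-digest distinction can be sketched in a few lines of Python. This is a simplified illustration, assuming a stripped-down manifest (real manifests also carry media types) and stand-in byte strings instead of real layer tarballs; the `digest` helper and the `tags`/`blobs` dicts are hypothetical names, not a registry API.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Content address in the 'sha256:<hex>' form used in image references."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


layer_blobs = [b"base layer tar bytes", b"app layer tar bytes"]  # stand-ins for tarballs
config_blob = json.dumps({"Cmd": ["nginx", "-g", "daemon off;"]}).encode()

manifest = {
    "schemaVersion": 2,
    "config": {"digest": digest(config_blob), "size": len(config_blob)},
    "layers": [{"digest": digest(b), "size": len(b)} for b in layer_blobs],
}
manifest_bytes = json.dumps(manifest, sort_keys=True).encode()
manifest_digest = digest(manifest_bytes)

blobs = {manifest_digest: manifest_bytes}  # digest -> bytes, never rewritten
tags = {"nginx:1.25": manifest_digest}     # name -> digest, can be repointed

# A rebuild publishes a different manifest under the same tag.
rebuilt = dict(manifest, layers=manifest["layers"][:1])
rebuilt_bytes = json.dumps(rebuilt, sort_keys=True).encode()
blobs[digest(rebuilt_bytes)] = rebuilt_bytes
tags["nginx:1.25"] = digest(rebuilt_bytes)

print(tags["nginx:1.25"] != manifest_digest)     # the tag moved
print(blobs[manifest_digest] == manifest_bytes)  # the pinned digest still resolves
```

A deploy that recorded `manifest_digest` keeps getting the original bytes, while a deploy that recorded only the tag silently picks up the rebuild.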
Dangling images
After rebuilding, old layer chains lose their tag—they become dangling images (<none> in docker image ls). They still consume disk until pruned. CI runners accumulate these rapidly without docker image prune or registry lifecycle policies.
# Inspect image layers and sizes
docker history --no-trunc myapp:1.0
# Show manifest digest (immutable reference)
docker inspect --format '{{index .RepoDigests 0}}' nginx:1.25-alpine
# Find dangling images
docker images -f dangling=true
Locally, unpacked layers live under /var/lib/docker/overlay2/ (or in containerd's content store). A registry stores the same layers as compressed blobs keyed by digest. Because the digest identifies the content, pulling an image someone else built skips any layer whose digest you already have.
"What's the difference between an image and a container?" — An image is read-only layers + config. A container is an image + a writable container layer + runtime config (network, mounts, cgroup limits). Many containers can share one image's lower layers.
Layer caching
Docker's build cache is the difference between a 30-second CI build and a 10-minute one. Each instruction is a cache key; change one line and every layer after it rebuilds.
How the cache works
For each Dockerfile instruction, BuildKit/Docker computes a cache key from:
- The instruction text itself
- The parent layer digest
- A hash of files referenced by COPY/ADD (build context)
Cache hit → skip execution, reuse existing layer. Cache miss → execute instruction, create new layer, invalidate all subsequent instructions.
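The chaining of cache keys can be modelled roughly as follows. This is a simplified sketch of the idea described above, not BuildKit's actual key derivation: each key hashes the parent key, the instruction text, and the contents of copied files. The `cache_keys` function and the file contents are hypothetical.

```python
import hashlib


def cache_keys(steps, files):
    """Return one key per step; each key folds in the parent key, the
    instruction text, and the contents of any files the step copies."""
    keys, parent = [], "scratch"
    for instruction, copied_paths in steps:
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(instruction.encode())
        for path in sorted(copied_paths):
            h.update(path.encode())
            h.update(files[path])
        parent = h.hexdigest()
        keys.append(parent)
    return keys


steps = [
    ("FROM node:20-alpine", []),
    ("COPY package.json package-lock.json ./", ["package.json", "package-lock.json"]),
    ("RUN npm ci", []),
    ("COPY . .", ["package.json", "package-lock.json", "src/index.js"]),
    ("RUN npm run build", []),
]
files = {"package.json": b"{}", "package-lock.json": b"{}", "src/index.js": b"v1"}

before = cache_keys(steps, files)
files["src/index.js"] = b"v2"  # edit application source only
after = cache_keys(steps, files)

hits = [old == new for old, new in zip(before, after)]
print(hits)  # → [True, True, True, False, False]
```

Editing only the source leaves the first three keys untouched, so the dependency install is reused; the miss starts at `COPY . .` and propagates to everything after it.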
The golden pattern
Order from least-changing to most-changing:
- OS packages and system dependencies
- Language dependency manifests (package-lock.json, pom.xml, requirements.txt)
- Install dependencies (npm ci, mvn dependency:go-offline)
- Application source code last
# syntax=docker/dockerfile:1.6
FROM node:20-alpine
WORKDIR /app
# 1. Copy only the manifest and lockfile so this cache survives source changes
COPY package.json package-lock.json ./
# Full install: the build step below usually needs devDependencies
RUN npm ci
# 2. Source changes invalidate only this layer and the ones after it
COPY . .
RUN npm run build
Cache invalidation triggers
| Trigger | Effect | Mitigation |
|---|---|---|
| Any instruction text change | Miss at that layer + all following | Pin versions; avoid churn in early layers |
| COPY . . early in Dockerfile | Any file change busts entire cache | Copy manifests first; use .dockerignore |
| Build context includes node_modules | Context hash changes constantly | .dockerignore excludes heavy dirs |
| --no-cache flag | Full rebuild | Use only when debugging cache issues |
| Base image tag updated (:latest) | Miss from FROM onward | Pin to digest or specific patch tag |
COPY . . as the first instruction after FROM is the most common cache killer. Every code edit rebuilds dependency installation. Always copy lockfiles first.
Use docker build --progress=plain to see cache hit/miss per step. In CI, use registry cache (--cache-from) so cold runners still benefit from previous pipeline runs.
Dockerfile instructions deep dive
Every instruction has runtime vs build-time semantics, layer implications, and production pitfalls. The table below is your reference—then we walk the critical ones in detail.
| Instruction | Purpose | Example | Pitfall |
|---|---|---|---|
| FROM | Base image; starts new stage | FROM eclipse-temurin:21-jre@sha256:… | :latest breaks reproducibility |
| RUN | Execute command; creates layer | RUN apt-get update && apt-get install -y curl | Each RUN = layer; chain with && |
| COPY | Copy from build context | COPY --chown=app:app target/app.jar . | Large context slows hash computation |
| ADD | Copy + tar auto-extract + URL fetch | ADD https://…/file.tar.gz /tmp/ | Prefer COPY; ADD surprises in review |
| WORKDIR | Set working directory | WORKDIR /app | Use absolute paths only |
| ENV | Runtime environment variable | ENV NODE_ENV=production | Visible in docker inspect |
| ARG | Build-time variable only | ARG MAVEN_VERSION=3.9 | Visible in docker history—never secrets |
| EXPOSE | Document intended port | EXPOSE 8080 | Does NOT publish port to host |
| ENTRYPOINT | Main executable (PID 1 target) | ENTRYPOINT ["java","-jar","app.jar"] | Shell form makes sh PID 1 |
| CMD | Default args to ENTRYPOINT | CMD ["--spring.profiles.active=prod"] | Overridden by docker run … cmd |
| USER | Run as non-root user | USER 1001 | Set before final CMD/ENTRYPOINT |
| HEALTHCHECK | Container health probe | HEALTHCHECK CMD curl -f http://localhost:8080/actuator/health | Used by Docker and Compose only; Kubernetes ignores it, so define probes there |
| LABEL | Image metadata | LABEL org.opencontainers.image.version="1.2.0" | Use OCI label schema for tooling |
| ONBUILD | Trigger for child images | ONBUILD COPY . /app | Surprising inheritance; rare today |
| SHELL | Override default shell for shell-form instructions | SHELL ["/bin/bash","-c"] | Affects shell-form RUN, CMD, and ENTRYPOINT; exec form is untouched |
FROM — base image and stages
Every build stage begins with FROM; only comments, parser directives, and ARG may precede the first one. Special bases: scratch (empty, used for static Go binaries) and distroless (minimal runtime, no shell). Always pin to a digest in production: FROM debian:bookworm-slim@sha256:….
RUN — shell form vs exec form
Shell form: RUN npm install → runs as /bin/sh -c "npm install". Exec form: RUN ["npm", "install"] — no shell, no variable expansion. Chain commands with && in one RUN to avoid extra layers and ensure fail-fast.
RUN apt-get update \
&& apt-get install -y --no-install-recommends curl ca-certificates \
&& rm -rf /var/lib/apt/lists/*
COPY vs ADD
COPY is explicit—files from build context only. ADD additionally auto-extracts local tar archives and can fetch URLs (without cache benefits of COPY). Docker official best practice: use COPY unless you specifically need tar extraction.
ARG vs ENV — build-time vs runtime
ARG exists only during docker build—not in running containers. ENV persists into runtime and appears in docker inspect. Neither is safe for secrets: ARG shows in history, ENV shows in inspect and child images.
ENTRYPOINT vs CMD — PID 1 semantics
| Form | Example | PID 1 process | SIGTERM behavior |
|---|---|---|---|
| Exec ENTRYPOINT | ENTRYPOINT ["java","-jar","app.jar"] | java | JVM receives SIGTERM → graceful shutdown |
| Shell CMD | CMD java -jar app.jar | /bin/sh | SIGTERM to sh; java may not exit cleanly → SIGKILL after 10s |
| ENTRYPOINT + CMD | ENTRYPOINT ["java","-jar"] + CMD ["app.jar"] | java | CMD args append; override at runtime |
Shell form CMD/ENTRYPOINT wraps your app in /bin/sh -c. The shell becomes PID 1—it does not forward signals to children. Spring Boot graceful shutdown requires exec form or tini/--init.
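The signal-forwarding half of what an init shim does can be sketched in Python. This is an illustrative model under stated assumptions, not tini's implementation: it is POSIX-only, must run in the main thread, and omits zombie reaping. The function name `run_with_signal_forwarding` is hypothetical.

```python
import signal
import subprocess
import sys


def run_with_signal_forwarding(cmd):
    """Start cmd as a child and relay SIGTERM/SIGINT to it instead of swallowing them."""
    child = subprocess.Popen(cmd)

    def forward(signum, _frame):
        # Pass the signal on so the child can begin its own shutdown.
        child.send_signal(signum)

    previous = {s: signal.signal(s, forward) for s in (signal.SIGTERM, signal.SIGINT)}
    try:
        # A negative return code means the child was terminated by that signal number.
        return child.wait()
    finally:
        for s, handler in previous.items():
            signal.signal(s, handler)


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_with_signal_forwarding(sys.argv[1:]))
```

A shell-form PID 1 behaves like this wrapper with the `forward` handler removed: the stop signal reaches the wrapper, the child never hears about it, and the runtime eventually resorts to SIGKILL.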
USER — non-root by default in production
Set USER before the final CMD/ENTRYPOINT. Prefer numeric UID (USER 1001) over username to avoid base-image-specific user tables. Use COPY --chown=1001:1001 during build—don't chown at runtime.
HEALTHCHECK
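Defines how Docker and Docker Compose determine whether the container is healthy. Kubernetes ignores the Dockerfile HEALTHCHECK, so the same check has to be restated there as a liveness or readiness probe. Parameters: --interval, --timeout, --retries, --start-period. Health status appears in the docker ps STATUS column.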
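How the parameters interact can be sketched as a small state machine. This is a simplified model based on the documented behaviour (failures inside the start period are not counted until a probe has succeeded once; `retries` consecutive failures mark the container unhealthy). It assumes probe i runs at (i + 1) × interval seconds and treats a timeout as an ordinary failure; `health_status` is a hypothetical helper.

```python
def health_status(probes, interval=30, retries=3, start_period=0):
    """Replay probe results (True = exit code 0) and return the status after each probe."""
    status, failures, started = "starting", 0, False
    history = []
    for i, ok in enumerate(probes):
        elapsed = (i + 1) * interval
        if ok:
            # Any success resets the failure streak and ends the start phase.
            status, failures, started = "healthy", 0, True
        elif started or elapsed > start_period:
            failures += 1
            if failures >= retries:
                status = "unhealthy"
        # Failures inside the start period (before any success) are ignored.
        history.append(status)
    return history


timeline = health_status(
    [False, False, True, False, False, False],
    interval=10, retries=3, start_period=25,
)
print(timeline)
# → ['starting', 'starting', 'healthy', 'healthy', 'healthy', 'unhealthy']
```

The two early failures fall inside the 25-second start period and are forgiven; only the three consecutive failures after the first success flip the status.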
Secrets in layers: ARG API_KEY=xxx and ENV API_KEY=xxx persist in image history forever. Use BuildKit --mount=type=secret at build time and runtime secret injection (Vault, K8s Secrets) for credentials.
Multi-stage builds
Problem: build tools (Maven, gcc, npm devDependencies) bloat production images and expand attack surface. Solution: compile in a builder stage, copy only artifacts into a minimal runtime stage.
Named stages and COPY --from
Each FROM begins a new stage. Name stages with AS builder. COPY --from=builder pulls files from a previous stage—not from the build context. Only the final stage becomes the tagged image (unless you --target a specific stage).
Java — Maven build → JRE runtime
# syntax=docker/dockerfile:1.6
FROM maven:3.9-eclipse-temurin-21 AS builder
WORKDIR /build
COPY pom.xml .
RUN mvn -B dependency:go-offline
COPY src ./src
RUN mvn -B package -DskipTests
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=builder /build/target/*.jar app.jar
USER 1001
ENTRYPOINT ["java", "-jar", "app.jar"]
Node.js — build → alpine runtime
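# Builder uses the same Alpine (musl) base as the runtime so native addons stay compatible
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies before node_modules is copied into the runtime stage
RUN npm prune --omit=dev
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
USER node
CMD ["node", "dist/server.js"]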
Go — static binary → scratch
FROM golang:1.22-alpine AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /app .
FROM scratch
COPY --from=builder /app /app
USER 65532:65532
ENTRYPOINT ["/app"]
Test stage — fail build if tests fail
FROM builder AS test
RUN mvn -B test
FROM eclipse-temurin:21-jre-alpine AS runtime
COPY --from=builder /build/target/*.jar /app.jar
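CI runs docker build --target test to execute the tests inside the build graph without shipping the test stage. With BuildKit, a plain docker build of the runtime stage skips the test stage entirely, because runtime does not depend on it.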
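Which stages a given target pulls in can be worked out with a simple reachability walk. The sketch below assumes BuildKit's behaviour of building only the stages the target depends on; the `deps` dict (stage name to the stages it references through FROM or COPY --from) and the function name are hypothetical.

```python
def stages_to_build(deps, target):
    """Return the set of stages reachable from target via FROM / COPY --from edges."""
    needed, stack = set(), [target]
    while stack:
        stage = stack.pop()
        if stage not in needed:
            needed.add(stage)
            stack.extend(deps.get(stage, []))
    return needed


# Edges mirror the Maven example: test is FROM builder, runtime copies from builder.
deps = {"builder": [], "test": ["builder"], "runtime": ["builder"]}

print(sorted(stages_to_build(deps, "test")))     # → ['builder', 'test']
print(sorted(stages_to_build(deps, "runtime")))  # → ['builder', 'runtime']
```

Because `runtime` never reaches `test`, the tests only gate the pipeline when CI builds the `test` target explicitly (or when the runtime stage is made to copy something from it).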
Spring Boot 2.3+ supports layered JARs: extract dependencies, spring-boot-loader, snapshot-dependencies, and application into separate COPY layers so that application code changes don't invalidate the much larger dependency layers. See Production Dockerfile Patterns.
Base image selection
Your base image is the floor for size, CVE count, and compatibility. The wrong base costs weeks of musl/glibc debugging or bloated scans. Choose deliberately per workload—not reflexively "Alpine because small."
| Base | Approx size | Best for | Watch out |
|---|---|---|---|
| ubuntu:22.04 / debian:bookworm | ~78 MB / ~117 MB | General purpose, apt packages, glibc | Large; many packages = more CVEs |
| debian:bookworm-slim | ~74 MB | Debian compatibility, fewer packages | Still glibc; needs apt hygiene in RUN |
| alpine:3.19 | ~7 MB | Static binaries, Go, Node without native addons | musl ≠ glibc — Java/native libs may break |
| gcr.io/distroless/* | ~20–50 MB | Production Java/Node/Go — no shell, minimal CVEs | No shell for debugging; use debug variants |
| eclipse-temurin / amazoncorretto | Varies by tag | JVM apps with vendor support | Use -jre not -jdk in runtime stage |
| Red Hat UBI | ~80 MB | Enterprise/OpenShift, RHEL-compatible | Subscription not required to run; good governance |
| scratch | 0 B | Static Go/Rust binaries only | No libc, no CA certs unless you COPY them |
Alpine and musl gotchas
Alpine uses musl libc instead of glibc. Many prebuilt native binaries (Oracle JDK, some Python wheels, Node native modules) assume glibc. Java on Alpine needs a musl-aware build or an extra glibc compatibility layer, which often negates the size win. For JVM production, prefer a distroless Java image (gcr.io/distroless/java21-debian12), or eclipse-temurin:21-jre-alpine once your native dependencies have been tested on musl.
Distroless — Google's minimal production base
Distroless images contain only your app and runtime dependencies—no shell, no package manager, no curl. Attack surface shrinks dramatically. Debug with :debug tags (include busybox shell) during development only.
Image size vs compatibility: Alpine saves MB but costs engineering time when native deps break. Distroless saves security review time but complicates ad-hoc docker exec debugging. Full Debian/Ubuntu maximizes compatibility at scan and transfer cost.
Google maintains the distroless images and promotes them for production use. Netflix maintains curated base images with pre-approved packages and automated CVE patching. Platform teams often publish one blessed base per language, so developers inherit governance by default.
Image size optimization
Smaller images pull faster, scan faster, and deploy faster. Measure first—optimize what actually matters on the critical path.
Measure before optimizing
# List images by size
docker image ls --format 'table {{.Repository}}\t{{.Tag}}\t{{.Size}}'
# Per-layer breakdown
docker history --human --no-trunc myapp:latest
# Deep analysis (install dive: https://github.com/wagoodman/dive)
dive myapp:latest
Techniques ranked by impact
| Technique | Typical savings | Example |
|---|---|---|
| Multi-stage builds | 50–90% (removes build tools) | Maven builder → JRE runtime stage |
| Slim/distroless base | 30–70% vs full OS | gcr.io/distroless/java21-debian12 |
| Combine RUN commands | Fewer layers; remove apt cache | rm -rf /var/lib/apt/lists/* in same RUN |
| .dockerignore | Smaller context + faster cache hash | Exclude .git, node_modules, target/ |
| COPY --chown | Avoids extra chown layer | COPY --chown=1001:1001 app.jar . |
| docker build --squash | Merges layers (loses cache granularity) | Experimental legacy-builder flag, not supported by BuildKit; rarely needed with multi-stage |
.dockerignore essentials
.git
.gitignore
node_modules
npm-debug.log
target/
*.md
.env
.env.*
coverage/
.idea/
.vscode/
**/*_test.go
Dockerfile*
apt-get without cache cleanup: RUN apt-get update && apt-get install -y curl leaves the package indexes under /var/lib/apt/lists/ in the layer forever, often tens of MB. Always delete package manager caches in the same RUN instruction that created them.
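Why the cleanup has to happen in the same RUN can be shown with a little arithmetic. The sketch below is a toy model: each layer is a dict of path to size in bytes, `None` marks a deletion, and the 40 MB index size is an illustrative assumption rather than a measurement.

```python
def image_size(layers):
    """Bytes shipped: every file a layer adds counts, even if a later layer deletes it."""
    return sum(size for diff in layers for size in diff.values() if size is not None)


def rootfs_size(layers):
    """Bytes visible in the merged filesystem; None marks a deletion (whiteout)."""
    merged = {}
    for diff in layers:
        for path, size in diff.items():
            if size is None:
                merged.pop(path, None)
            else:
                merged[path] = size
    return sum(merged.values())


MB = 1024 * 1024
# Two RUN instructions: the cleanup lands in its own layer, too late to help.
separate = [
    {"/usr/bin/curl": 1 * MB, "/var/lib/apt/lists/index": 40 * MB},
    {"/var/lib/apt/lists/index": None},
]
# One chained RUN: the index never reaches the committed layer.
chained = [{"/usr/bin/curl": 1 * MB}]

print(image_size(separate) // MB, rootfs_size(separate) // MB)  # → 41 1
print(image_size(chained) // MB, rootfs_size(chained) // MB)    # → 1 1
```

Both variants produce the same container filesystem, but the two-layer version still pushes and pulls the deleted index on every deploy.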
dive shows layer efficiency score and wasted space. Aim for >95% efficiency in production images. If one RUN layer adds 200 MB, that's your optimization target—not shaving 1 MB off a label.
BuildKit
BuildKit is Docker's modern build engine (default Docker 23+). It parallelizes independent stages, mounts caches and secrets without polluting layers, and exports build artifacts flexibly.
Enable BuildKit
# Per-build
DOCKER_BUILDKIT=1 docker build -t myapp .
# Permanent (daemon.json)
# { "features": { "buildkit": true } }
# Dockerfile syntax directive (unlocks latest features)
# syntax=docker/dockerfile:1.6
Cache mounts — persist across builds
Unlike regular layers, cache mounts are not committed to the image. Maven, npm, and Go module caches survive between builds without bloating the final image.
# syntax=docker/dockerfile:1.6
FROM maven:3.9-eclipse-temurin-21 AS builder
WORKDIR /build
COPY pom.xml .
RUN --mount=type=cache,target=/root/.m2 \
mvn -B dependency:go-offline
COPY src ./src
RUN --mount=type=cache,target=/root/.m2 \
mvn -B package -DskipTests
Secret mounts — never in layers
# Build with secret (not stored in image)
docker build --secret id=npmrc,src=$HOME/.npmrc -t myapp .
# Dockerfile
# RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm ci
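When no target is given, BuildKit mounts a secret at /run/secrets/<id> for the duration of that one RUN instruction. A build script invoked from that step could read it along these lines; this is a minimal sketch, and both the helper name `read_build_secret` and the secret id `pip_password` are hypothetical.

```python
from pathlib import Path


def read_build_secret(secret_id, secrets_dir="/run/secrets"):
    """Return the secret's text, or None when the build did not mount it."""
    path = Path(secrets_dir) / secret_id
    if not path.is_file():
        return None
    # Strip the trailing newline most secret files end with.
    return path.read_text().strip()


password = read_build_secret("pip_password")
if password is None:
    print("no secret mounted; falling back to the public index")
```

Returning None instead of raising keeps the same script usable in builds that do not pass the secret, and nothing is written to disk, so the value never lands in a layer.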
SSH mounts — private git clones
RUN --mount=type=ssh forwards your SSH agent into the build—clone private repos without embedding keys in the image. Build with docker build --ssh default.
Registry cache — fast CI
docker buildx build \
--cache-from type=registry,ref=myregistry/myapp:buildcache \
--cache-to type=registry,ref=myregistry/myapp:buildcache,mode=max \
--push -t myregistry/myapp:latest .
BuildKit features reference
| Feature | Purpose | Example |
|---|---|---|
| Parallel stages | Independent stages build concurrently | Frontend + backend multi-stage in one Dockerfile |
| --mount=type=cache | Persistent package manager caches | target=/root/.m2, /root/.npm |
| --mount=type=secret | Build-time credentials | NPM token, pip index password |
| --mount=type=ssh | SSH agent forwarding | Private GitHub dependencies |
| --output type=local | Export files without image | dest=./dist for static sites |
| Inline cache | Embed cache metadata in pushed image | --cache-to type=inline |
| Provenance/SBOM | Supply chain attestations | --attest type=sbom |
flowchart LR
  CTX[Build context]
  BK[BuildKit solver]
  S1[Stage: builder]
  S2[Stage: runtime]
  CACHE[(Cache mounts\n.m2 / npm)]
  SEC[Secret mounts]
  REG[(Registry cache)]
  IMG[Final image]
  CTX --> BK
  BK --> S1
  BK --> S2
  S1 --> CACHE
  S1 --> SEC
  BK --> REG
  S2 --> IMG
BuildKit uses a DAG solver that rebuilds only the nodes whose inputs changed. The legacy builder was linear, executing instruction by instruction. That is why independent stages and cache mounts dramatically outperform the old docker build on large projects.
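The scheduling benefit of a graph over a linear list can be illustrated with Python's standard `graphlib`. This is a conceptual sketch, not BuildKit's solver: it groups stages into waves whose members have no dependency on each other and could therefore run concurrently. The stage names are hypothetical; `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def build_waves(deps):
    """Group stages into waves; stages in the same wave do not depend on each other."""
    sorter = TopologicalSorter(deps)  # maps each stage to the stages it depends on
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


deps = {
    "frontend-build": [],
    "backend-build": [],
    "runtime": ["frontend-build", "backend-build"],
}
print(build_waves(deps))  # → [['backend-build', 'frontend-build'], ['runtime']]
```

A linear builder would run the three stages back to back; the graph view shows that the two build stages are independent, so only the final stage has to wait.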
"How do you pass secrets to a Docker build?" — Wrong: ARG/ENV. Right: BuildKit --mount=type=secret (build) + runtime secret managers (deploy). Mention secrets never appear in docker history.