Docker Images & Layers

Image anatomy

Think of an image as a git repository frozen in time: each layer is a commit (a filesystem diff), the manifest records which layers make up the image, the tag is the branch pointer, and the config is metadata about how to run the result.

What an image contains

Ordered read-only layers — each layer is a tar archive of files added, modified, or deleted (whiteout markers for deletes)
Image manifest — JSON listing layer digests + config digest; may be a manifest list for multi-arch
Image config — env vars, entrypoint, cmd, exposed ports, labels, build history metadata
Content-addressed storage — layers identified by SHA256 digest; same bytes = same digest everywhere

┌─────────────────────────────────────────┐  ← image config (JSON)
│  CMD, ENV, ENTRYPOINT, ExposedPorts     │
├─────────────────────────────────────────┤
│  Layer 4  (RUN npm run build)     diff  │  ← read-only
├─────────────────────────────────────────┤
│  Layer 3  (COPY src/ ./src/)      diff  │
├─────────────────────────────────────────┤
│  Layer 2  (RUN npm ci)            diff  │
├─────────────────────────────────────────┤
│  Layer 1  (COPY package*.json)    diff  │
├─────────────────────────────────────────┤
│  Layer 0  (FROM node:20-alpine)   base  │  ← parent image layers
└─────────────────────────────────────────┘
         merged view → container rootfs (+ writable upperdir at runtime)
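The merged view can be sketched in a few lines of Python. This is a toy model, not how overlayfs is implemented: layers are plain dicts of path to content, the `.wh.` prefix follows the OCI whiteout naming, and opaque whiteouts are left out. The paths and the names `apply_layer` and `merge` are made up for illustration.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."  # OCI marker: ".wh.<name>" hides <name> from lower layers


def apply_layer(rootfs, layer):
    """Apply one layer diff (path -> content) on top of a merged view."""
    merged = dict(rootfs)
    # Pass 1: whiteouts hide entries that came from lower layers.
    for path in layer:
        directory, name = posixpath.split(path)
        if name.startswith(WHITEOUT_PREFIX):
            target = posixpath.join(directory, name[len(WHITEOUT_PREFIX):])
            for existing in list(merged):
                if existing == target or existing.startswith(target + "/"):
                    del merged[existing]
    # Pass 2: regular entries are added, or overwrite what is below.
    for path, content in layer.items():
        if not posixpath.basename(path).startswith(WHITEOUT_PREFIX):
            merged[path] = content
    return merged


def merge(layers):
    """Fold layers bottom to top into the view a container would see."""
    rootfs = {}
    for layer in layers:
        rootfs = apply_layer(rootfs, layer)
    return rootfs


# Hypothetical layers, bottom to top
layers = [
    {"/app/package.json": "v1", "/app/tmp/cache.bin": "x"},
    {"/app/package.json": "v2"},   # modifies a file
    {"/app/.wh.tmp": ""},          # deletes the /app/tmp directory
]
print(merge(layers))
```

Note that the deleted file still exists in the bottom layer; the whiteout only hides it from the merged view.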

Manifest and config

When you docker pull nginx:1.25, the client fetches the manifest for that tag, then pulls each layer blob by digest. The tag is a mutable pointer; the digest (nginx@sha256:abc…) is immutable. Production deploys should pin digests.
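A rough Python model of that relationship, assuming a heavily simplified manifest (real OCI manifests also carry media types and other fields). The layer contents and the `tags` dict are invented for the example.

```python
import hashlib
import json


def digest(blob: bytes) -> str:
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def build_manifest(config: bytes, layers: list) -> bytes:
    """Serialize a minimal manifest that references blobs by digest."""
    manifest = {
        "schemaVersion": 2,
        "config": {"digest": digest(config), "size": len(config)},
        "layers": [{"digest": digest(blob), "size": len(blob)} for blob in layers],
    }
    # sort_keys keeps the bytes, and therefore the digest, deterministic here
    return json.dumps(manifest, sort_keys=True).encode()


tags = {}  # mutable: tag -> manifest digest

v1 = build_manifest(b'{"Cmd":["nginx"]}', [b"layer-a", b"layer-b"])
tags["nginx:1.25"] = digest(v1)
pinned = tags["nginx:1.25"]        # what a deploy would record

v2 = build_manifest(b'{"Cmd":["nginx"]}', [b"layer-a", b"layer-b-patched"])
tags["nginx:1.25"] = digest(v2)    # the tag is re-pointed after a rebuild

print("tag still points at pinned digest:", tags["nginx:1.25"] == pinned)
```

The pinned digest keeps identifying the first manifest no matter where the tag moves, which is the reason to pin digests in deploys.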

Dangling images

After a rebuild, the tag moves to the new image and the previous image is left untagged: a dangling image (<none> in docker image ls). It still consumes disk until pruned. CI runners accumulate these rapidly without a scheduled docker image prune, and registries need lifecycle policies for the same reason.

# Inspect image layers and sizes
docker history --no-trunc myapp:1.0

# Show manifest digest (immutable reference)
docker inspect --format '{{index .RepoDigests 0}}' nginx:1.25-alpine

# Find dangling images
docker images -f dangling=true

🔬 Under the Hood

Locally, unpacked layers live under /var/lib/docker/overlay2/ (or, with the containerd image store, in containerd's content store and snapshots). Registries store the same layers as compressed blobs keyed by digest. Because a layer's identity is its digest, pulling an image someone else built reuses any layers you already have.
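Layer reuse falls out of content addressing, which a small Python sketch can show. The `LocalStore` class and the layer contents are hypothetical; a real client also handles compression, media types, and concurrent downloads.

```python
import hashlib


def digest(blob: bytes) -> str:
    return "sha256:" + hashlib.sha256(blob).hexdigest()


class LocalStore:
    """Toy content-addressed store: the key is derived from the bytes."""

    def __init__(self):
        self.blobs = {}
        self.downloaded = 0  # bytes actually transferred

    def pull(self, registry, layer_digests):
        for d in layer_digests:
            if d in self.blobs:
                continue  # already present locally: nothing to download
            blob = registry[d]
            if digest(blob) != d:
                raise ValueError("content does not match its digest")
            self.blobs[d] = blob
            self.downloaded += len(blob)


base, app_a, app_b = b"base layer", b"service A code", b"service B code"
registry = {digest(b): b for b in (base, app_a, app_b)}

store = LocalStore()
store.pull(registry, [digest(base), digest(app_a)])   # image A
store.pull(registry, [digest(base), digest(app_b)])   # image B shares the base
print(store.downloaded, len(store.blobs))
```

The second pull transfers only the layer that differs, regardless of who built either image.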

🎯 Interview Tip

"What's the difference between an image and a container?" — An image is read-only layers + config. A container is an image + a writable container layer + runtime config (network, mounts, cgroup limits). Many containers can share one image's lower layers.

Layer caching

Docker's build cache is the difference between a 30-second CI build and a 10-minute one. Each instruction is a cache key; change one line and every layer after it rebuilds.

How the cache works

For each Dockerfile instruction, BuildKit/Docker computes a cache key from:

The instruction text itself
The parent layer digest
A hash of files referenced by COPY/ADD (build context)

Cache hit → skip execution, reuse existing layer. Cache miss → execute instruction, create new layer, invalidate all subsequent instructions.
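The chaining is what makes a miss cascade. Here is a minimal Python model of it, assuming a made-up key function; BuildKit's real cache keys are more involved, and the step list and file contents are illustrative.

```python
import hashlib


def sha(*parts: str) -> str:
    return hashlib.sha256("\0".join(parts).encode()).hexdigest()


def cache_keys(steps, files):
    """steps: list of (instruction, [paths it copies]); files: path -> content."""
    keys, parent = [], "scratch"
    for instruction, copied in steps:
        content_hash = sha(*(f"{p}={files[p]}" for p in sorted(copied)))
        parent = sha(parent, instruction, content_hash)  # chained on the parent key
        keys.append(parent)
    return keys


def rebuilt_steps(old_keys, new_keys):
    """Indexes of instructions whose key changed, i.e. cache misses."""
    return [i for i, (a, b) in enumerate(zip(old_keys, new_keys)) if a != b]


steps = [
    ("FROM node:20-alpine", []),
    ("COPY package.json package-lock.json ./", ["package-lock.json", "package.json"]),
    ("RUN npm ci", []),
    ("COPY . .", ["package-lock.json", "package.json", "src/index.js"]),
    ("RUN npm run build", []),
]
files = {"package.json": "deps v1", "package-lock.json": "lock v1",
         "src/index.js": "console.log(1)"}

baseline = cache_keys(steps, files)
source_edit = cache_keys(steps, {**files, "src/index.js": "console.log(2)"})
dependency_bump = cache_keys(steps, {**files, "package-lock.json": "lock v2"})

print(rebuilt_steps(baseline, source_edit))
print(rebuilt_steps(baseline, dependency_bump))
```

A source edit only misses from `COPY . .` onward, while a lockfile change misses from the first `COPY`, even for `RUN npm ci` whose text never changed.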

The golden pattern

Order from least-changing to most-changing:

OS packages and system dependencies
Language dependency manifests (package-lock.json, pom.xml, requirements.txt)
Install dependencies (npm ci, mvn dependency:go-offline)
Application source code last

# syntax=docker/dockerfile:1.6
FROM node:20-alpine
WORKDIR /app
# 1. Copy only lockfiles — cache survives source changes
COPY package.json package-lock.json ./
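# Install all deps: npm run build usually needs devDependencies,
# which --omit=dev would leave out (use a multi-stage build to drop them later)
RUN npm ci
# 2. Source changes only invalidate layers from here down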
COPY . .
RUN npm run build

Cache invalidation triggers

Trigger	Effect	Mitigation
Any instruction text change	Miss at that layer + all following	Pin versions; avoid churn in early layers
COPY . . early in Dockerfile	Any file change busts entire cache	Copy manifests first; use .dockerignore
Build context includes node_modules	Context hash changes constantly	.dockerignore excludes heavy dirs
--no-cache flag	Full rebuild	Use only when debugging cache issues
Base image tag updated (:latest)	Miss from FROM onward	Pin to digest or specific patch tag

⚠️ Pitfall

COPY . . as the first instruction after FROM is the most common cache killer. Every code edit rebuilds dependency installation. Always copy lockfiles first.

💡 Pro Tip

Use docker build --progress=plain to see cache hit/miss per step. In CI, use registry cache (--cache-from) so cold runners still benefit from previous pipeline runs.

Dockerfile instructions deep dive

Every instruction has runtime vs build-time semantics, layer implications, and production pitfalls. The table below is your reference—then we walk the critical ones in detail.

Instruction	Purpose	Example	Pitfall
FROM	Base image; starts new stage	FROM eclipse-temurin:21-jre@sha256:…	:latest breaks reproducibility
RUN	Execute command; creates layer	RUN apt-get update && apt-get install -y curl	Each RUN = layer; chain with &&
COPY	Copy from build context	COPY --chown=app:app target/app.jar .	Large context slows hash computation
ADD	Copy + tar auto-extract + URL fetch	ADD https://…/file.tar.gz /tmp/	Prefer COPY; ADD surprises in review
WORKDIR	Set working directory	WORKDIR /app	Use absolute paths only
ENV	Environment variable (build steps and runtime)	ENV NODE_ENV=production	Visible in docker inspect
ARG	Build-time variable only	ARG MAVEN_VERSION=3.9	Visible in docker history—never secrets
EXPOSE	Document intended port	EXPOSE 8080	Does NOT publish port to host
ENTRYPOINT	Main executable (PID 1 target)	ENTRYPOINT ["java","-jar","app.jar"]	Shell form makes sh PID 1
CMD	Default args to ENTRYPOINT	CMD ["--spring.profiles.active=prod"]	Overridden by docker run … cmd
USER	Run as non-root user	USER 1001	Set before final CMD/ENTRYPOINT
HEALTHCHECK	Container health probe	HEALTHCHECK CMD curl -f http://localhost/actuator/health	Missing in prod = blind orchestration
LABEL	Image metadata	LABEL org.opencontainers.image.version="1.2.0"	Use OCI label schema for tooling
ONBUILD	Trigger for child images	ONBUILD COPY . /app	Surprising inheritance; rare today
SHELL	Override default shell for shell-form instructions	SHELL ["/bin/bash","-c"]	Affects shell-form RUN, CMD, and ENTRYPOINT; exec form is unaffected

FROM — base image and stages

A Dockerfile's first instruction is FROM; only comments, parser directives, and ARG declarations may come before it. Special bases: scratch (empty, used for static Go binaries) and distroless (minimal runtime, no shell). Always pin to a digest in production: FROM debian:bookworm-slim@sha256:….

RUN — shell form vs exec form

Shell form: RUN npm install → runs as /bin/sh -c "npm install". Exec form: RUN ["npm", "install"] — no shell, no variable expansion. Chain commands with && in one RUN to avoid extra layers and ensure fail-fast.

RUN apt-get update \
 && apt-get install -y --no-install-recommends curl ca-certificates \
 && rm -rf /var/lib/apt/lists/*
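The same shell-versus-exec distinction exists in most process APIs, so it can be tried outside Docker. The Python sketch below is an analogy, not Docker's implementation. It assumes a POSIX system with /bin/sh, and `GREETING` is a made-up variable.

```python
import os
import subprocess
import sys

env = dict(os.environ, GREETING="hello")
show_arg = "import sys; print(sys.argv[1])"

# Exec form: argv is passed to the program as-is, no shell involved.
exec_form = subprocess.run(
    [sys.executable, "-c", show_arg, "$GREETING"],
    env=env, capture_output=True, text=True, check=True,
).stdout.strip()

# Shell form: the string is handed to /bin/sh -c, which expands variables.
shell_form = subprocess.run(
    'echo "$GREETING"',
    shell=True, env=env, capture_output=True, text=True, check=True,
).stdout.strip()

print(repr(exec_form), repr(shell_form))
```

In exec form the program receives the literal text `$GREETING`, which is why exec-form RUN needs an explicit `sh -c` when you want expansion, pipes, or `&&`.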

COPY vs ADD

COPY is explicit—files from build context only. ADD additionally auto-extracts local tar archives and can fetch URLs (without cache benefits of COPY). Docker official best practice: use COPY unless you specifically need tar extraction.

ARG vs ENV — build-time vs runtime

ARG exists only during docker build—not in running containers. ENV persists into runtime and appears in docker inspect. Neither is safe for secrets: ARG shows in history, ENV shows in inspect and child images.

ENTRYPOINT vs CMD — PID 1 semantics

Form	Example	PID 1 process	SIGTERM behavior
Exec ENTRYPOINT	ENTRYPOINT ["java","-jar","app.jar"]	java	JVM receives SIGTERM → graceful shutdown
Shell CMD	CMD java -jar app.jar	/bin/sh	SIGTERM to sh; java may not exit cleanly → SIGKILL after 10s
ENTRYPOINT + CMD	ENTRYPOINT ["java","-jar"] + CMD ["app.jar"]	java	CMD args append; override at runtime

⚠️ Pitfall

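Shell form CMD/ENTRYPOINT wraps your app in /bin/sh -c. The shell becomes PID 1, and it does not forward signals to its children. Spring Boot graceful shutdown depends on the JVM itself receiving SIGTERM: use exec form, or have a wrapper script exec the JVM. An init such as tini (docker run --init) forwards signals to its direct child and reaps zombies, so it only helps when that child is your app rather than another shell.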
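How the forms combine into the final command line can be approximated with a short Python function. This is a simplified model of the documented ENTRYPOINT/CMD interaction, with hypothetical names (`container_argv`, `run_args`). It ignores `--entrypoint` overrides and custom SHELL settings.

```python
def as_argv(value):
    """Exec form (a list) is used as-is; shell form (a str) is wrapped in sh -c."""
    if value is None:
        return []
    if isinstance(value, str):
        return ["/bin/sh", "-c", value]
    return list(value)


def container_argv(entrypoint=None, cmd=None, run_args=None):
    """Approximate the final command line; argv[0] is what becomes PID 1."""
    if isinstance(entrypoint, str):
        # Shell-form ENTRYPOINT: CMD and `docker run` arguments are not used.
        return as_argv(entrypoint)
    # Arguments given to `docker run image ...` replace CMD entirely.
    tail = list(run_args) if run_args else as_argv(cmd)
    return as_argv(entrypoint) + tail


print(container_argv(entrypoint=["java", "-jar", "app.jar"]))
print(container_argv(cmd="java -jar app.jar"))
print(container_argv(entrypoint=["java", "-jar"], cmd=["app.jar"]))
print(container_argv(entrypoint=["java", "-jar"], cmd=["app.jar"], run_args=["other.jar"]))
```

Whenever the first element is `/bin/sh`, the shell is the process that receives SIGTERM from docker stop.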

USER — non-root by default in production

Set USER before the final CMD/ENTRYPOINT. Prefer numeric UID (USER 1001) over username to avoid base-image-specific user tables. Use COPY --chown=1001:1001 during build—don't chown at runtime.

HEALTHCHECK

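Defines how Docker (and Compose or Swarm) determines whether the container is healthy. Kubernetes ignores the Dockerfile HEALTHCHECK and uses the liveness, readiness, and startup probes from the pod spec instead. Parameters: --interval, --timeout, --retries, --start-period. Unhealthy status appears in the docker ps STATUS column.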
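The way these parameters interact is easier to see as a small state machine. The Python sketch below is an approximation of the documented behavior, assuming probes finish instantly and run exactly once per interval; `health_status` is a hypothetical helper and timeouts are not modeled.

```python
def health_status(results, interval=30, retries=3, start_period=0):
    """results: probe outcomes in order (True = exit code 0).

    Returns the container status after each probe.
    """
    status, failures, started = "starting", 0, False
    history = []
    for i, ok in enumerate(results):
        elapsed = (i + 1) * interval  # first probe runs one interval after start
        if ok:
            status, failures, started = "healthy", 0, True
        elif started or elapsed > start_period:
            failures += 1
            if failures >= retries:
                status = "unhealthy"
        # else: a failure inside the start period is not counted
        history.append(status)
    return history


probes = [False, False, True, False, False, False]
print(health_status(probes, interval=10, start_period=25))
```

Failures during the start period are forgiven, one success marks the container healthy, and only `retries` consecutive failures after that flip it to unhealthy.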

🔒 Security

Secrets in layers: ARG API_KEY=xxx and ENV API_KEY=xxx persist in image history forever. Use BuildKit --mount=type=secret at build time and runtime secret injection (Vault, K8s Secrets) for credentials.
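A cheap guard is a lint step in CI that flags secret-looking ARG/ENV names. The following Python sketch is a heuristic only: the name patterns are assumptions, it does not handle line continuations, and it is no substitute for a real scanner.

```python
import re

SUSPICIOUS = re.compile(r"(SECRET|TOKEN|PASSWORD|PASSWD|API_?KEY|PRIVATE_?KEY)", re.I)
ARG_OR_ENV = re.compile(r"^\s*(ARG|ENV)\s+(.+)$", re.I)


def find_secret_like_vars(dockerfile: str):
    """Return (line_number, instruction, name) for ARG/ENV names that look secret."""
    findings = []
    for number, line in enumerate(dockerfile.splitlines(), start=1):
        match = ARG_OR_ENV.match(line)
        if not match:
            continue
        instruction, rest = match.group(1).upper(), match.group(2)
        # Handles `NAME=value`, several `A=1 B=2` pairs, and legacy `NAME value`.
        names = re.findall(r"([A-Za-z_][A-Za-z0-9_]*)\s*=", rest) or [rest.split()[0]]
        for name in names:
            if SUSPICIOUS.search(name):
                findings.append((number, instruction, name))
    return findings


sample = """\
FROM node:20-alpine
ARG API_KEY=xxx
ENV NODE_ENV=production
ENV DB_PASSWORD hunter2
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm ci
"""
for finding in find_secret_like_vars(sample):
    print(finding)
```

The secret mount on the last line is not flagged, since nothing from it is recorded in the image config or history.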

Layer explorer

The Dockerfile below shows how layers stack, like git commits building on each other. Notice how the dependency layers cache independently from source code.

# Node.js production build — cache-optimized
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json ./
# All deps: the build step usually needs devDependencies
RUN npm ci
COPY . .
RUN npm run build
USER node
CMD ["npm", "start"]

Multi-stage builds

Problem: build tools (Maven, gcc, npm devDependencies) bloat production images and expand attack surface. Solution: compile in a builder stage, copy only artifacts into a minimal runtime stage.

Named stages and COPY --from

Each FROM begins a new stage. Name stages with AS builder. COPY --from=builder pulls files from a previous stage—not from the build context. Only the final stage becomes the tagged image (unless you --target a specific stage).

Java — Maven build → JRE runtime

# syntax=docker/dockerfile:1.6
FROM maven:3.9-eclipse-temurin-21 AS builder
WORKDIR /build
COPY pom.xml .
RUN mvn -B dependency:go-offline
COPY src ./src
RUN mvn -B package -DskipTests

FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=builder /build/target/*.jar app.jar
USER 1001
ENTRYPOINT ["java", "-jar", "app.jar"]

Node.js — build → alpine runtime

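# Builder uses the same Alpine (musl) base as the runtime stage so that
# native addons in node_modules stay compatible when copied across
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
# Build, then drop devDependencies before node_modules is copied out
RUN npm run build && npm prune --omit=dev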

FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
USER node
CMD ["node", "dist/server.js"]

Go — static binary → scratch

FROM golang:1.22-alpine AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /app .

FROM scratch
COPY --from=builder /app /app
USER 65532:65532
ENTRYPOINT ["/app"]

Test stage — fail build if tests fail

FROM builder AS test
RUN mvn -B test

FROM eclipse-temurin:21-jre-alpine AS runtime
COPY --from=builder /build/target/*.jar /app.jar

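CI runs docker build --target test to execute the tests inside the build graph. The test stage is never shipped: BuildKit only builds the stages the requested target depends on, so a plain build of the final stage skips the tests unless CI targets test explicitly.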
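Which stages a given target pulls in can be worked out from the FROM and COPY --from lines alone. The Python sketch below is a simplified parser written for illustration. It ignores FROM flags such as --platform, RUN --mount=from=..., and line continuations.

```python
import re

FROM_RE = re.compile(r"^\s*FROM\s+(\S+)(?:\s+AS\s+(\S+))?", re.I)
COPY_FROM_RE = re.compile(r"--from=(\S+)")


def stage_graph(dockerfile: str) -> dict:
    """Map each stage name to the set of earlier stages it depends on."""
    graph, current = {}, None
    for line in dockerfile.splitlines():
        m = FROM_RE.match(line)
        if m:
            # Unnamed stages are addressed by their index
            base, name = m.group(1), m.group(2) or str(len(graph))
            graph[name] = {base} if base in graph else set()
            current = name
        elif current is not None:
            for ref in COPY_FROM_RE.findall(line):
                if ref in graph:  # skip external images used in --from
                    graph[current].add(ref)
    return graph


def stages_needed(graph: dict, target: str) -> set:
    """Stages that must be built to produce `target`, including itself."""
    needed, stack = set(), [target]
    while stack:
        stage = stack.pop()
        if stage not in needed:
            needed.add(stage)
            stack.extend(graph[stage])
    return needed


dockerfile = """\
FROM maven:3.9-eclipse-temurin-21 AS builder
RUN mvn -B package -DskipTests
FROM builder AS test
RUN mvn -B test
FROM eclipse-temurin:21-jre-alpine AS runtime
COPY --from=builder /build/target/*.jar /app.jar
"""
graph = stage_graph(dockerfile)
print(sorted(stages_needed(graph, "runtime")))
print(sorted(stages_needed(graph, "test")))
```

The runtime target never reaches the test stage, so tests only gate the pipeline when CI builds that stage by name.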

💡 Pro Tip

Spring Boot 2.3+ supports layered JARs: extract dependencies, spring-boot-loader, snapshot-dependencies, and application into separate COPY layers so that application code changes don't invalidate the much larger dependency layers. See Production Dockerfile Patterns.

Base image selection

Your base image is the floor for size, CVE count, and compatibility. The wrong base costs weeks of musl/glibc debugging or bloated scans. Choose deliberately per workload—not reflexively "Alpine because small."

Base	Approx size	Best for	Watch out
ubuntu:22.04 / debian:bookworm	~78 MB / ~117 MB	General purpose, apt packages, glibc	Large; many packages = more CVEs
debian:bookworm-slim	~74 MB	Debian compatibility, fewer packages	Still glibc; needs apt hygiene in RUN
alpine:3.19	~7 MB	Static binaries, Go, Node without native addons	musl ≠ glibc — Java/native libs may break
gcr.io/distroless/*	~20–50 MB	Production Java/Node/Go — no shell, minimal CVEs	No shell for debugging; use debug variants
eclipse-temurin / amazoncorretto	Varies by tag	JVM apps with vendor support	Use -jre not -jdk in runtime stage
Red Hat UBI	~80 MB	Enterprise/OpenShift, RHEL-compatible	Subscription not required to run; good governance
scratch	0 B	Static Go/Rust binaries only	No libc, no CA certs unless you COPY them

Alpine and musl gotchas

Alpine uses musl libc instead of glibc. Many prebuilt native binaries (Oracle JDK, some Python wheels, Node native modules) assume glibc. Java on Alpine needs a musl-aware build or an extra glibc compatibility layer, which often negates the size win. For JVM production, prefer gcr.io/distroless/java21-debian12, or eclipse-temurin:21-jre-alpine once your native dependencies have been tested on musl.

Distroless — Google's minimal production base

Distroless images contain only your app and runtime dependencies—no shell, no package manager, no curl. Attack surface shrinks dramatically. Debug with :debug tags (include busybox shell) during development only.

⚖️ Trade-off

Image size vs compatibility: Alpine saves MB but costs engineering time when native deps break. Distroless saves security review time but complicates ad-hoc docker exec debugging. Full Debian/Ubuntu maximizes compatibility at scan and transfer cost.

📦 Real World

Google runs distroless for most internal services. Netflix maintains curated base images with pre-approved packages and automated CVE patching. Platform teams often publish one blessed base per language—developers inherit governance by default.

Image size optimization

Smaller images pull faster, scan faster, and deploy faster. Measure first—optimize what actually matters on the critical path.

Measure before optimizing

# List images by size
docker image ls --format 'table {{.Repository}}\t{{.Tag}}\t{{.Size}}'

# Per-layer breakdown
docker history --human --no-trunc myapp:latest

# Deep analysis (install dive: https://github.com/wagoodman/dive)
dive myapp:latest

Techniques ranked by impact

Technique	Typical savings	Example
Multi-stage builds	50–90% (removes build tools)	Maven builder → JRE runtime stage
Slim/distroless base	30–70% vs full OS	gcr.io/distroless/java21-debian12
Combine RUN commands	Fewer layers; remove apt cache	rm -rf /var/lib/apt/lists/* in same RUN
.dockerignore	Smaller context + faster cache hash	Exclude .git, node_modules, target/
COPY --chown	Avoids extra chown layer	COPY --chown=1001:1001 app.jar .
docker build --squash	Merges layers (loses cache granularity)	Experimental legacy-builder flag, not a BuildKit feature; rarely needed with multi-stage

.dockerignore essentials

.git
.gitignore
node_modules
npm-debug.log
target/
*.md
.env
.env.*
coverage/
.idea/
.vscode/
**/*_test.go
Dockerfile*
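To check what such a file would exclude, the matching rules can be approximated in Python. This sketch assumes the common cases only: patterns are relative to the context root, `*` stays within one path segment, `**/` spans directories, and a matched directory excludes everything below it. Comment lines and `!` exceptions are not handled.

```python
import re


def pattern_to_regex(pattern: str):
    """Translate one ignore pattern into an anchored regular expression."""
    pattern = pattern.strip("/")
    out, i = "", 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out += "(?:.*/)?"      # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            out += ".*"
            i += 2
        elif pattern[i] == "*":
            out += "[^/]*"         # stays within one path segment
            i += 1
        elif pattern[i] == "?":
            out += "[^/]"
            i += 1
        else:
            out += re.escape(pattern[i])
            i += 1
    return re.compile(out + r"\Z")


def is_ignored(path: str, patterns) -> bool:
    parts = path.strip("/").split("/")
    # A pattern that matches any parent directory excludes everything below it.
    prefixes = ["/".join(parts[: n + 1]) for n in range(len(parts))]
    regexes = [pattern_to_regex(p) for p in patterns]
    return any(rx.match(prefix) for rx in regexes for prefix in prefixes)


patterns = [".git", "node_modules", "target/", "*.md", ".env", ".env.*",
            "**/*_test.go", "Dockerfile*"]

for path in ["src/index.js", "node_modules/lodash/index.js", "README.md",
             "docs/guide.md", "pkg/util/strings_test.go", ".env.local"]:
    print(path, is_ignored(path, patterns))
```

One detail this surfaces: `*.md` only matches Markdown files at the context root, so nested docs need `**/*.md`.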

⚠️ Pitfall

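apt-get without cache cleanup: RUN apt-get update && apt-get install -y curl leaves the package index under /var/lib/apt/lists/ in the layer forever, often tens of MB. Removing it in a later RUN does not help, because the earlier layer still contains the files. Always delete package manager caches in the same RUN instruction.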
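The arithmetic behind this pitfall, as a toy Python model. The file sizes are invented for illustration; the point is only that an image's stored size is the sum of its layers, while a later delete changes just the merged view.

```python
def image_size(layers) -> int:
    """Bytes stored: every layer's added bytes count, deletions never subtract."""
    return sum(sum(layer["adds"].values()) for layer in layers)


def visible_size(layers) -> int:
    """Bytes visible in the merged root filesystem."""
    files = {}
    for layer in layers:
        for path in layer.get("deletes", []):
            files.pop(path, None)
        files.update(layer["adds"])
    return sum(files.values())


MB = 1024 * 1024

# RUN apt-get update && apt-get install ...   then a separate   RUN rm -rf ...
separate = [
    {"adds": {"/var/lib/apt/lists/index": 40 * MB, "/usr/bin/curl": 1 * MB}},
    {"adds": {}, "deletes": ["/var/lib/apt/lists/index"]},
]
# Install and cleanup in the same RUN: the index never lands in a layer
combined = [
    {"adds": {"/usr/bin/curl": 1 * MB}},
]

print(image_size(separate) // MB, visible_size(separate) // MB, image_size(combined) // MB)
```

Both variants show the same files inside the container, but only the combined one avoids shipping the index.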

💡 Pro Tip

dive shows layer efficiency score and wasted space. Aim for >95% efficiency in production images. If one RUN layer adds 200 MB, that's your optimization target—not shaving 1 MB off a label.

BuildKit

BuildKit is Docker's modern build engine (default Docker 23+). It parallelizes independent stages, mounts caches and secrets without polluting layers, and exports build artifacts flexibly.

Enable BuildKit

# Per-build
DOCKER_BUILDKIT=1 docker build -t myapp .

# Permanent (daemon.json)
# { "features": { "buildkit": true } }

# Dockerfile syntax directive (unlocks latest features)
# syntax=docker/dockerfile:1.6

Cache mounts — persist across builds

Unlike regular layers, cache mounts are not committed to the image. Maven, npm, and Go module caches survive between builds without bloating the final image.

# syntax=docker/dockerfile:1.6
FROM maven:3.9-eclipse-temurin-21 AS builder
WORKDIR /build
COPY pom.xml .
RUN --mount=type=cache,target=/root/.m2 \
    mvn -B dependency:go-offline
COPY src ./src
RUN --mount=type=cache,target=/root/.m2 \
    mvn -B package -DskipTests

Secret mounts — never in layers

# Build with secret (not stored in image)
docker build --secret id=npmrc,src=$HOME/.npmrc -t myapp .

# Dockerfile
# RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm ci

SSH mounts — private git clones

RUN --mount=type=ssh forwards your SSH agent into the build—clone private repos without embedding keys in the image. Build with docker build --ssh default.

Registry cache — fast CI

docker buildx build \
  --cache-from type=registry,ref=myregistry/myapp:buildcache \
  --cache-to type=registry,ref=myregistry/myapp:buildcache,mode=max \
  --push -t myregistry/myapp:latest .

BuildKit features reference

Feature	Purpose	Example
Parallel stages	Independent stages build concurrently	Frontend + backend multi-stage in one Dockerfile
--mount=type=cache	Persistent package manager caches	target=/root/.m2, /root/.npm
--mount=type=secret	Build-time credentials	NPM token, pip index password
--mount=type=ssh	SSH agent forwarding	Private GitHub dependencies
--output type=local	Export files without image	dest=./dist for static sites
Inline cache	Embed cache metadata in pushed image	--cache-to type=inline
Provenance/SBOM	Supply chain attestations	--attest type=sbom

flowchart LR
  CTX[Build context]
  BK[BuildKit solver]
  S1[Stage: builder]
  S2[Stage: runtime]
  CACHE[(Cache mounts\n.m2 / npm)]
  SEC[Secret mounts]
  REG[(Registry cache)]
  IMG[Final image]
  CTX --> BK
  BK --> S1
  BK --> S2
  S1 --> CACHE
  S1 --> SEC
  BK --> REG
  S2 --> IMG

🔬 Under the Hood

BuildKit uses a DAG solver—only rebuilds nodes whose inputs changed. Legacy builder was linear instruction-by-instruction. That's why independent stages and cache mounts dramatically outperform old docker build on large projects.
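The scheduling idea can be sketched with Python's standard `graphlib` module (Python 3.9+). This is only an illustration of topological scheduling on a hypothetical three-stage graph, not BuildKit's solver, which works at the level of individual operations and cache lookups.

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph: stage -> stages it depends on
stages = {
    "frontend": set(),
    "backend": set(),
    "runtime": {"frontend", "backend"},
}


def build_waves(graph: dict) -> list:
    """Group stages into waves; stages in one wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(build_waves(stages))
```

A linear builder would run the three stages one after another; here the frontend and backend stages land in the same wave and could build concurrently.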

🎯 Interview Tip

"How do you pass secrets to a Docker build?" — Wrong: ARG/ENV. Right: BuildKit --mount=type=secret (build) + runtime secret managers (deploy). Mention secrets never appear in docker history.