Workloads

Pod — The Atomic Unit

A Pod is the smallest deployable unit in Kubernetes: not a single container, but a group of one or more containers that share a network namespace, an IPC namespace, and, optionally, volumes. The scheduler places pods; controllers create and manage them.

Shared network & IPC

All containers in a pod share the same Pod IP and port space, so they can reach each other over localhost (and cannot bind the same port twice). They can also share a process (PID) namespace by setting shareProcessNamespace: true, which is useful for sidecar debugging or legacy patterns. Volumes declared at the pod spec level are mounted into the containers that list them under volumeMounts.

Spec anatomy

Field	Purpose
spec.containers[]	Main application containers (required, at least one)
spec.initContainers[]	Run to completion before app containers start
spec.volumes[]	Shared storage—emptyDir, PVC, ConfigMap, Secret
spec.nodeName	Bypass scheduler—bind pod to specific node (rare)
spec.restartPolicy	Always (default), OnFailure, Never
spec.serviceAccountName	Identity for API access and image pull secrets
spec.terminationGracePeriodSeconds	Time between SIGTERM and SIGKILL (default 30s)

Lifecycle phases

Pod status.phase is coarse-grained. Fine-grained readiness comes from status.conditions and per-container state.

stateDiagram-v2
  [*] --> Pending: scheduled / pulling image
  Pending --> Running: all containers started
  Running --> Succeeded: all exit 0 (restartPolicy Never/OnFailure)
  Running --> Failed: container error / OOM / evicted
  Running --> Unknown: node lost contact
  Succeeded --> [*]
  Failed --> [*]

Phase	Meaning	Typical cause
Pending	Accepted but not all containers running	Scheduling, image pull, init containers, PVC binding
Running	At least one container running or starting	Normal operation
Succeeded	All containers terminated successfully	Job pods, one-shot tasks
Failed	All containers terminated; at least one exited in failure	Non-zero exit or OOMKilled with no restart (restartPolicy: Never), eviction

Conditions & container states

PodScheduled — scheduler assigned a node
Initialized — all init containers completed
ContainersReady — all containers pass readiness probes
Ready — pod can receive Service traffic

Container state is one of: Waiting (reason: ContainerCreating, CrashLoopBackOff, ImagePullBackOff), Running, or Terminated (exit code, signal, OOM flag).
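
To make the split between pod phase and per-container state concrete, here is a minimal Python sketch that summarizes a pod object shaped like the output of kubectl get pod -o json. The sample pod, its container names, and the summarize helper are hypothetical; only the field names (status.phase, status.containerStatuses[].ready, state.waiting.reason) come from the Pod API.

```python
def summarize(pod):
    """Return (phase, READY column, waiting reasons) for a pod dict."""
    status = pod.get("status", {})
    containers = status.get("containerStatuses", [])
    ready = sum(1 for c in containers if c.get("ready"))
    # Collect the reason for every container currently in the Waiting state.
    waiting = {
        c["name"]: c["state"]["waiting"].get("reason", "")
        for c in containers
        if "waiting" in c.get("state", {})
    }
    return status.get("phase"), f"{ready}/{len(containers)}", waiting


# Hypothetical sample: the main container is up, the sidecar is crash-looping.
sample_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "app", "ready": True,
             "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}}},
            {"name": "sidecar", "ready": False,
             "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
        ],
    }
}

print(summarize(sample_pod))
```

The phase stays Running even though only one of two containers is ready, which is why the READY column and the container states are the better place to look when debugging.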

$ kubectl get pod web-7d4f8b9c-xk2lm -o wide
NAME                 READY   STATUS    RESTARTS   AGE   IP            NODE
web-7d4f8b9c-xk2lm   1/1     Running   0          5m    10.244.1.15   worker-2
$ kubectl describe pod web-7d4f8b9c-xk2lm | grep -A5 Conditions
$ kubectl get pod web-7d4f8b9c-xk2lm -o jsonpath='{.status.podIP}'
10.244.1.15

$ oc get pod web-7d4f8b9c-xk2lm -o wide
$ oc describe pod web-7d4f8b9c-xk2lm
$ oc get pod web-7d4f8b9c-xk2lm -o jsonpath='{.status.podIP}'

⚠️ Pitfall

Never run bare pods in production. A standalone Pod is not self-healing—delete it and it stays gone. Node failure loses the workload permanently. Always use a controller (Deployment, StatefulSet, etc.) that owns the pod template.

🔬 Under the Hood

The CNI plugin, invoked when the kubelet asks the container runtime to create the pod sandbox, assigns the pod IP from the range allocated to that node. When the pod dies, the IP is released, so never hardcode pod IPs. Services provide stable virtual IPs; headless Services return per-pod DNS records, which StatefulSets rely on.

🎯 Interview Tip

"What's the difference between pod phase and container state?" — Phase is pod-level summary; container state is per-container (waiting/running/terminated). A pod can be Running while a sidecar is in CrashLoopBackOff if the main container is up but not all are ready (READY 1/2).

Init Containers

Init containers run sequentially before app containers start. Each must exit successfully before the next begins. Use them for setup tasks that must complete before the main process runs.

Sequential execution

Order follows the array in spec.initContainers. If an init container fails (non-zero exit), the kubelet restarts it, with backoff on repeated failures, until it succeeds. The exception is restartPolicy: Never, where a failed init container marks the whole pod as Failed.
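
The sequencing rule can be sketched as a small Python simulation. This is an illustration under stated assumptions, not kubelet code: the run_init_containers helper and the per-attempt exit codes are hypothetical inputs, and backoff timing is not modeled.

```python
def run_init_containers(exit_codes, restart_policy="Always"):
    """Simulate init containers running in order.

    exit_codes maps each init container name to the exit codes of its
    successive attempts (hypothetical input). Returns (outcome, attempt log).
    """
    log = []
    for name, attempts in exit_codes.items():
        for code in attempts:
            log.append((name, code))
            if code == 0:
                break  # success: the next init container may start
            if restart_policy == "Never":
                return "Failed", log  # no retry, the pod is marked Failed
        else:
            # Attempts ran out without success: still retrying with backoff.
            return "Pending", log
    return "Initialized", log


print(run_init_containers({"wait-for-db": [1, 1, 0], "migrate": [0]}))
print(run_init_containers({"wait-for-db": [1, 0]}, restart_policy="Never"))
```

The second call stops after the first failure because restartPolicy: Never leaves no room for a retry.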

Common use cases

Wait for dependencies: a database accepting connections, service mesh proxy registration
Fetch secrets or config from external systems into a shared emptyDir
Run database schema migrations before app starts
Set filesystem permissions on volumes (fix ownership for non-root UIDs)
Clone git repos or download artifacts into a shared volume

apiVersion: v1
kind: Pod
metadata:
  name: app-with-init
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command: ['sh', '-c', 'until nc -z postgres 5432; do sleep 2; done']
    - name: migrate
      image: myapp:2.1.0
      command: ['./migrate.sh']
  containers:
    - name: app
      image: myapp:2.1.0
      ports:
        - containerPort: 8080

Sidecar containers (KEP-753)

Kubernetes 1.29+ supports native sidecars (the SidecarContainers feature is enabled by default from 1.29): init containers with restartPolicy: Always start before app containers, keep running alongside them, and terminate after app containers exit. This gives service mesh proxies and log shippers a proper lifecycle.

spec:
  initContainers:
    - name: istio-proxy
      image: istio/proxyv2:1.22
      restartPolicy: Always          # KEP-753 sidecar
      ports:
        - containerPort: 15090
  containers:
    - name: app
      image: myapp:2.1.0

💡 Pro Tip

Before KEP-753, teams used regular containers as sidecars—but they started in parallel with the app, causing race conditions. Prefer restartPolicy: Always on init containers for mesh/logging sidecars on K8s 1.29+.

⚖️ Trade-off

Init containers vs Jobs: Init runs inside the pod lifecycle—good for per-pod setup. A Kubernetes Job is better for one-time cluster-wide migrations or batch prep that shouldn't block every pod restart.

Deployments

A Deployment declares desired state for stateless applications. It owns a ReplicaSet, which owns Pods. Change the pod template → new ReplicaSet → rolling update replaces old pods incrementally.

ReplicaSet + rolling update

Each template change creates a new ReplicaSet with a unique pod-template-hash label. The Deployment controller scales the new RS up and the old RS down according to the update strategy.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels:
    app: web
spec:
  replicas: 3
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # max pods below desired during update
      maxSurge: 1          # max extra pods above desired during update
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5

Update strategies

Strategy	Behavior	When to use
RollingUpdate	Replace pods incrementally via maxUnavailable/maxSurge	Default—zero-downtime with readiness probes
Recreate	Terminate all old pods, then create new ones	Single-replica apps, incompatible versions, dev/staging

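maxUnavailable and maxSurge accept integers or percentages (both default to 25%); a maxSurge percentage rounds up and a maxUnavailable percentage rounds down. With 3 replicas, maxUnavailable: 1, and maxSurge: 1, the rollout may briefly run 4 pods (3 old + 1 new) or have only 2 available (2 old while 1 new starts).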
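
The bounds are simple arithmetic. A minimal Python sketch follows, assuming the documented rounding (maxSurge percentages round up, maxUnavailable percentages round down); the helper names resolve and rollout_bounds are made up for illustration.

```python
def resolve(value, replicas, round_up):
    """Turn an int or an 'N%' string into an absolute pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_unavailable, max_surge):
    """Return (minimum available pods, maximum total pods) during a rollout."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    return replicas - unavailable, replicas + surge


print(rollout_bounds(3, 1, 1))            # (2, 4)
print(rollout_bounds(10, "25%", "25%"))   # (8, 13)
```

With 10 replicas and the 25% defaults, 2.5 becomes 2 unavailable and 3 surge, so the rollout stays between 8 available and 13 total pods.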

Rollback, pause, and revision history

revisionHistoryLimit — old ReplicaSets kept for rollback (default 10)
kubectl rollout pause — freeze mid-update for canary testing
kubectl rollout resume — continue paused rollout
kubectl rollout undo — revert to previous ReplicaSet

$ kubectl set image deployment/web nginx=nginx:1.26
$ kubectl rollout status deployment/web --timeout=5m
$ kubectl rollout history deployment/web
$ kubectl rollout undo deployment/web --to-revision=2
$ kubectl rollout pause deployment/web
$ kubectl rollout resume deployment/web

$ oc set image deployment/web nginx=nginx:1.26
$ oc rollout status deployment/web
$ oc rollout history deployment/web
$ oc rollout undo deployment/web --to-revision=2

⚙️ Config

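Production rolling updates require readiness probes. Without them, Kubernetes considers a pod ready as soon as its containers start, so traffic hits half-initialized apps. Set progressDeadlineSeconds (default 600) so a stuck rollout is reported as failed (Progressing condition with reason ProgressDeadlineExceeded); Kubernetes does not roll it back automatically.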

📦 Real World

Teams using GitOps (ArgoCD/Flux) rarely run kubectl set image manually—image tag changes flow through git. Rollback becomes git revert plus sync. Keep rollout undo in your incident playbook for emergencies.

⚠️ Pitfall

Immutable label selectors: you cannot change spec.selector on an existing apps/v1 Deployment. Changing pod labels so that they no longer match the selector orphans those pods. Use kubectl apply --server-side carefully.

StatefulSets

StatefulSets manage pods that need stable network identity and stable storage. Pods get predictable names (web-0, web-1, web-2) and persistent volumes that follow them across reschedules.

Stable names & DNS

Pod hostname is <statefulset-name>-<ordinal>. With a headless Service named web, DNS records resolve per pod:

web-0.web.default.svc.cluster.local
web-1.web.default.svc.cluster.local
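
The naming scheme is mechanical, so it can be generated. A small Python sketch, assuming the default cluster domain cluster.local; the function name pod_dns_names is hypothetical.

```python
def pod_dns_names(statefulset, service, replicas,
                  namespace="default", cluster_domain="cluster.local"):
    """Per-pod DNS names: <pod>.<headless service>.<namespace>.svc.<domain>."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


for name in pod_dns_names("web", "web", 2):
    print(name)
```

Clients such as Kafka or ZooKeeper peers can build their seed lists this way instead of discovering pod IPs.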

Ordered startup & termination

Pods start sequentially: web-0 must be Running and Ready before web-1 starts. Scale-down terminates highest ordinal first (web-2 before web-1).
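
As a rough sketch of the ordering rule, the function below lists the ordinals a StatefulSet acts on when scaling under the default OrderedReady policy. The scaling_order helper is hypothetical, and the readiness wait between steps is not modeled.

```python
def scaling_order(current, desired):
    """Ordinals acted on, in order, when scaling from current to desired."""
    if desired > current:
        # Scale up: lowest missing ordinal first.
        return [("create", i) for i in range(current, desired)]
    # Scale down: highest ordinal first.
    return [("delete", i) for i in range(current - 1, desired - 1, -1)]


print(scaling_order(0, 3))  # create 0, then 1, then 2
print(scaling_order(3, 1))  # delete 2, then 1
```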

apiVersion: v1
kind: Service
metadata:
  name: kafka
spec:
  clusterIP: None          # headless — required for StatefulSet DNS
  selector:
    app: kafka
  ports:
    - port: 9092
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka       # links to headless Service
  replicas: 3
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: apache/kafka:3.7.0
          ports:
            - containerPort: 9092
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi

Update strategies

Strategy	Behavior
RollingUpdate	Update pods in reverse ordinal order (default)
OnDelete	Manual—delete each pod to trigger update
podManagementPolicy: Parallel	Not an update strategy: a separate field that creates or deletes all pods in parallel during scaling instead of one ordinal at a time (e.g. sharded MongoDB)

Use cases

Kafka — broker ID tied to ordinal; persistent log directories
ZooKeeper / etcd — cluster membership requires stable identity
Databases — PostgreSQL, MongoDB replica sets (often via operators)

🎯 Interview Tip

"Deployment vs StatefulSet?" — Deployment: interchangeable pods, random names, shared storage optional. StatefulSet: stable hostname, ordinal scaling, PVC per pod via volumeClaimTemplates, requires headless Service for per-pod DNS.

⚖️ Trade-off

Running databases in Kubernetes is debated. StatefulSets solve scheduling and storage—not backup, failover logic, or query routing. Production databases usually use operators (CloudNativePG, MongoDB Community Operator) or managed services.

DaemonSets

A DaemonSet ensures exactly one pod per matching node; a nodeSelector or node affinity in the pod template narrows the set, for example to GPU nodes or to a single zone. When nodes join the cluster, DaemonSet pods are scheduled automatically; when nodes leave, those pods are garbage-collected.

Use cases

Node-level log collectors (Fluent Bit, Vector, Filebeat)
Monitoring agents (node_exporter, Datadog agent)
CNI network plugins (Calico, Cilium node agents)
Storage daemons (Ceph OSD, GlusterFS)
Security scanners and compliance agents

Update strategy

Strategy	Behavior
RollingUpdate	Replace pods one node at a time (default)
OnDelete	Update only when pod manually deleted

Tolerations

DaemonSets commonly run on control plane nodes too. Add tolerations for control-plane taints:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/master   # legacy taint name on older clusters
          operator: Exists
          effect: NoSchedule
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.0
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log

$ kubectl get daemonset -A
$ kubectl rollout status daemonset/fluent-bit -n logging
$ kubectl describe daemonset fluent-bit -n logging

$ oc get daemonset -A
$ oc rollout status daemonset/fluent-bit -n logging

🔬 Under the Hood

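The DaemonSet controller creates one pod per eligible node and adds a nodeAffinity term that matches that node's name, so each pod can only land on its target node. In current Kubernetes the default scheduler performs the binding; only older releases had the controller set spec.nodeName itself and bypass the scheduler.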

📦 Real World

OpenShift ships cluster logging and monitoring as DaemonSets/Operators managed by the platform. Custom DaemonSets on OCP need SCC grants—hostPath log mounts often require privileged or a custom SCC.

Jobs & CronJobs

Jobs run pods to completion—batch processing, migrations, one-off tasks. CronJobs wrap Jobs on a schedule, like cron on a single server but distributed across the cluster.

Job spec fields

Field	Purpose
completions	Successful pod completions required (default 1)
parallelism	Concurrent pods running at once (default 1)
backoffLimit	Retries before marking Job failed (default 6)
activeDeadlineSeconds	Max duration—terminates running Job after timeout
ttlSecondsAfterFinished	Auto-delete Job after completion (cleanup)

apiVersion: batch/v1
kind: Job
metadata:
  name: etl-daily
spec:
  completions: 5
  completionMode: Indexed    # required: only Indexed Jobs set the completion-index annotation
  parallelism: 2
  backoffLimit: 3
  activeDeadlineSeconds: 3600
  ttlSecondsAfterFinished: 86400
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: etl
          image: myorg/etl:1.4.0
          args: ["--shard", "$(JOB_COMPLETION_INDEX)"]
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
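
How completions and parallelism interact can be sketched in a few lines of Python. This assumes a fixed completion count and that every pod in a wave succeeds at the same time, which real workloads will not do exactly; the helper names are hypothetical.

```python
def active_pods(completions, parallelism, succeeded):
    """Pods the Job runs at once: capped by parallelism and by remaining work."""
    return max(0, min(parallelism, completions - succeeded))


def waves(completions, parallelism):
    """Pod counts per wave, assuming each wave finishes together."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    succeeded, plan = 0, []
    while succeeded < completions:
        running = active_pods(completions, parallelism, succeeded)
        plan.append(running)
        succeeded += running
    return plan


print(waves(5, 2))  # [2, 2, 1]
```

With completions: 5 and parallelism: 2 as in the manifest above, the last wave runs a single pod because only one completion remains.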

CronJob patterns

concurrencyPolicy controls overlapping runs:

Allow — multiple Jobs can run concurrently (default)
Forbid — skip new run if previous still running
Replace — cancel running Job, start new one
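
The three policies reduce to a small decision table. The Python sketch below is only an illustration of that table; on_schedule and the action strings are made-up names, not controller code.

```python
def on_schedule(policy, previous_running):
    """Actions taken when a scheduled time arrives."""
    if not previous_running or policy == "Allow":
        return ["start new Job"]
    if policy == "Forbid":
        return []  # this run is skipped
    if policy == "Replace":
        return ["cancel running Job", "start new Job"]
    raise ValueError(f"unknown concurrencyPolicy: {policy}")


for policy in ("Allow", "Forbid", "Replace"):
    print(policy, on_schedule(policy, previous_running=True))
```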

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-db
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: myorg/pg-backup:2.0

$ kubectl create job etl-manual --from=cronjob/backup-db
$ kubectl get jobs -w
$ kubectl logs job/etl-daily
$ kubectl delete job etl-daily

$ oc create job etl-manual --from=cronjob/backup-db
$ oc get jobs -w

💡 Pro Tip

For long-running scheduled work, consider a workflow engine (Temporal, Argo Workflows) instead of CronJobs. CronJobs lack the dependency graphs, fine-grained retry policies, and observability that workflow engines provide.

⚠️ Pitfall

Job pods require restartPolicy: Never or OnFailure—not Always. Forgotten completed Jobs accumulate; set ttlSecondsAfterFinished or use a cleanup CronJob.

Horizontal Pod Autoscaler (HPA)

HPA automatically adjusts replicas on Deployments, StatefulSets, or ReplicaSets based on observed metrics. Scale out when load rises; scale in when it drops—within min/max bounds.

Metric sources

CPU / memory — resource metrics served by metrics-server (a cluster add-on, not part of core Kubernetes)
Custom metrics — Prometheus adapter, Datadog, etc. (e.g. requests/sec)
External metrics — SQS queue depth, Pub/Sub backlog
KEDA — event-driven autoscaling ScaledObject CRD (Kafka lag, cron, cloud queues)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15

Behavior & stabilization

behavior (autoscaling/v2) controls scale-up/down velocity and stabilization windows— prevents flapping when metrics oscillate. Scale-down typically uses a longer window than scale-up.

CPU requires requests

HPA CPU utilization is actual usage ÷ requested CPU, averaged across the target's pods. If a container has no resources.requests.cpu, utilization is undefined for its pod and the HPA takes no action on that metric. Install metrics-server in every cluster; without it, HPA shows <unknown> metrics.
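
The scaling decision follows the documented formula desiredReplicas = ceil(currentReplicas × currentMetric ÷ targetMetric). A minimal Python sketch, assuming the default 10% tolerance and ignoring the behavior policies and stabilization windows; the function name and sample numbers are illustrative.

```python
import math


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count the HPA formula yields, clamped to min/max bounds."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(3, 45, 70, 2, 20))   # 2: load is low, scale in
print(desired_replicas(4, 140, 70, 2, 20))  # 8: usage is twice the target
print(desired_replicas(3, 72, 70, 2, 20))   # 3: within tolerance
```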

$ kubectl autoscale deployment web --cpu-percent=70 --min=2 --max=10
$ kubectl get hpa web-hpa -w
NAME      REFERENCE        TARGETS         MINPODS   MAXPODS   REPLICAS
web-hpa   Deployment/web   45%/70%         2         20        3
$ kubectl describe hpa web-hpa

$ oc autoscale deployment web --cpu-percent=70 --min=2 --max=10
$ oc get hpa web-hpa -w

⚙️ Config

Set minReplicas ≥ 2 for HA. Pair HPA with PDB (next section) so scale-down and node drains respect availability. For Kafka consumers, prefer KEDA over CPU-based HPA—CPU doesn't reflect lag.

🎯 Interview Tip

"Why isn't HPA scaling?" — Checklist: metrics-server running? CPU requests set? Target utilization reachable? minReplicas already met? Custom metrics adapter registered? HPA events in kubectl describe hpa.

Vertical Pod Autoscaler (VPA)

VPA adjusts CPU and memory requests/limits for containers—not replica count. It learns from historical usage and recommends or applies right-sized resources. Requires the VPA controller (not built into core K8s).

Update modes

Mode	Behavior
Off	Compute recommendations only—display in VPA status, no changes
Initial	Set resources on pod creation only; no updates to running pods
Auto	Evict and recreate pods with updated requests (disruptive)
Recreate	Like Auto—evicts pods when resources change

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"          # start with recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: nginx
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi

Conflict with HPA on CPU

Do not run VPA Auto and HPA on the same CPU metric—they fight: VPA changes requests (denominator), HPA recalculates utilization (numerator/denominator). Common patterns:

HPA on custom/external metrics + VPA on CPU/memory
VPA in Off mode for recommendations; apply manually in git
HPA on CPU + VPA with controlledResources: [memory] only
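
The conflict is easiest to see with numbers. In the Python sketch below, the per-pod usage and request values are hypothetical and the 70% target is borrowed from the HPA manifest earlier in this section; it only shows how a VPA request change moves the utilization that HPA reads.

```python
def cpu_utilization(usage_millicores, request_millicores):
    """Utilization percentage as HPA computes it: usage / request."""
    return 100 * usage_millicores / request_millicores


TARGET = 70  # averageUtilization from the HPA spec

before = cpu_utilization(80, 100)  # pod uses 80m of a 100m request
after = cpu_utilization(80, 200)   # VPA doubles the request, usage unchanged

print(before, before > TARGET)  # 80.0 True  -> HPA wants to scale out
print(after, after > TARGET)    # 40.0 False -> HPA now wants to scale in
```

Load did not change between the two lines; only the denominator did, which is why the two autoscalers end up chasing each other.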

⚖️ Trade-off

VPA vs manual sizing: VPA excels at workloads with unpredictable usage (JVM warm-up, batch spikes). For stable microservices, git-managed requests from load tests are simpler and don't cause eviction churn. Kubernetes in-place pod resize (alpha in 1.27, beta since 1.33) may reduce VPA disruption as support matures.

🔬 Under the Hood

VPA Recommender reads metrics from metrics-server (and optionally Prometheus). Updater evicts pods when Auto mode applies new requests. Admission controller injects resources at pod create time.

Pod Disruption Budgets (PDB)

PDBs limit voluntary disruptions that go through the eviction API: node drains, cluster upgrades, and autoscaler removal of nodes. They do not stop involuntary disruptions (node hardware failure, kubelet killing OOM pods), and deleting a pod or its controller directly bypasses them.

minAvailable vs maxUnavailable

Specify exactly one (not both):

minAvailable — minimum pods that must stay available (integer or %)
maxUnavailable — maximum pods that can be unavailable during disruption
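
How many evictions a budget currently allows is plain subtraction. The Python sketch below covers integer values only (percentage rounding is left out), and the function name allowed_disruptions is hypothetical.

```python
def allowed_disruptions(healthy, expected, min_available=None, max_unavailable=None):
    """Evictions a PDB permits right now, for integer budgets.

    healthy: pods currently ready; expected: replicas the controller wants.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available / max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    return max(0, healthy - desired_healthy)


print(allowed_disruptions(3, 3, min_available=2))    # 1
print(allowed_disruptions(1, 1, min_available=1))    # 0: a drain would block
print(allowed_disruptions(4, 4, max_unavailable=1))  # 1
```

A result of 0 means every eviction request is refused until more pods become healthy.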

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
---
# Alternative: percentage-based
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb-pct
spec:
  maxUnavailable: 25%
  selector:
    matchLabels:
      app: web

Interaction with node drain

kubectl drain evicts pods voluntarily. The eviction API checks PDBs— if evicting would violate minAvailable, the drain blocks or waits. Always create PDBs before cluster upgrades on production workloads.

$ kubectl get pdb
NAME      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
web-pdb   2               N/A               1                     3d
$ kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data
evicting pod default/web-7d4f8b9c-xk2lm
error when evicting pods/"web-abc123" - cannot evict as it would violate PDB "web-pdb"

$ oc get pdb
$ oc adm drain worker-2 --ignore-daemonsets --delete-emptydir-data

⚠️ Pitfall

PDB with single replica — minAvailable: 1 on a 1-replica Deployment blocks all drains. Either run ≥2 replicas for HA or accept maintenance downtime. PDBs on Jobs are usually meaningless.

📦 Real World

Platform teams enforce PDB presence via Kyverno/OPA policies before workloads reach production namespaces. During EKS/GKE/OCP upgrades, drains stuck on PDB violations are a common overnight page, typically for stateless apps running too few replicas.

OpenShift Workload Additions

OpenShift extends Kubernetes with legacy and developer-centric workload APIs. Modern OCP clusters prefer standard Deployments, but you'll encounter DeploymentConfigs, ImageStreams, and BuildConfigs in brownfield environments.

DeploymentConfig (legacy)

Pre-Deployment OpenShift resource with built-in rollout triggers (ConfigChange, ImageChange). Uses ReplicationControllers instead of ReplicaSets. Deprecated—migrate to apps/v1 Deployment + triggers via ArgoCD or OCP GitOps.

ImageStream

Abstraction over container images—tracks tags, mirrors external registries, triggers redeploys when tags change. Internal registry: image-registry.openshift-image-registry.svc:5000.

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: myapp
spec:
  lookupPolicy:
    local: true
  tags:
    - name: latest
      from:
        kind: DockerImage
        name: quay.io/myorg/myapp:2.1.0

BuildConfig

Builds images inside the cluster—Source-to-Image (S2I), Docker, Pipeline strategies. Output pushes to ImageStreamTag, which can trigger Deployment rollout.

$ # kubectl has no equivalent — use CI/CD or buildah locally
$ kubectl create deployment web --image=nginx:1.25
$ kubectl rollout status deployment/web

$ oc new-app --name=web nginx:1.25
→ ImageStream, Build (optional), Deployment, Service created
$ oc start-build myapp --from-dir=. --follow
$ oc rollout status dc/web
$ oc rollout history dc/web
$ oc rollout undo dc/web
$ oc set image dc/web web=myapp:latest --trigger

🔴 OpenShift

oc new-app is the fastest onboarding path: it creates a Deployment (or DeploymentConfig in older templates), Service, and ImageStream in one command; add a Route afterwards with oc expose service. For production, replace with GitOps manifests and external CI building to Quay/ECR.

🎯 Interview Tip

"Deployment vs DeploymentConfig?" — DeploymentConfig is OCP-specific, RC-based, supports ImageChange triggers natively. Deployment is portable K8s standard with ReplicaSets. Red Hat recommends Deployments for new workloads on OCP 4.x.

💡 Pro Tip

oc rollout works on Deployments, DeploymentConfigs, and DaemonSets. Use oc get is to inspect ImageStream tags; oc describe bc for build history and triggers.