Workloads
Everything Kubernetes runs is ultimately a Pod—but you almost never create bare pods in production. Controllers (Deployment, StatefulSet, DaemonSet, Job) encode intent: how many replicas, how to roll out, how to heal, and how to scale. This chapter covers the workload API surface from atomic pods through autoscaling and disruption budgets, plus OpenShift-specific workload primitives.
Pod — The Atomic Unit
A Pod is the smallest deployable unit in Kubernetes: not a single container, but a group of one or more containers that share a network namespace, an IPC namespace, and optionally volumes. The scheduler places pods; controllers create and manage them.
Shared network & IPC
All containers in a pod share the same Pod IP and port space, so they can reach each other over localhost. They also share an IPC namespace by default. Setting shareProcessNamespace: true additionally shares the PID namespace, which is useful for sidecar debugging or legacy patterns. Volumes declared at the pod spec level are mounted only into the containers that list them under volumeMounts.
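The sharing described above can be sketched as a two-container pod. This is a minimal illustration, not a production manifest: the pod name, the helper container, and its polling loop are assumptions made for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-demo              # hypothetical name
spec:
  shareProcessNamespace: true    # containers can see each other's processes
  volumes:
  - name: scratch
    emptyDir: {}
  containers:
  - name: app
    image: nginx:1.25
    ports:
    - containerPort: 80
    volumeMounts:
    - name: scratch
      mountPath: /usr/share/nginx/html
  - name: helper
    image: busybox:1.36
    # Writes a page into the shared volume, then polls the app over localhost.
    command: ['sh', '-c', 'echo hello > /data/index.html; while true; do wget -qO- http://localhost:80 >/dev/null; sleep 30; done']
    volumeMounts:
    - name: scratch
      mountPath: /data
```

Both containers mount the same emptyDir at different paths, and the helper reaches nginx on localhost because the network namespace is shared.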
Spec anatomy
| Field | Purpose |
|---|---|
| spec.containers[] | Main application containers (required, at least one) |
| spec.initContainers[] | Run to completion before app containers start |
| spec.volumes[] | Shared storage—emptyDir, PVC, ConfigMap, Secret |
| spec.nodeName | Bypass scheduler—bind pod to specific node (rare) |
| spec.restartPolicy | Always (default), OnFailure, Never |
| spec.serviceAccountName | Identity for API access and image pull secrets |
| spec.terminationGracePeriodSeconds | Time between SIGTERM and SIGKILL (default 30s) |
Lifecycle phases
Pod status.phase is coarse-grained. Fine-grained readiness comes from status.conditions and per-container state.
stateDiagram-v2
    [*] --> Pending: scheduled / pulling image
    Pending --> Running: all containers started
    Running --> Succeeded: all exit 0 (restartPolicy Never/OnFailure)
    Running --> Failed: container error / OOM / evicted
    Running --> Unknown: node lost contact
    Succeeded --> [*]
    Failed --> [*]
| Phase | Meaning | Typical cause |
|---|---|---|
| Pending | Accepted but not all containers running | Scheduling, image pull, init containers, PVC binding |
| Running | At least one container running or starting | Normal operation |
| Succeeded | All containers terminated successfully | Job pods, one-shot tasks |
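| Failed | All containers terminated; at least one failed | Non-zero exit or OOMKilled with restartPolicy: Never, eviction |
| Unknown | Pod state could not be obtained | Node lost contact with the control plane |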
Conditions & container states
- PodScheduled — scheduler assigned a node
- Initialized — all init containers completed
- ContainersReady — all containers pass readiness probes
- Ready — pod can receive Service traffic
Container state is one of: Waiting (reason: ContainerCreating, CrashLoopBackOff, ImagePullBackOff), Running, or Terminated (exit code, signal, OOM flag).
$ kubectl get pod web-7d4f8b9c-xk2lm -o wide
NAME                 READY   STATUS    RESTARTS   AGE   IP            NODE
web-7d4f8b9c-xk2lm   1/1     Running   0          5m    10.244.1.15   worker-2
$ kubectl describe pod web-7d4f8b9c-xk2lm | grep -A5 Conditions
$ kubectl get pod web-7d4f8b9c-xk2lm -o jsonpath='{.status.podIP}'
10.244.1.15

$ oc get pod web-7d4f8b9c-xk2lm -o wide
$ oc describe pod web-7d4f8b9c-xk2lm
$ oc get pod web-7d4f8b9c-xk2lm -o jsonpath='{.status.podIP}'
Never run bare pods in production. A standalone Pod is not self-healing—delete it and it stays gone. Node failure loses the workload permanently. Always use a controller (Deployment, StatefulSet, etc.) that owns the pod template.
The CNI plugin assigns each pod an IP from the range allocated to its node. When the pod dies, the IP is released, so never hardcode pod IPs. Services provide stable virtual IPs; headless Services return per-pod DNS records, which StatefulSets rely on.
"What's the difference between pod phase and container state?" — Phase is pod-level summary; container state is per-container (waiting/running/terminated). A pod can be Running while a sidecar is in CrashLoopBackOff if the main container is up but not all are ready (READY 1/2).
Init Containers
Init containers run sequentially before app containers start. Each must exit successfully before the next begins. Use them for setup tasks that must complete before the main process runs.
Sequential execution
Order follows the array in spec.initContainers. If an init container fails (non-zero exit), the kubelet reruns it with backoff until it succeeds. If the pod's restartPolicy is Never, the failure marks the whole pod Failed instead.
Common use cases
- Wait for dependencies: database availability, service mesh proxy registration
- Fetch secrets or config from external systems into a shared emptyDir
- Run database schema migrations before app starts
- Set filesystem permissions on volumes (fix ownership for non-root UIDs)
- Clone git repos or download artifacts into a shared volume
apiVersion: v1
kind: Pod
metadata:
  name: app-with-init
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z postgres 5432; do sleep 2; done']
  - name: migrate
    image: myapp:2.1.0
    command: ['./migrate.sh']
  containers:
  - name: app
    image: myapp:2.1.0
    ports:
    - containerPort: 8080
Sidecar containers (KEP-753)
Kubernetes 1.29+ supports native sidecars: init containers with restartPolicy: Always start before app containers, keep running alongside them, and terminate after app containers exit—proper lifecycle for service mesh proxies and log shippers.
spec:
  initContainers:
  - name: istio-proxy
    image: istio/proxyv2:1.22
    restartPolicy: Always   # KEP-753 sidecar
    ports:
    - containerPort: 15090
  containers:
  - name: app
    image: myapp:2.1.0
Before KEP-753, teams used regular containers as sidecars—but they started in parallel with the app, causing race conditions. Prefer restartPolicy: Always on init containers for mesh/logging sidecars on K8s 1.29+.
Init containers vs Jobs: Init runs inside the pod lifecycle—good for per-pod setup. A Kubernetes Job is better for one-time cluster-wide migrations or batch prep that shouldn't block every pod restart.
Deployments
A Deployment declares desired state for stateless applications. It owns a ReplicaSet, which owns Pods. Change the pod template → new ReplicaSet → rolling update replaces old pods incrementally.
ReplicaSet + rolling update
Each template change creates a new ReplicaSet with a unique pod-template-hash label. The Deployment controller scales the new RS up and the old RS down according to the update strategy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels:
    app: web
spec:
  replicas: 3
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # max pods below desired during update
      maxSurge: 1         # max extra pods above desired during update
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
Update strategies
| Strategy | Behavior | When to use |
|---|---|---|
| RollingUpdate | Replace pods incrementally via maxUnavailable/maxSurge | Default—zero-downtime with readiness probes |
| Recreate | Terminate all old pods, then create new ones | Single-replica apps, incompatible versions, dev/staging |
maxUnavailable and maxSurge accept integers or percentages (percentages round down for maxUnavailable and up for maxSurge). With 3 replicas and maxUnavailable: 1, maxSurge: 1, the rollout never runs more than 4 pods in total and never has fewer than 2 available.
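To make the percentage form concrete, here is a sketch of the strategy fragment with the arithmetic worked out in comments. The replica count of 10 is an assumption chosen for the example; 25% is the default for both fields.

```yaml
# Fragment of a Deployment spec; replicas: 10 is assumed for the arithmetic.
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%         # rounds up: ceil(10 * 0.25) = 3, so at most 13 pods exist
      maxUnavailable: 25%   # rounds down: floor(10 * 0.25) = 2, so at least 8 stay available
```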
Rollback, pause, and revision history
- revisionHistoryLimit — old ReplicaSets kept for rollback (default 10)
- kubectl rollout pause — freeze mid-update for canary testing
- kubectl rollout resume — continue paused rollout
- kubectl rollout undo — revert to previous ReplicaSet
$ kubectl set image deployment/web nginx=nginx:1.26
$ kubectl rollout status deployment/web --timeout=5m
$ kubectl rollout history deployment/web
$ kubectl rollout undo deployment/web --to-revision=2
$ kubectl rollout pause deployment/web
$ kubectl rollout resume deployment/web

$ oc set image deployment/web nginx=nginx:1.26
$ oc rollout status deployment/web
$ oc rollout history deployment/web
$ oc rollout undo deployment/web --to-revision=2
Production rolling updates require readiness probes. Without them, Kubernetes considers a pod ready as soon as its containers start, so traffic hits half-initialized apps. Set progressDeadlineSeconds (default 600) so that stuck rollouts are reported as failed.
Teams using GitOps (ArgoCD/Flux) rarely run kubectl set image manually—image tag changes flow through git. Rollback becomes git revert plus sync. Keep rollout undo in your incident playbook for emergencies.
Immutable label selectors: you cannot change spec.selector on an existing Deployment. Changing pod labels so that they no longer match the selector orphans the pods: the ReplicaSet stops managing them and creates replacements. Use kubectl apply --server-side carefully.
StatefulSets
StatefulSets manage pods that need stable network identity and stable storage. Pods get predictable names (web-0, web-1, web-2) and persistent volumes that follow them across reschedules.
Stable names & DNS
Pod hostname is <statefulset-name>-<ordinal>. With a headless Service named web, DNS records resolve per pod:
- web-0.web.default.svc.cluster.local
- web-1.web.default.svc.cluster.local
Ordered startup & termination
Pods start sequentially: web-0 must be Running and Ready before web-1 starts. Scale-down terminates highest ordinal first (web-2 before web-1).
apiVersion: v1
kind: Service
metadata:
  name: kafka
spec:
  clusterIP: None   # headless: required for StatefulSet DNS
  selector:
    app: kafka
  ports:
  - port: 9092
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka   # links to headless Service
  replicas: 3
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
      - name: kafka
        image: apache/kafka:3.7.0
        ports:
        - containerPort: 9092
        volumeMounts:
        - name: data
          mountPath: /var/lib/kafka
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
Update strategies
| Strategy | Behavior |
|---|---|
| RollingUpdate | Update pods in reverse ordinal order (default) |
| OnDelete | Manual—delete each pod to trigger update |
| podManagementPolicy: Parallel | Not an update strategy: a separate field that starts or terminates all pods simultaneously during scaling (e.g. sharded MongoDB) |
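The two non-default settings in the table live in different fields. The fragment below is a sketch of where each one goes, reusing the kafka StatefulSet name from this section; selector, template, and volumeClaimTemplates are omitted for brevity.

```yaml
# Fragment of a StatefulSet spec; selector, template, and volumeClaimTemplates omitted.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka
  replicas: 3
  podManagementPolicy: Parallel   # affects scaling only; cannot be changed after creation
  updateStrategy:
    type: OnDelete                # a pod picks up the new template only after it is deleted
```

With OnDelete, an operator or a human decides when each broker restarts, which is often the reason to choose it for quorum-based systems.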
Use cases
- Kafka — broker ID tied to ordinal; persistent log directories
- ZooKeeper / etcd — cluster membership requires stable identity
- Databases — PostgreSQL, MongoDB replica sets (often via operators)
"Deployment vs StatefulSet?" — Deployment: interchangeable pods, random names, shared storage optional. StatefulSet: stable hostname, ordinal scaling, PVC per pod via volumeClaimTemplates, requires headless Service for per-pod DNS.
Running databases in Kubernetes is debated. StatefulSets solve scheduling and storage—not backup, failover logic, or query routing. Production databases usually use operators (CloudNativePG, MongoDB Community Operator) or managed services.
DaemonSets
A DaemonSet ensures exactly one pod per matching node. A nodeSelector or node affinity in the pod template narrows which nodes match, for example only GPU nodes or nodes in one zone. When nodes join the cluster, DaemonSet pods are scheduled automatically; when nodes leave, those pods are garbage-collected.
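One way to restrict a DaemonSet to a subset of nodes is a nodeSelector in the pod template. The sketch below assumes a hypothetical node label (accelerator: nvidia) and a hypothetical agent image; only nodes carrying that label receive a pod.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-agent                    # hypothetical name
spec:
  selector:
    matchLabels:
      app: gpu-agent
  template:
    metadata:
      labels:
        app: gpu-agent
    spec:
      nodeSelector:
        accelerator: nvidia          # hypothetical node label
      containers:
      - name: agent
        image: myorg/gpu-agent:1.0   # hypothetical image
```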
Use cases
- Node-level log collectors (Fluent Bit, Vector, Filebeat)
- Monitoring agents (node_exporter, Datadog agent)
- CNI network plugins (Calico, Cilium node agents)
- Storage daemons (Ceph OSD, GlusterFS)
- Security scanners and compliance agents
Update strategy
| Strategy | Behavior |
|---|---|
| RollingUpdate | Replace pods one node at a time (default) |
| OnDelete | Update only when pod manually deleted |
Tolerations
DaemonSets commonly run on control plane nodes too. Add tolerations for control-plane taints:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:3.0
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
$ kubectl get daemonset -A
$ kubectl rollout status daemonset/fluent-bit -n logging
$ kubectl describe daemonset fluent-bit -n logging

$ oc get daemonset -A
$ oc rollout status daemonset/fluent-bit -n logging
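The DaemonSet controller creates one pod for each eligible node that lacks one and adds a nodeAffinity term pinning that pod to the node's name. The default scheduler then binds the pod, so taints, tolerations, and resource checks still apply. Older Kubernetes releases had the controller set spec.nodeName directly and bypass the scheduler.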
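The node-name affinity mentioned above looks roughly like the following in the spec of a DaemonSet-created pod. This is an illustrative fragment; worker-2 stands in for whichever node the pod was created for.

```yaml
# Fragment of a DaemonSet pod's spec as added by the controller (node name is an example).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - worker-2
```

Running kubectl get pod on a DaemonSet pod with -o yaml should show a term of this shape.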
OpenShift ships cluster logging and monitoring as DaemonSets/Operators managed by the platform. Custom DaemonSets on OCP need SCC grants—hostPath log mounts often require privileged or a custom SCC.
Jobs & CronJobs
Jobs run pods to completion—batch processing, migrations, one-off tasks. CronJobs wrap Jobs on a schedule, like cron on a single server but distributed across the cluster.
Job spec fields
| Field | Purpose |
|---|---|
| completions | Successful pod completions required (default 1) |
| parallelism | Concurrent pods running at once (default 1) |
| backoffLimit | Retries before marking Job failed (default 6) |
| activeDeadlineSeconds | Max duration—terminates running Job after timeout |
| ttlSecondsAfterFinished | Auto-delete Job after completion (cleanup) |
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-daily
spec:
  completionMode: Indexed   # required: the completion index exists only in Indexed mode
  completions: 5
  parallelism: 2
  backoffLimit: 3
  activeDeadlineSeconds: 3600
  ttlSecondsAfterFinished: 86400
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: etl
        image: myorg/etl:1.4.0
        args: ["--shard", "$(JOB_COMPLETION_INDEX)"]
        env:
        - name: JOB_COMPLETION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
CronJob patterns
concurrencyPolicy controls overlapping runs:
- Allow — multiple Jobs can run concurrently (default)
- Forbid — skip new run if previous still running
- Replace — cancel running Job, start new one
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-db
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: myorg/pg-backup:2.0
$ kubectl create job etl-manual --from=cronjob/backup-db
$ kubectl get jobs -w
$ kubectl logs job/etl-daily
$ kubectl delete job etl-daily

$ oc create job etl-manual --from=cronjob/backup-db
$ oc get jobs -w
For long-running or multi-step scheduled work, consider a workflow engine (Temporal, Argo Workflows) instead of CronJobs. CronJobs lack dependency graphs, fine-grained retry policies, and the observability that workflow engines provide.
Job pods require restartPolicy: Never or OnFailure—not Always. Forgotten completed Jobs accumulate; set ttlSecondsAfterFinished or use a cleanup CronJob.
Horizontal Pod Autoscaler (HPA)
HPA automatically adjusts replicas on Deployments, StatefulSets, or ReplicaSets based on observed metrics. Scale out when load rises; scale in when it drops—within min/max bounds.
Metric sources
- CPU / memory (resource metrics; a built-in metric type, served by the metrics-server add-on)
- Custom metrics — Prometheus adapter, Datadog, etc. (e.g. requests/sec)
- External metrics — SQS queue depth, Pub/Sub backlog
- KEDA — event-driven autoscaling ScaledObject CRD (Kafka lag, cron, cloud queues)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
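The manifest above uses only Resource metrics. For the custom and external sources listed earlier, the metrics array takes Pods and External entries instead. The fragment below is a sketch: the metric names, the queue label, and the target values are hypothetical and depend on what the installed metrics adapter exposes.

```yaml
# Fragment of an autoscaling/v2 HPA spec; metric names and targets are hypothetical.
metrics:
- type: Pods                  # custom metric, averaged across the target's pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
- type: External              # metric that originates outside the cluster
  external:
    metric:
      name: queue_messages_ready
      selector:
        matchLabels:
          queue: orders
    target:
      type: AverageValue
      averageValue: "30"
```

When several metrics are listed, the HPA computes a replica count for each and uses the largest.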
Behavior & stabilization
behavior (autoscaling/v2) controls scale-up/down velocity and stabilization windows, which prevents flapping when metrics oscillate. Scale-down typically uses a longer window than scale-up.
CPU requires requests
HPA CPU utilization is actual usage ÷ requested CPU. If any container in a pod lacks resources.requests.cpu, utilization for that pod is undefined and the HPA takes no action on the CPU metric. Install metrics-server in every cluster; without it, HPA shows <unknown> metrics.
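A short worked example may help here. The numbers are assumptions chosen for illustration: 3 replicas, each requesting 100m CPU and averaging 90m of usage, against a 70% target. The replica formula is the one documented for the HPA controller.

```latex
\[
\text{utilization} = \frac{\text{average usage}}{\text{requested CPU}}
                   = \frac{90\,\text{m}}{100\,\text{m}} = 90\%
\]
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
      \frac{\text{currentUtilization}}{\text{targetUtilization}} \right\rceil
  = \left\lceil 3 \times \frac{90}{70} \right\rceil
  = \lceil 3.86 \rceil = 4
\]
```

Lowering the request without any change in usage raises the computed utilization, which is why requests are part of the scaling contract.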
$ kubectl autoscale deployment web --cpu-percent=70 --min=2 --max=10
$ kubectl get hpa web-hpa -w
NAME      REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS
web-hpa   Deployment/web   45%/70%   2         20        3
$ kubectl describe hpa web-hpa

$ oc autoscale deployment web --cpu-percent=70 --min=2 --max=10
$ oc get hpa web-hpa -w
Set minReplicas ≥ 2 for HA. Pair HPA with PDB (next section) so scale-down and node drains respect availability. For Kafka consumers, prefer KEDA over CPU-based HPA—CPU doesn't reflect lag.
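For the Kafka consumer case, a KEDA ScaledObject scales on consumer lag rather than CPU. The sketch below assumes KEDA is installed in the cluster; the Deployment name, topic, consumer group, broker address, and lag threshold are all hypothetical.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer            # hypothetical name
spec:
  scaleTargetRef:
    name: orders-consumer          # Deployment to scale (hypothetical)
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092 # assumed broker address
      consumerGroup: orders
      topic: orders
      lagThreshold: "100"          # target lag per replica
```

KEDA manages an HPA for the target behind the scenes, so a separate HPA on the same Deployment should not be created alongside it.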
"Why isn't HPA scaling?" — Checklist: metrics-server running? CPU requests set? Target utilization reachable? minReplicas already met? Custom metrics adapter registered? HPA events in kubectl describe hpa.
Vertical Pod Autoscaler (VPA)
VPA adjusts CPU and memory requests/limits for containers—not replica count. It learns from historical usage and recommends or applies right-sized resources. Requires the VPA controller (not built into core K8s).
Update modes
| Mode | Behavior |
|---|---|
| Off | Compute recommendations only—display in VPA status, no changes |
| Initial | Set resources on pod creation only; no updates to running pods |
| Auto | Evict and recreate pods with updated requests (disruptive) |
| Recreate | Like Auto—evicts pods when resources change |
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"   # start with recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: nginx
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
Conflict with HPA on CPU
Do not run VPA Auto and HPA on the same CPU metric—they fight: VPA changes requests (denominator), HPA recalculates utilization (numerator/denominator). Common patterns:
- HPA on custom/external metrics + VPA on CPU/memory
- VPA in Off mode for recommendations; apply manually in git
- HPA on CPU + VPA with controlledResources: [memory] only
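The last pattern in the list can be sketched as follows: the VPA manages memory only, leaving CPU requests untouched so a CPU-based HPA keeps a stable denominator. The object name is hypothetical, and the wildcard container name applies the policy to every container in the pod.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa-memory             # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]   # CPU requests stay as written in the Deployment
```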
VPA vs manual sizing: VPA excels at workloads with unpredictable usage (JVM warm-up, batch spikes). For stable microservices, git-managed requests from load tests are simpler and don't cause eviction churn. K8s in-place pod resize (alpha in 1.27, beta in 1.33) may reduce VPA disruption in future releases.
VPA Recommender reads metrics from metrics-server (and optionally Prometheus). Updater evicts pods when Auto mode applies new requests. Admission controller injects resources at pod create time.
Pod Disruption Budgets (PDB)
PDBs limit voluntary disruptions that go through the Eviction API, such as node drains and cluster upgrades. They do not stop involuntary disruptions (node hardware failure, kernel OOM kills), and deleting a pod directly with kubectl delete bypasses them.
minAvailable vs maxUnavailable
Specify exactly one (not both):
- minAvailable — minimum pods that must stay available (integer or %)
- maxUnavailable — maximum pods that can be unavailable during disruption
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
---
# Alternative: percentage-based (use instead of web-pdb, not alongside it)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb-pct
spec:
  maxUnavailable: 25%
  selector:
    matchLabels:
      app: web
Interaction with node drain
kubectl drain evicts pods voluntarily. The eviction API checks PDBs: if evicting a pod would violate the budget, the eviction is refused and the drain keeps retrying until it succeeds or times out. Always create PDBs before cluster upgrades on production workloads.
$ kubectl get pdb
NAME      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
web-pdb   2               N/A               1                     3d
$ kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data
evicting pod default/web-7d4f8b9c-xk2lm
error when evicting pods/"web-7d4f8b9c-xk2lm" - cannot evict as it would violate PDB "web-pdb"

$ oc get pdb
$ oc adm drain worker-2 --ignore-daemonsets --delete-emptydir-data
PDB with single replica — minAvailable: 1 on a 1-replica Deployment blocks all drains. Either run ≥2 replicas for HA or accept maintenance downtime. PDBs on Jobs are usually meaningless.
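For a single-replica workload where maintenance downtime is acceptable, one option is a budget expressed as maxUnavailable: 1, which lets the drain evict the only pod. This is a sketch; the name and label are hypothetical.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: admin-ui-pdb       # hypothetical name
spec:
  maxUnavailable: 1        # with 1 replica, one disruption is still allowed
  selector:
    matchLabels:
      app: admin-ui        # hypothetical label
```

The same manifest keeps working if the Deployment is later scaled to several replicas, where it then limits drains to one pod at a time.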
Platform teams enforce PDB presence via Kyverno/OPA policies before workloads reach production namespaces. During EKS/GKE/OCP upgrades, drains that stall on PDB violations are a common cause of overnight pages, usually for stateless apps running too few replicas to satisfy their budget.
OpenShift Workload Additions
OpenShift extends Kubernetes with legacy and developer-centric workload APIs. Modern OCP clusters prefer standard Deployments, but you'll encounter DeploymentConfigs, ImageStreams, and BuildConfigs in brownfield environments.
DeploymentConfig (legacy)
Pre-Deployment OpenShift resource with built-in rollout triggers (ConfigChange, ImageChange). Uses ReplicationControllers instead of ReplicaSets. Deprecated—migrate to apps/v1 Deployment + triggers via ArgoCD or OCP GitOps.
ImageStream
Abstraction over container images—tracks tags, mirrors external registries, triggers redeploys when tags change. Internal registry: image-registry.openshift-image-registry.svc:5000.
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: myapp
spec:
  lookupPolicy:
    local: true
  tags:
  - name: latest
    from:
      kind: DockerImage
      name: quay.io/myorg/myapp:2.1.0
BuildConfig
Builds images inside the cluster—Source-to-Image (S2I), Docker, Pipeline strategies. Output pushes to ImageStreamTag, which can trigger Deployment rollout.
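A minimal S2I BuildConfig might look like the sketch below. The Git URL and the builder image tag are assumptions made for the example; the output targets the myapp ImageStream used elsewhere in this section.

```yaml
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: myapp
spec:
  source:
    type: Git
    git:
      uri: https://example.com/myorg/myapp.git   # hypothetical repository
  strategy:
    type: Source                                  # Source-to-Image
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: nodejs:20-ubi9                      # assumed builder tag; check oc get is -n openshift
        namespace: openshift
  output:
    to:
      kind: ImageStreamTag
      name: myapp:latest
  triggers:
  - type: ConfigChange                            # build once when the BuildConfig is created
```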
$ # kubectl has no equivalent; use CI/CD or buildah locally
$ kubectl create deployment web --image=nginx:1.25
$ kubectl rollout status deployment/web

$ oc new-app --name=web nginx:1.25
→ ImageStream, Build (optional), Deployment, Service created
$ oc start-build myapp --from-dir=. --follow
$ oc rollout status dc/web
$ oc rollout history dc/web
$ oc rollout undo dc/web
$ oc set image dc/web web=myapp:latest --trigger
oc new-app is the fastest onboarding path: it creates a Deployment (or DeploymentConfig in older templates), Service, and ImageStream in one command. Add a Route afterwards with oc expose service/web. For production, replace with GitOps manifests and external CI building to Quay/ECR.
"Deployment vs DeploymentConfig?" — DeploymentConfig is OCP-specific, RC-based, supports ImageChange triggers natively. Deployment is portable K8s standard with ReplicaSets. Red Hat recommends Deployments for new workloads on OCP 4.x.
oc rollout works on Deployments, DeploymentConfigs, and DaemonSets. Use oc get is to inspect ImageStream tags; oc describe bc for build history and triggers.
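A standard Deployment can get behavior similar to a DeploymentConfig ImageChange trigger through the image.openshift.io/triggers annotation. The fragment below is a sketch that assumes a container named web and the myapp:latest ImageStreamTag; only the metadata is shown.

```yaml
# Fragment of a Deployment; assumes a container named "web" in the pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  annotations:
    image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"myapp:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"web\")].image"}]'
```

When the tag moves, OpenShift rewrites the container image field, which starts an ordinary rolling update.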