Storage, Volumes & PersistentVolumes

Pods are ephemeral; container filesystems vanish on restart. Kubernetes separates ephemeral pod volumes (config, secrets, scratch space) from the PersistentVolume subsystem (PV, PVC, StorageClass) that provisions durable block or file storage via CSI drivers. Get storage wrong and you get data loss on reschedule, PVCs stuck in Pending, or databases that corrupt under multi-writer RWX assumptions.

developer devops architect CSI K8s 1.29+ OpenShift 4.x

Volume types

Not every mount is a PersistentVolume. Pod-level volumes are declared in the Pod spec and exist for the pod's lifetime: they survive container restarts within the same pod, but none survive pod deletion unless backed by a PV.

Type | Backed by | Survives container restart | Survives pod delete | Typical use
emptyDir | Node disk or RAM (medium: Memory) | Yes (same pod) | No | Scratch space, sidecar log shipping, inter-container sharing
configMap / secret | API object projected as files | Yes | No (recreated from API) | App config, TLS certs, credentials (prefer External Secrets in prod)
hostPath | Directory on the node filesystem | Yes | No | Node agents, dev-only shortcuts; avoid for app data
downwardAPI | Pod metadata via API | Yes | No | Expose labels, annotations, resource limits to the container
projected | Multiple sources in one mount | Yes | No | Service account token + config + downwardAPI in a single volume
persistentVolumeClaim | PV / CSI-provisioned storage | Yes | Data on PV persists; new pod reattaches via PVC | Databases, queues, any durable state

emptyDir

Created when the pod is scheduled to a node. Initially empty; all containers in the pod can read/write the same path. Deleted when the pod is removed from the node. Use medium: Memory for tmpfs-style RAM disks (bounded by node memory; contents count against container memory limits if set).

configMap and secret volumes

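Mount keys as individual files under a directory. Updates propagate on the kubelet sync interval (not instantly), so apps should watch the files for changes. A key mounted via subPath never receives updates, so use subPath carefully. Secret values are base64-encoded, not encrypted, and are stored unencrypted in etcd by default (encryption at rest is a separate concern; see Configuration Management).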
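
The difference between a directory mount and a subPath mount can be sketched as follows. This is a minimal illustration, not a recommended layout: the ConfigMap app-config is assumed to exist, and the key app.conf is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: config-mount-demo
spec:
  containers:
    - name: app
      image: nginx:1.27
      volumeMounts:
        # Directory mount: each selected key becomes a file and is
        # refreshed by the kubelet after the ConfigMap changes.
        - name: cfg
          mountPath: /etc/app
          readOnly: true
        # subPath mount: one file at an exact path; it is NOT refreshed
        # when the ConfigMap changes (the pod must be restarted).
        - name: cfg
          mountPath: /etc/nginx/conf.d/app.conf
          subPath: app.conf
          readOnly: true
  volumes:
    - name: cfg
      configMap:
        name: app-config      # assumed to exist in the pod's namespace
        items:
          - key: app.conf     # hypothetical key
            path: app.conf
```

The same volume is mounted twice here only to contrast the two behaviours; a real pod would normally pick one.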

hostPath

Maps a path on the host into the container. Ties data to a specific node; if the pod reschedules elsewhere, the data does not follow. Dangerous with privileged paths—restricted by Pod Security Admission on hardened clusters.

downwardAPI

Projects pod fields (name, namespace, labels, annotations, resource requests/limits) into files or env vars. Useful for observability sidecars that tag logs with pod metadata without talking to the API server.
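
As a rough sketch, the same pod metadata can be exposed both as environment variables and as files. The pod name, label, and memory limit below are placeholders chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: downward-demo
  labels:
    app: downward-demo
spec:
  containers:
    - name: app
      image: nginx:1.27
      resources:
        limits:
          memory: 256Mi
      env:
        # Single fields are available as environment variables.
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
      volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo
          readOnly: true
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          # The full label set is written to a file, one key=value per line.
          - path: labels
            fieldRef:
              fieldPath: metadata.labels
          # Container memory limit, expressed in MiB.
          - path: mem_limit_mib
            resourceFieldRef:
              containerName: app
              resource: limits.memory
              divisor: 1Mi
```

Files in the volume are updated when labels or annotations change; environment variables are fixed at container start.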

projected volumes

Combines multiple volume sources into one mount—commonly service account tokens (with audience/expiry), configMaps, secrets, and downwardAPI. Replaces older separate mounts and supports automatic token rotation.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: volume-demo
spec:
  containers:
    - name: app
      image: nginx:1.27
      volumeMounts:
        - name: cache
          mountPath: /cache
        - name: cfg
          mountPath: /etc/app
          readOnly: true
        - name: meta
          mountPath: /etc/podinfo
        - name: sa-token
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    - name: cache
      emptyDir:
        medium: Memory
        sizeLimit: 256Mi
    - name: cfg
      configMap:
        name: app-config
    - name: meta
      downwardAPI:
        items:
          - path: labels
            fieldRef:
              fieldPath: metadata.labels
    - name: sa-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600
              audience: api
⚠️ Pitfall

hostPath for application data: A pod rescheduled to another node loses hostPath data silently. DaemonSets using hostPath for logs are fine; stateful apps are not. Use PVCs for anything that must survive node failure.

💡 Pro Tip

Use emptyDir with a log-shipping sidecar: the app writes to a shared emptyDir; Fluent Bit or Vector tails and ships to Loki/Elasticsearch. No PVC cost, no orphaned volumes—logs are ephemeral by design.
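
A minimal sketch of that sidecar layout, under stated assumptions: the Fluent Bit image tag is a guess, and the shipper's own configuration (inputs, outputs, credentials) is omitted entirely.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-shipper
spec:
  containers:
    - name: app
      image: nginx:1.27
      volumeMounts:
        - name: logs
          mountPath: /var/log/nginx   # app writes log files here
    - name: log-shipper
      image: fluent/fluent-bit:3.0    # tag is an assumption
      volumeMounts:
        - name: logs
          mountPath: /var/log/app     # same files, seen by the sidecar
          readOnly: true
  volumes:
    - name: logs
      emptyDir:
        sizeLimit: 1Gi                # pod is evicted if usage exceeds this
```

The sizeLimit is a guard against a runaway logger filling the node disk; pick a value that matches the app's log volume between shipping intervals.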

Persistent Volume subsystem

PVs represent actual storage capacity; PVCs are namespace-scoped claims; StorageClasses describe how to provision new PVs dynamically. The control plane binds a PVC to a matching PV—or triggers a provisioner to create one.

Core objects

Object | Scope | Role
PersistentVolume (PV) | Cluster | Represents a piece of storage (NFS export, EBS volume, Ceph RBD image)
PersistentVolumeClaim (PVC) | Namespace | Pod's request for storage: size, access mode, StorageClass
StorageClass | Cluster | Provisioner, parameters, reclaim policy, binding mode

Dynamic provisioning flow

sequenceDiagram
  participant Dev as Developer
  participant API as API Server
  participant Ctrl as PVC Controller
  participant Prov as CSI Provisioner
  participant Back as Storage Backend

  Dev->>API: Create PVC (storageClassName, size)
  API->>Ctrl: Watch PVC Pending
  Ctrl->>Prov: CreateVolume (CSI CreateVolume)
  Prov->>Back: Allocate disk / export
  Back-->>Prov: Volume ID
  Prov->>API: Create PV (Bound to PVC)
  API-->>Dev: PVC status Bound
  1. User creates a PVC referencing a storageClassName and resources.requests.storage.
  2. The PV controller finds no existing PV that matches, so the claim is left to the provisioner named in the StorageClass; the external-provisioner sidecar picks it up and calls the driver's CSI CreateVolume.
  3. Provisioner creates storage in the backend and a PV object with a claimRef pointing to the PVC.
  4. PVC status becomes Bound; pods can reference the PVC in volumes.persistentVolumeClaim.
  5. Once the pod is scheduled and the volume is attached to the node, the kubelet calls CSI NodeStageVolume (if the driver supports staging) and NodePublishVolume to mount it for the container.
yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-data
  namespace: payments
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 100Gi
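  volumeMode: Filesystem   # Postgres needs a mounted filesystem, not a raw block device
---
apiVersion: v1
kind: Pod
metadata:
  name: postgres-0
  namespace: payments      # must match the PVC's namespace
spec:
  containers:
    - name: postgres
      image: postgres:16
      env:
        - name: PGDATA     # subdirectory avoids initdb failing on lost+found
          value: /var/lib/postgresql/data/pgdata
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: pg-credentials   # hypothetical Secret, not defined here
              key: password
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pg-data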
bash
# Inspect PVC binding and events
kubectl get pvc -n payments
kubectl describe pvc pg-data -n payments

# List PVs and reclaim status
kubectl get pv

# OpenShift equivalents
oc get pvc -n payments
oc describe pvc pg-data -n payments
oc get storageclass
🔬 Under the Hood

The external-provisioner sidecar watches PVCs with a matching storageClassName and no bound PV. It calls the CSI driver's CreateVolume gRPC, then creates a PV with spec.csi fields (driver, volumeHandle, fsType). The in-tree cloud volume plugins are deprecated—CSI is the only supported path for new integrations.

Storage classes

A StorageClass is the platform team's contract with developers: which provisioner runs, default reclaim behavior, whether volumes wait for a pod before binding, and whether expansion is allowed.

Common provisioners

Provisioner | Backend | Typical access | Notes
ebs.csi.aws.com | AWS EBS (gp3, io2) | RWO | AZ-bound; pod must schedule in same AZ as volume
pd.csi.storage.gke.io | GCE Persistent Disk | RWO | Regional PDs for multi-zone resilience
disk.csi.azure.com | Azure Disk / Ultra | RWO | Ultra Disk for low-latency workloads
nfs.csi.k8s.io | NFS server export | RWX | Shared file storage; higher latency than block
rook-ceph.rbd.csi.ceph.com | Ceph RBD (Rook) | RWO | On-prem / bare-metal block; ODF uses Rook-Ceph
driver.longhorn.io | Longhorn replicated block | RWO | Popular on edge and homelab clusters

Key StorageClass fields

  • provisioner: CSI driver name that handles create/delete.
  • parameters: Driver-specific key/value pairs (e.g. type: gp3, encrypted: "true", a filesystem type such as xfs).
  • reclaimPolicy: Delete (default for dynamic) or Retain (PV object kept; backend volume needs manual cleanup).
  • volumeBindingMode: Immediate (provision on PVC create) vs WaitForFirstConsumer (delay until a pod is scheduled; critical for topology-aware provisioning).
  • allowVolumeExpansion: Enables PVC resize via kubectl patch when the CSI driver supports it.
yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
mountOptions:
  - noatime   # options must be valid for the filesystem chosen above
⚖️ Trade-off

ReclaimPolicy Retain vs Delete:

Delete — PVC deletion triggers PV deletion and backend volume destruction. Clean for dev/test; risk of accidental data loss if someone deletes the wrong PVC. No orphaned cloud bills from forgotten EBS volumes.

Retain — PVC deletion sets PV to Released; the EBS disk / Ceph image remains. Platform team must manually reclaim or re-bind. Safer for production databases and compliance retention—but requires runbooks and periodic orphan audits.

Many enterprises set Retain on production StorageClasses and Delete on ephemeral dev classes.

⚙️ Config

Mark one StorageClass as cluster default with storageclass.kubernetes.io/is-default-class: "true". PVCs without storageClassName bind to it. On OpenShift, platform defaults often come from the installed storage operator—override explicitly in app manifests to avoid surprises.
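
How a claim interacts with the default class can be sketched with three PVCs. The claim names and sizes are hypothetical; gp3-encrypted is the class defined earlier in this section.

```yaml
# storageClassName omitted: the cluster default class provisions the volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uses-default-class
  namespace: payments
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
# Class named explicitly: unaffected if the cluster default changes later.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uses-explicit-class
  namespace: payments
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 10Gi
---
# Empty string: no dynamic provisioning; binds only to a pre-created PV
# that also has no class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uses-no-class
  namespace: payments
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ""
  resources:
    requests:
      storage: 10Gi
```

The explicit form is the one that matches the advice above about avoiding surprises from platform defaults.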

bash
# Expand a PVC (requires allowVolumeExpansion + CSI support)
kubectl patch pvc pg-data -n payments -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Verify expansion status
kubectl get pvc pg-data -n payments -o jsonpath='{.status.conditions}'

# Set default StorageClass
kubectl annotate storageclass gp3-encrypted storageclass.kubernetes.io/is-default-class=true --overwrite

# OpenShift equivalent of the expansion patch
oc patch pvc pg-data -n payments -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

Access modes

Access modes describe how many nodes can mount a volume simultaneously—not POSIX file permissions inside the mount. Mismatch between what your app needs and what the StorageClass provides is a top cause of StatefulSet failures.

Mode | Abbrev | Meaning | Typical backend
ReadWriteOnce | RWO | Single node read/write (multiple pods on same node OK) | EBS, Azure Disk, Ceph RBD, Longhorn
ReadOnlyMany | ROX | Many nodes read-only | NFS, CephFS (read-only export), config snapshots
ReadWriteMany | RWX | Many nodes read/write concurrently | NFS, CephFS, Azure Files, EFS
ReadWriteOncePod | RWOP | Exactly one pod in the cluster (alpha in K8s 1.22, GA in 1.29; CSI volumes only) | Block volumes where even same-node multi-pod is unsafe
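
As an illustration of the strictest mode, a claim that only one pod may use at a time could look like this. The claim name and size are hypothetical, and the class is assumed to be backed by a CSI driver that supports the mode.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ledger-data            # hypothetical claim name
  namespace: payments
spec:
  accessModes:
    - ReadWriteOncePod         # a second pod using this claim is not scheduled
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 50Gi
```

Compared with ReadWriteOnce, this closes the gap where two pods on the same node could both write to the volume.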

Cloud block storage limitations

AWS EBS, GCE PD, and Azure Managed Disks normally attach to one node at a time: they support RWO (and RWOP), not RWX. Running two replicas of a Deployment against the same RWO PVC fails as soon as the pods land on different nodes: the second pod is stuck in ContainerCreating with a Multi-Attach error. For shared files use NFS, EFS, CephFS, or Azure Files, not block StorageClasses.

🎯 Interview Tip

"Can two pods share one PVC?" — Depends on access mode and backend. RWO: only if both pods land on the same node (fragile). RWX: yes, across nodes—requires a file protocol backend. RWOP: never more than one pod cluster-wide. Block + RWX is invalid on most clouds.

🔒 Security

RWX NFS exports increase blast radius—a compromised pod on any node can modify shared data. Scope RWX to namespaces that need it; use fsGroup and supplementalGroups for POSIX permissions; prefer RWO per replica for databases.
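
A minimal sketch of the pod-level settings mentioned above. All IDs, the namespace, and the claim name are placeholders, and whether fsGroup ownership is actually applied depends on the volume driver (NFS backends in particular may ignore it).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-writer
  namespace: media                      # hypothetical namespace
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000                       # group applied to the volume, where supported
    fsGroupChangePolicy: OnRootMismatch # skip recursive chown when ownership already matches
    supplementalGroups: [4000]          # extra group, e.g. one matching the export's GID
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "id && sleep 3600"]   # prints the effective UID and groups
      volumeMounts:
        - name: content
          mountPath: /data
  volumes:
    - name: content
      persistentVolumeClaim:
        claimName: shared-content       # hypothetical RWX claim
```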

Volume lifecycle

From admin-precreated PVs to fully dynamic CSI provisioning, understanding binding and reclaim phases prevents orphaned disks, stuck namespaces, and accidental production data deletion.

Static vs dynamic provisioning

Approach | Who creates PV | When to use
Static | Cluster admin pre-creates PV + backend LUN/export | Legacy SAN/NFS, strict capacity planning, air-gapped without dynamic provisioner
Dynamic | Provisioner on PVC create | Default for cloud-native; StorageClass drives parameters
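
The static approach can be sketched as a pre-created PV plus a claim pinned to it. The NAS host, export path, and capacity are hypothetical; the object names reuse pv-001 and my-claim from the commands later in this section.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-001
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""              # no class: never handed to a dynamic provisioner
  nfs:
    server: nas.example.internal    # hypothetical NAS host
    path: /exports/app-data         # hypothetical export
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
  namespace: app
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""              # must match the PV's (empty) class
  volumeName: pv-001                # bind to this specific PV
  resources:
    requests:
      storage: 500Gi
```

Setting volumeName is optional; without it the control plane picks any Available PV that satisfies the size, mode, and class.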

PVC / PV phases

stateDiagram-v2
  [*] --> Pending: PVC created
  Pending --> Bound: PV matched or provisioned
  Bound --> [*]: PVC deleted
  note right of Bound
    Pod mounts via kubelet + CSI
  end note
  [*] --> Available: Static PV ready
  Available --> Bound: claimRef set
  Bound --> Released: PVC deleted
  Released --> Available: claimRef cleared (manual)
  Released --> [*]: Reclaim Delete or admin cleanup
  1. Pending: PVC waiting for matching PV or provisioner. Check events: wrong StorageClass, insufficient quota, provisioner down, or WaitForFirstConsumer waiting for a schedulable pod.
  2. Bound: PVC linked to PV; pods can mount. PV claimRef prevents other PVCs from stealing it.
  3. Released: PVC deleted but PV retained (reclaimPolicy: Retain). PV still holds stale claimRef; must be cleared before reuse.
  4. Reclaim: With Delete, the external-provisioner calls DeleteVolume on the backend and removes the PV object. With Retain, an admin snapshots or cleans up the backend volume and deletes the PV manually.
bash
# PVC stuck Pending — check events and StorageClass
kubectl describe pvc my-claim -n app
kubectl get events -n app --field-selector involvedObject.name=my-claim

# Released PV — clear claimRef to reuse (static workflows)
kubectl patch pv pv-001 -p '{"spec":{"claimRef": null}}'

# Find orphaned PVs (Released phase)
kubectl get pv | grep Released

# Force-delete stuck PVC (last resort — understand data impact)
kubectl patch pvc my-claim -n app -p '{"metadata":{"finalizers":null}}' --type=merge
📦 Real World

A common post-incident finding: dev StorageClass with reclaimPolicy: Delete and engineers using the same class name in staging. A Helm uninstall deletes PVCs and wipes staging databases. Fix: separate StorageClasses per environment, Retain on anything stateful, and Velero backups before destructive operations.

Container Storage Interface (CSI)

CSI standardizes how orchestrators talk to storage vendors over gRPC. Each driver ships a controller plugin (provision, attach, snapshot) and node plugin (mount on kubelet host)—replacing in-tree cloud provider code.

CSI components in Kubernetes

  • CSI driver controller: CreateVolume, DeleteVolume, CreateSnapshot, controller publish.
  • CSI driver node: DaemonSet on every node; NodeStageVolume, NodePublishVolume.
  • external-provisioner: Watches PVCs, creates PVs.
  • external-attacher: Watches VolumeAttachment objects and calls ControllerPublishVolume / ControllerUnpublishVolume to attach or detach volumes on nodes.
  • external-snapshotter: Handles VolumeSnapshot requests (CreateSnapshot / DeleteSnapshot) together with the snapshot controller.
  • external-resizer: Handles PVC expansion.

Volume snapshots

VolumeSnapshot objects are namespaced (snapshot.storage.k8s.io/v1); VolumeSnapshotClass and VolumeSnapshotContent are cluster-scoped. A VolumeSnapshotClass names the driver and deletion policy; a VolumeSnapshot references a PVC in the same namespace or, for pre-provisioned snapshots, an existing VolumeSnapshotContent. Restore by creating a new PVC with dataSource pointing to the VolumeSnapshot.

yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-snapclass
driver: ebs.csi.aws.com
deletionPolicy: Retain
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pg-data-snap-20250605
  namespace: payments
spec:
  volumeSnapshotClassName: csi-aws-snapclass
  source:
    persistentVolumeClaimName: pg-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-data-restored
  namespace: payments
spec:
  storageClassName: gp3-encrypted
  dataSource:
    name: pg-data-snap-20250605
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
bash
# Verify CSI driver and snapshot CRDs
kubectl get csidrivers
kubectl api-resources | grep snapshot

# Create and inspect snapshot
kubectl apply -f pg-snapshot.yaml
kubectl get volumesnapshot -n payments
kubectl describe volumesnapshot pg-data-snap-20250605 -n payments

# List CSI node pods (namespace and label vary by driver; app=csi-node is a placeholder)
kubectl get pods -n kube-system -l app=csi-node

# OpenShift equivalent
oc get volumesnapshot -n payments

In-tree deprecation

Legacy in-tree volume plugins (kubernetes.io/aws-ebs, GCE PD, Azure Disk) have been removed from recent Kubernetes releases. Existing PVs and StorageClasses that still name them are served by the matching CSI driver through CSI migration, which translates in-tree requests into CSI calls without moving data, so that driver must be installed before upgrading. New clusters should install only CSI drivers from day one.

🔬 Under the Hood

When a pod that uses a PVC is scheduled, the attach/detach controller creates a VolumeAttachment object. The external-attacher calls CSI ControllerPublishVolume to attach the disk to the node. The kubelet then calls the node plugin's NodeStageVolume (where supported) and NodePublishVolume to mount the volume into the pod's volume directory on that node, which is why the CSI node DaemonSet must run on every worker that hosts such pods.

Stateful application patterns

StatefulSets give stable network identity and per-replica PVCs via volumeClaimTemplates. Databases on Kubernetes are viable when you respect storage semantics, backups, and failure domains—not when you treat PVCs like magic infinite disks.

StatefulSet + PVC templates

Each pod name-N gets its own PVC from the template, named template-name-N (for example data-redis-cluster-0). Pods are created in ordinal order; PVCs persist after StatefulSet scale-down by design, so data is still there when the replica returns. Use podManagementPolicy: Parallel only when ordering does not matter.

yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7.2
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 20Gi
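
The PVC retention behaviour described above can also be stated explicitly in the StatefulSet spec. The fragment below is a sketch to merge into the manifest above; it assumes the cluster's Kubernetes version supports the persistentVolumeClaimRetentionPolicy field.

```yaml
# Fragment: merge into the redis-cluster StatefulSet spec.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain    # keep data-redis-cluster-N when replicas shrink
    whenDeleted: Retain   # keep PVCs when the StatefulSet itself is deleted
```

Retain for both is the default behaviour; switching either value to Delete makes the controller remove the PVCs automatically, which is usually only appropriate for disposable environments.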

Shared RWX pattern

Content farms, CI artifact caches, and legacy apps expecting NFS use a single RWX PVC mounted by multiple Deployment replicas. Requires CephFS, NFS, EFS, or Azure Files—not block StorageClasses. Watch file-lock semantics and performance under concurrent writers.
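
A minimal sketch of the shared pattern: one RWX claim mounted by every replica of a Deployment. The namespace, names, and size are hypothetical, and the class ocs-storagecluster-cephfs is assumed to exist; substitute any file-backed RWX class.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-content
  namespace: media
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ocs-storagecluster-cephfs   # assumed RWX-capable class
  resources:
    requests:
      storage: 200Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: content-web
  namespace: media
spec:
  replicas: 3
  selector:
    matchLabels:
      app: content-web
  template:
    metadata:
      labels:
        app: content-web
    spec:
      containers:
        - name: web
          image: nginx:1.27
          volumeMounts:
            - name: content
              mountPath: /usr/share/nginx/html
      volumes:
        - name: content
          persistentVolumeClaim:
            claimName: shared-content   # same claim on every replica
```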

Database on Kubernetes

  • Single-primary RWO: Postgres/MySQL with one replica and PVC; backups via VolumeSnapshot or logical dump (pg_dump).
  • Managed DB often wins: RDS/Cloud SQL handles patching, failover, and IOPS planning; K8s runs the app tier.
  • When in-cluster: Use operators (CloudNativePG, Percona, Strimzi for Kafka) for failover, backup CRDs, and version upgrades.
  • Topology: WaitForFirstConsumer provisions each volume in the zone where its pod is scheduled; pod anti-affinity spreads replicas across zones or nodes.
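
The anti-affinity half of the topology point could be expressed as below. This is a fragment of a StatefulSet pod template, and the app: postgres label is a hypothetical stand-in for whatever label the replicas carry.

```yaml
# Fragment: goes under spec.template.spec of the StatefulSet.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: postgres                        # hypothetical pod label
        topologyKey: topology.kubernetes.io/zone # at most one replica per zone
```

With a WaitForFirstConsumer class, each replica's volume is then created in the zone the scheduler picked for it. The required form blocks scheduling when there are more replicas than zones; the preferred form is the softer alternative.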

Operators

Operators encode day-2 operations: backup schedules, point-in-time recovery, rolling credential rotation, and rebalancing. They reconcile custom resources (e.g. PostgresCluster) into StatefulSets, Services, PVCs, and CronJobs; see Operators & CRDs.

⚖️ Trade-off

Database in K8s vs managed service: In-cluster gives GitOps uniformity and snapshot CRDs but shifts patching, failover, and capacity planning to your platform team. Managed RDS/Aurora buys operational simplicity at the cost of vendor lock-in and extra network latency. Architects often run stateless workloads on K8s and state on a managed DB, unless operators and SRE maturity justify full-stack ownership.

💡 Pro Tip

Before production cutover, failure-test: delete a worker node, verify StatefulSet pod reschedules and RWO volume reattaches in the same AZ. Simulate AZ outage—regional disks or cross-AZ restore from snapshots should be documented, not discovered during an incident.

OpenShift storage

OpenShift ships opinionated storage operators and default StorageClasses. Platform teams choose between OpenShift Data Foundation (ODF/Rook-Ceph), cloud CSI defaults, NFS provisioners, or Local Storage Operator for bare-metal edge.

ODF / Rook-Ceph

OpenShift Data Foundation (formerly OCS) deploys Rook-managed Ceph via the StorageCluster CR (ocs-storagecluster in openshift-storage namespace). Provides block (RBD), file (CephFS), and object (S3-compatible) from the same cluster—ideal for on-prem and hybrid when cloud block is unavailable.

StorageClass (typical) | Provisioner | Access | Use case
ocs-storagecluster-ceph-rbd | openshift-storage.rbd.csi.ceph.com | RWO | Databases, StatefulSets (default block on ODF)
ocs-storagecluster-cephfs | openshift-storage.cephfs.csi.ceph.com | RWX | Shared content, CI caches, multi-pod read/write
openshift-storage.noobaa.io | openshift-storage.noobaa.io/obc (NooBaa) | Object | S3-compatible buckets via ObjectBucketClaim

NFS provisioner

For clusters without ODF, OpenShift supports cluster NFS provisioners (legacy in-tree or CSI NFS). Mount an enterprise NAS export once; StorageClass provisions per-PVC subdirectories. Simpler than Ceph but single point of failure at the NAS head.
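
For the CSI NFS route, a StorageClass might look roughly like this. It assumes the nfs.csi.k8s.io driver listed earlier is installed; the class name, NAS host, and export path are hypothetical.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-shared                  # hypothetical class name
provisioner: nfs.csi.k8s.io
parameters:
  server: nas.example.internal      # hypothetical NAS host
  share: /exports/k8s               # hypothetical export; per-PV subdirectories are created beneath it
reclaimPolicy: Delete
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.1
```

Immediate binding is fine here because an NFS export is reachable from every node, unlike zone-bound block volumes.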

Local Storage Operator

With the Local Storage Operator, an admin declares LocalVolume or LocalVolumeSet CRs that select raw disks on specific nodes, and the operator creates a PV for each matching disk. These PVs have node affinity, so pods must schedule where the disk lives. High performance for Kafka, etcd, or Ceph OSDs; no cross-node resilience unless the app replicates (e.g. Ceph, Kafka RF3).
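
To make the node-affinity point concrete, a node-pinned local PV has roughly the following shape. The operator generates such objects itself; this hand-written version uses hypothetical names, paths, and sizes purely for illustration.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-worker-1-nvme0     # hypothetical
spec:
  capacity:
    storage: 1Ti
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-nvme      # hypothetical class
  local:
    path: /mnt/local-storage/local-nvme/nvme0n1   # hypothetical path on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-1          # the only node that can use this PV
```

Any pod whose claim binds to this PV is scheduled onto worker-1; if that node is lost, the data is lost with it unless the application replicates.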

Platform defaults

Managed OpenShift on AWS/GCP/Azure installs cloud CSI drivers as default StorageClasses. Bare-metal IPI/UPI installs often prompt for ODF or require admin to set a default class. Always inspect oc get sc before deploying stateful workloads—assumptions from vanilla K8s docs may not match the cluster.

bash
# OpenShift storage inspection
oc get storagecluster -n openshift-storage
oc get storageclass
oc describe storageclass ocs-storagecluster-ceph-rbd

# ODF health
oc get cephcluster -n openshift-storage
oc adm top pvc -n payments

# Local Storage Operator
oc get localvolumeset -n openshift-local-storage
oc get pv -l localvolume=
🔴 OpenShift

ODF installs via OperatorHub / OLM as ocs-operator. The StorageCluster CR drives node labels for storage roles (mon, osd, mgr). On compact clusters (3 nodes), all roles colocate—plan CPU/RAM overhead. Use ocs-storagecluster-ceph-rbd for databases and ocs-storagecluster-cephfs only when RWX is required.

🎯 Interview Tip

"How does storage differ on OpenShift vs vanilla K8s?" — OpenShift bundles ODF, Local Storage Operator, default SCC-aware security contexts, and platform monitoring for Ceph health. oc adds adm top pvc and integrated OperatorHub for storage operators. Core PV/PVC/CSI concepts are identical.