Storage, Volumes & PersistentVolumes
Pods are ephemeral; container filesystems vanish on restart. Kubernetes separates ephemeral pod volumes (config, secrets, scratch space) from the PersistentVolume subsystem (PV, PVC, StorageClass) that provisions durable block or file storage via CSI drivers. Get storage wrong and you get data loss on reschedule, PVCs stuck in Pending, or databases that corrupt under multi-writer RWX assumptions.
Volume types
Not every mount is a PersistentVolume. Pod-level volumes are declared in the Pod spec and exist for the pod's lifetime: they survive container restarts within the same pod, but their contents are lost on pod deletion unless backed by a PV.
| Type | Backed by | Survives container restart | Survives pod delete | Typical use |
|---|---|---|---|---|
| emptyDir | Node disk or RAM (medium: Memory) | Yes (same pod) | No | Scratch space, sidecar log shipping, inter-container sharing |
| configMap / secret | API object projected as files | Yes | No (recreated from API) | App config, TLS certs, credentials (prefer External Secrets in prod) |
| hostPath | Directory on the node filesystem | Yes | No | Node agents, dev-only shortcuts—avoid for app data |
| downwardAPI | Pod metadata via API | Yes | No | Expose labels, annotations, resource limits to the container |
| projected | Multiple sources in one mount | Yes | No | Service account token + config + downwardAPI in a single volume |
| persistentVolumeClaim | PV / CSI-provisioned storage | Yes | Data on PV persists; new pod reattaches via PVC | Databases, queues, any durable state |
emptyDir
Created when the pod is scheduled to a node. Initially empty; all containers in the pod can read/write the same path. Deleted when the pod is removed from the node. Use medium: Memory for tmpfs-style RAM disks (bounded by node memory; contents count against container memory limits if set).
configMap and secret volumes
Mount keys as individual files under a directory. Updates propagate on the kubelet sync interval (not instant), so apps should watch the files for changes. Files mounted via subPath are never updated after the container starts. Secrets are only base64-encoded, not encrypted, in the API (encryption at rest in etcd is a separate concern; see Configuration Management).
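The difference between a directory mount and a subPath mount can be sketched as follows. This is a minimal illustration, not a recommended layout: the pod name `config-demo` and the key `default.conf` inside `app-config` are assumptions made for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: config-demo              # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.27
      volumeMounts:
        # Directory mount: files are refreshed when the ConfigMap changes
        # (after the kubelet sync delay).
        - name: cfg
          mountPath: /etc/app
          readOnly: true
        # subPath mount: exposes a single file, but it is NOT refreshed
        # after the container has started.
        - name: cfg
          mountPath: /etc/nginx/conf.d/default.conf
          subPath: default.conf
          readOnly: true
  volumes:
    - name: cfg
      configMap:
        name: app-config         # assumed to contain a key named default.conf
```

If the app must pick up changes without a restart, the directory mount is the safer choice; with subPath, plan on a rollout to apply new config.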
hostPath
Maps a path on the host into the container. Ties data to a specific node; if the pod reschedules elsewhere, the data does not follow. Dangerous with privileged paths—restricted by Pod Security Admission on hardened clusters.
downwardAPI
Projects pod fields (name, namespace, labels, annotations, resource requests/limits) into files or env vars. Useful for observability sidecars that tag logs with pod metadata without talking to the API server.
projected volumes
Combines multiple volume sources into one mount—commonly service account tokens (with audience/expiry), configMaps, secrets, and downwardAPI. Replaces older separate mounts and supports automatic token rotation.
┌─── POD ─────────────────────────────────────────────────────┐
│  /data/scratch   ← emptyDir      [ephemeral, node-local]    │
│  /etc/config     ← configMap     [API-backed files]         │
│  /etc/secrets    ← secret        [API-backed, sensitive]    │
│  /var/lib/data   ← PVC → PV      [durable, CSI/block]       │
│  /meta           ← downwardAPI   [pod labels → files]       │
├─────────────────────────────────────────────────────────────┤
│  container A        container B (sidecar reads emptyDir)    │
└─────────────────────────────────────────────────────────────┘
apiVersion: v1
kind: Pod
metadata:
  name: volume-demo
spec:
  containers:
    - name: app
      image: nginx:1.27
      volumeMounts:
        - name: cache
          mountPath: /cache
        - name: cfg
          mountPath: /etc/app
          readOnly: true
        - name: meta
          mountPath: /etc/podinfo
  volumes:
    - name: cache
      emptyDir:
        medium: Memory
        sizeLimit: 256Mi
    - name: cfg
      configMap:
        name: app-config
    - name: meta
      downwardAPI:
        items:
          - path: labels
            fieldRef:
              fieldPath: metadata.labels
    - name: sa-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600
              audience: api
hostPath for application data: A pod rescheduled to another node loses hostPath data silently. DaemonSets using hostPath for logs are fine; stateful apps are not. Use PVCs for anything that must survive node failure.
Use emptyDir with a log-shipping sidecar: the app writes to a shared emptyDir; Fluent Bit or Vector tails and ships to Loki/Elasticsearch. No PVC cost, no orphaned volumes—logs are ephemeral by design.
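A minimal sketch of the sidecar pattern described above, assuming Fluent Bit as the shipper. The pod name, the image tag, and the choice to print to stdout instead of forwarding to Loki/Elasticsearch are placeholders for illustration; a real deployment would mount a Fluent Bit config with the proper output plugin.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-shipper         # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.27
      volumeMounts:
        # Mounting over /var/log/nginx makes nginx write real files
        # into the shared volume.
        - name: logs
          mountPath: /var/log/nginx
    - name: log-shipper
      image: fluent/fluent-bit:3.0   # assumed tag; pin to a version you have verified
      # Tail the shared directory and print to stdout (stand-in for a real output).
      command:
        - /fluent-bit/bin/fluent-bit
        - -i
        - tail
        - -p
        - path=/logs/*.log
        - -o
        - stdout
      volumeMounts:
        - name: logs
          mountPath: /logs
          readOnly: true
  volumes:
    - name: logs
      emptyDir:
        sizeLimit: 1Gi               # cap scratch usage; exceeding it evicts the pod
```

The sidecar mounts the volume read-only, so a misbehaving shipper cannot corrupt or truncate the application's log files.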
Persistent Volume subsystem
PVs represent actual storage capacity; PVCs are namespace-scoped claims; StorageClasses describe how to provision new PVs dynamically. The control plane binds a PVC to a matching PV—or triggers a provisioner to create one.
Core objects
| Object | Scope | Role |
|---|---|---|
| PersistentVolume (PV) | Cluster | Represents a piece of storage (NFS export, EBS volume, Ceph RBD image) |
| PersistentVolumeClaim (PVC) | Namespace | Pod's request for storage: size, access mode, StorageClass |
| StorageClass | Cluster | Provisioner, parameters, reclaim policy, binding mode |
Dynamic provisioning flow
sequenceDiagram
participant Dev as Developer
participant API as API Server
participant Ctrl as PVC Controller
participant Prov as CSI Provisioner
participant Back as Storage Backend
Dev->>API: Create PVC (storageClassName, size)
API->>Ctrl: Watch PVC Pending
Ctrl->>Prov: CreateVolume (CSI CreateVolume)
Prov->>Back: Allocate disk / export
Back-->>Prov: Volume ID
Prov->>API: Create PV (Bound to PVC)
API-->>Dev: PVC status Bound
- User creates a PVC referencing a storageClassName and resources.requests.storage.
- The external-provisioner sidecar, which watches for unbound PVCs whose StorageClass names its driver, calls the driver's CSI CreateVolume.
- The driver creates storage in the backend; the provisioner then creates a PV object with a claimRef pointing to the PVC.
- The PV controller completes the binding; PVC status becomes Bound and pods can reference the PVC in volumes.persistentVolumeClaim.
- Once the pod is scheduled, kubelet calls CSI NodeStageVolume and NodePublishVolume to mount the volume for the container.
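apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-data
  namespace: payments
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 100Gi
  # Filesystem (the default) gives Postgres a formatted, mounted directory.
  # volumeMode: Block would expose a raw device that Postgres cannot use as a data dir.
  volumeMode: Filesystem
---
apiVersion: v1
kind: Pod
metadata:
  name: postgres-0
  namespace: payments            # must match the PVC's namespace
spec:
  containers:
    - name: postgres
      image: postgres:16
      env:
        # Use a subdirectory so lost+found on a fresh ext4 volume does not block initdb.
        # POSTGRES_PASSWORD (e.g. from a Secret) is also required; omitted here.
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pg-data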
# Inspect PVC binding and events
kubectl get pvc -n payments
kubectl describe pvc pg-data -n payments
# List PVs and reclaim status
kubectl get pv
# OpenShift equivalents
oc get pvc -n payments
oc describe pvc pg-data -n payments
oc get storageclass
The external-provisioner sidecar watches PVCs with a matching storageClassName and no bound PV. It calls the CSI driver's CreateVolume gRPC, then creates a PV with spec.csi fields (driver, volumeHandle, fsType). The in-tree cloud volume plugins are deprecated—CSI is the only supported path for new integrations.
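For orientation, a dynamically provisioned PV for the pg-data claim might look roughly like the sketch below. The PV name, volumeHandle, and zone are made-up values; the provisioner generates the real ones, and the exact topology key depends on the driver.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  # Provisioners typically name the PV after the PVC's UID; this value is illustrative.
  name: pvc-3f1c9a52-7d0e-4b6a-9c1d-2e8f5a6b7c90
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain    # copied from the StorageClass at creation time
  storageClassName: gp3-encrypted
  claimRef:                                # back-reference to the bound PVC
    apiVersion: v1
    kind: PersistentVolumeClaim
    namespace: payments
    name: pg-data
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0    # hypothetical backend volume ID
    fsType: ext4
  nodeAffinity:                            # restricts scheduling to the volume's zone
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone   # assumed key; some drivers use their own
              operator: In
              values: ["us-east-1a"]
```

Reading `kubectl get pv <name> -o yaml` on a real cluster and comparing it with this shape is a quick way to confirm which driver and backend volume sit behind a claim.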
Storage classes
A StorageClass is the platform team's contract with developers: which provisioner runs, default reclaim behavior, whether volumes wait for a pod before binding, and whether expansion is allowed.
Common provisioners
| Provisioner | Backend | Typical access | Notes |
|---|---|---|---|
| ebs.csi.aws.com | AWS EBS (gp3, io2) | RWO | AZ-bound; pod must schedule in same AZ as volume |
| pd.csi.storage.gke.io | GCE Persistent Disk | RWO | Regional PDs for multi-zone resilience |
| disk.csi.azure.com | Azure Disk / Ultra | RWO | Ultra Disk for low-latency workloads |
| nfs.csi.k8s.io | NFS server export | RWX | Shared file storage; latency vs block |
| rook-ceph.rbd.csi.ceph.com | Ceph RBD (Rook) | RWO | On-prem / bare-metal block; ODF uses Rook-Ceph |
| driver.longhorn.io | Longhorn replicated block | RWO | Popular on edge and homelab clusters |
Key StorageClass fields
- provisioner — CSI driver name that handles create/delete.
- parameters — Driver-specific key/value pairs (e.g. type: gp3, encrypted: "true", fsType: xfs).
- reclaimPolicy — Delete (default for dynamic) or Retain (PV object kept; backend volume manual cleanup).
- volumeBindingMode — Immediate (provision on PVC create) vs WaitForFirstConsumer (delay until pod is scheduled—critical for topology-aware provisioning).
- allowVolumeExpansion — Enables PVC resize via kubectl patch when the CSI driver supports it.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  fsType: ext4
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
mountOptions:
  # nouuid is XFS-only and makes an ext4 mount fail; noatime is valid for ext4.
  - noatime
ReclaimPolicy Retain vs Delete:
Delete — PVC deletion triggers PV deletion and backend volume destruction. Clean for dev/test; risk of accidental data loss if someone deletes the wrong PVC. No orphaned cloud bills from forgotten EBS volumes.
Retain — PVC deletion sets PV to Released; the EBS disk / Ceph image remains. Platform team must manually reclaim or re-bind. Safer for production databases and compliance retention—but requires runbooks and periodic orphan audits.
Many enterprises set Retain on production StorageClasses and Delete on ephemeral dev classes.
Mark one StorageClass as cluster default with storageclass.kubernetes.io/is-default-class: "true". PVCs without storageClassName bind to it. On OpenShift, platform defaults often come from the installed storage operator—override explicitly in app manifests to avoid surprises.
# Expand a PVC (requires allowVolumeExpansion + CSI support)
kubectl patch pvc pg-data -n payments -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
# Verify expansion status
kubectl get pvc pg-data -n payments -o jsonpath='{.status.conditions}'
# Set default StorageClass
kubectl annotate storageclass gp3-encrypted storageclass.kubernetes.io/is-default-class=true --overwrite
# OpenShift equivalent
oc patch pvc pg-data -n payments -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
Access modes
Access modes describe how many nodes can mount a volume simultaneously—not POSIX file permissions inside the mount. Mismatch between what your app needs and what the StorageClass provides is a top cause of StatefulSet failures.
| Mode | Abbrev | Meaning | Typical backend |
|---|---|---|---|
| ReadWriteOnce | RWO | Single node read/write (multiple pods on same node OK) | EBS, Azure Disk, Ceph RBD, Longhorn |
| ReadOnlyMany | ROX | Many nodes read-only | NFS, CephFS (read-only export), config snapshots |
| ReadWriteMany | RWX | Many nodes read/write concurrently | NFS, CephFS, Azure Files, EFS |
| ReadWriteOncePod | RWOP | Exactly one pod in the cluster (alpha in K8s 1.22, GA in 1.29; CSI volumes only) | Block volumes where even same-node multi-pod is unsafe |
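As a small illustration of the last row, a claim that requests single-pod access could be written as below. The claim name `ledger-data` and the size are assumptions; the StorageClass is the gp3-encrypted class used elsewhere in this guide.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ledger-data              # hypothetical name
  namespace: payments
spec:
  accessModes:
    - ReadWriteOncePod           # a second pod using this claim stays unschedulable
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 50Gi
```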
Cloud block storage limitations
AWS EBS, GCE PD, and Azure Managed Disks normally attach to one node at a time, so they support RWO (and RWOP), not RWX. Running two replicas of a Deployment against the same RWO PVC fails as soon as the replicas land on different nodes: the second pod is scheduled but stays in ContainerCreating with a Multi-Attach error event. For shared files use NFS, EFS, CephFS, or Azure Files, not block StorageClasses.
RWO (block)                RWX (shared file)
─────────────              ───────────────────
Node A                     Node A ──┐
 └─ Pod 1 ✓                  Pod 1  ├── NFS export
Node B                     Node B ──┤   (shared)
 └─ Pod 2 ✗ (blocked)        Pod 2  ┘
"Can two pods share one PVC?" — Depends on access mode and backend. RWO: only if both pods land on the same node (fragile). RWX: yes, across nodes—requires a file protocol backend. RWOP: never more than one pod cluster-wide. Block + RWX is invalid on most clouds.
RWX NFS exports increase blast radius—a compromised pod on any node can modify shared data. Scope RWX to namespaces that need it; use fsGroup and supplementalGroups for POSIX permissions; prefer RWO per replica for databases.
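One way to express the fsGroup and supplementalGroups advice is a pod-level securityContext, sketched below. The namespace, claim name, and numeric IDs are placeholders. Whether fsGroup ownership is actually applied to the mount depends on the CSI driver's fsGroupPolicy, and many NFS setups rely on export-side permissions instead.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-writer            # hypothetical name
  namespace: media               # hypothetical namespace that is allowed RWX storage
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000                # group applied to the volume where the driver supports it
    supplementalGroups: [4000]   # extra GID, e.g. one that matches the export's group
  containers:
    - name: app
      image: busybox:1.36
      # Print the effective UID/GIDs, then idle so the mount can be inspected.
      command: ["sh", "-c", "id && sleep 3600"]
      volumeMounts:
        - name: shared
          mountPath: /shared
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-content   # assumed existing RWX claim
```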
Volume lifecycle
From admin-precreated PVs to fully dynamic CSI provisioning, understanding binding and reclaim phases prevents orphaned disks, stuck namespaces, and accidental production data deletion.
Static vs dynamic provisioning
| Approach | Who creates PV | When to use |
|---|---|---|
| Static | Cluster admin pre-creates PV + backend LUN/export | Legacy SAN/NFS, strict capacity planning, air-gapped without dynamic provisioner |
| Dynamic | Provisioner on PVC create | Default for cloud-native; StorageClass drives parameters |
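For the static row, a pre-created NFS-backed PV and a claim pinned to it might look like this sketch. The server address, export path, and sizes are invented for illustration; `pv-001` and `my-claim` reuse the names from the troubleshooting commands in this section.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-001
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""              # no class: never matched by dynamic provisioning
  nfs:
    server: nas.example.internal    # hypothetical NAS address
    path: /exports/app-data         # hypothetical export
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
  namespace: app
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""              # empty string opts out of the default StorageClass
  volumeName: pv-001                # bind to this specific PV
  resources:
    requests:
      storage: 500Gi
```

Setting `storageClassName: ""` on both objects matters: leaving it unset on the PVC would let the cluster's default class provision a fresh volume instead of binding to the pre-created one.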
PVC / PV phases
stateDiagram-v2
[*] --> Pending: PVC created
Pending --> Bound: PV matched or provisioned
Bound --> [*]: PVC deleted
note right of Bound
Pod mounts via kubelet + CSI
end note
[*] --> Available: Static PV ready
Available --> Bound: claimRef set
Bound --> Released: PVC deleted
Released --> Available: claimRef cleared (manual)
Released --> [*]: Reclaim Delete or admin cleanup
- Pending — PVC waiting for matching PV or provisioner. Check events: wrong StorageClass, insufficient quota, provisioner down, or WaitForFirstConsumer waiting for a schedulable pod.
- Bound — PVC linked to PV; pods can mount. PV claimRef prevents other PVCs from stealing it.
- Released — PVC deleted but PV retained (reclaimPolicy: Retain). PV still holds stale claimRef; must be cleared before reuse.
- Reclaim — With Delete, the external-provisioner calls CSI DeleteVolume to remove the backend volume, then deletes the PV object. With Retain, an admin snapshots or archives the backend volume and deletes the PV manually.
# PVC stuck Pending — check events and StorageClass
kubectl describe pvc my-claim -n app
kubectl get events -n app --field-selector involvedObject.name=my-claim
# Released PV — clear claimRef to reuse (static workflows)
kubectl patch pv pv-001 -p '{"spec":{"claimRef": null}}'
# Find orphaned PVs (Released phase)
kubectl get pv | grep Released
# Force-delete stuck PVC (last resort — understand data impact)
kubectl patch pvc my-claim -n app -p '{"metadata":{"finalizers":null}}' --type=merge
A common post-incident finding: dev StorageClass with reclaimPolicy: Delete and engineers using the same class name in staging. A Helm uninstall deletes PVCs and wipes staging databases. Fix: separate StorageClasses per environment, Retain on anything stateful, and Velero backups before destructive operations.
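A sketch of the "separate classes per environment" fix: a distinctly named dev class with Delete, so it cannot be confused with a production class that uses Retain. The class name `gp3-dev-ephemeral` is an assumption; the parameters assume the AWS EBS CSI driver used in the other examples.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-dev-ephemeral           # hypothetical; the name signals that data is disposable
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
reclaimPolicy: Delete               # deleting the PVC destroys the backend volume
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Keeping the reclaim behavior visible in the class name makes a wrong `storageClassName` in a staging or production manifest easier to catch in review.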
Container Storage Interface (CSI)
CSI standardizes how orchestrators talk to storage vendors over gRPC. Each driver ships a controller plugin (provision, attach, snapshot) and node plugin (mount on kubelet host)—replacing in-tree cloud provider code.
CSI components in Kubernetes
- CSI driver controller — CreateVolume, DeleteVolume, CreateSnapshot, controller publish.
- CSI driver node — DaemonSet on every node: NodeStageVolume, NodePublishVolume.
- external-provisioner — Watches PVCs, creates PVs.
- external-attacher — Watches VolumeAttachment objects and calls ControllerPublishVolume / ControllerUnpublishVolume to attach and detach volumes on nodes.
- external-snapshotter — Watches VolumeSnapshot CRs.
- external-resizer — Handles PVC expansion.
Volume snapshots
VolumeSnapshot objects are namespaced (snapshot.storage.k8s.io/v1); VolumeSnapshotClass and VolumeSnapshotContent are cluster-scoped. A VolumeSnapshotClass names the driver and deletion policy; a VolumeSnapshot references either a PVC in its own namespace or a pre-existing VolumeSnapshotContent. Restore by creating a new PVC with dataSource pointing to the VolumeSnapshot.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-snapclass
driver: ebs.csi.aws.com
deletionPolicy: Retain
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pg-data-snap-20250605
  namespace: payments
spec:
  volumeSnapshotClassName: csi-aws-snapclass
  source:
    persistentVolumeClaimName: pg-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-data-restored
  namespace: payments
spec:
  storageClassName: gp3-encrypted
  dataSource:
    name: pg-data-snap-20250605
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
# Verify CSI driver and snapshot CRDs
kubectl get csidrivers
kubectl api-resources | grep snapshot
# Create and inspect snapshot
kubectl apply -f pg-snapshot.yaml
kubectl get volumesnapshot -n payments
kubectl describe volumesnapshot pg-data-snap-20250605 -n payments
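# List CSI node pods (the label varies by driver; app=ebs-csi-node is what the AWS EBS CSI driver typically uses)
kubectl get pods -n kube-system -l app=ebs-csi-node
# OpenShift equivalent
oc get volumesnapshot -n payments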
In-tree deprecation
Legacy in-tree volume plugins (kubernetes.io/aws-ebs, GCE PD, Azure Disk) have been removed from current Kubernetes releases. Existing PVs that reference them keep working through the CSI migration feature, which transparently redirects in-tree volume operations to the corresponding CSI driver, so that driver must be installed before upgrading. New clusters should install only CSI drivers from day one.
When a pod that uses a PVC is scheduled, the attach/detach controller creates a VolumeAttachment object. The external-attacher calls CSI ControllerPublishVolume to attach the disk to the node. The kubelet then calls the node plugin's NodeStageVolume and NodePublishVolume to mount the volume into the pod's volume directory, which is why the CSI node DaemonSet must run on every worker that hosts such pods.
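The VolumeAttachment object is created by the control plane rather than by users, but knowing its shape helps when debugging stuck attaches with `kubectl get volumeattachment`. The sketch below uses invented object, node, and PV names.

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-example-attachment      # hypothetical; real names are generated
spec:
  attacher: ebs.csi.aws.com         # CSI driver expected to perform the attach
  nodeName: worker-1                # hypothetical node the pod was scheduled to
  source:
    persistentVolumeName: pvc-3f1c9a52-7d0e-4b6a-9c1d-2e8f5a6b7c90   # hypothetical PV
status:
  attached: true                    # set by the external-attacher once the attach succeeds
```

An object that stays at `attached: false`, or that lingers after the pod is gone, usually points at the controller plugin or the cloud API rather than at the kubelet.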
Stateful application patterns
StatefulSets give stable network identity and per-replica PVCs via volumeClaimTemplates. Databases on Kubernetes are viable when you respect storage semantics, backups, and failure domains—not when you treat PVCs like magic infinite disks.
StatefulSet + PVC templates
Each pod name-N gets its own PVC data-name-N from the template. Pods are created ordinally; PVCs persist after StatefulSet scale-down—by design. Use podManagementPolicy: Parallel only when ordering does not matter.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7.2
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 20Gi
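The StatefulSet above names a governing Service (`serviceName: redis`) that must exist for the pods to get stable DNS names. A minimal headless Service for it could look like the following sketch; the port name is an arbitrary choice.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: redis                 # must match spec.serviceName in the StatefulSet
spec:
  clusterIP: None             # headless: DNS resolves to individual pod IPs
  selector:
    app: redis
  ports:
    - name: redis
      port: 6379
```

With this in place each replica is reachable as `redis-cluster-N.redis` within the namespace, and its PVC is named `data-redis-cluster-N`.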
Shared RWX pattern
Shared content repositories, CI artifact caches, and legacy apps expecting NFS use a single RWX PVC mounted by multiple Deployment replicas. This requires CephFS, NFS, EFS, or Azure Files, not block StorageClasses. Watch file-lock semantics and performance under concurrent writers.
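A sketch of the shared pattern: one RWX claim mounted by every replica of a Deployment. The namespace, object names, size, and the StorageClass `shared-files` are assumptions; substitute whichever file-backed class the cluster actually offers.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-content             # hypothetical
  namespace: media                 # hypothetical
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: shared-files   # hypothetical file-backed class (NFS, CephFS, EFS, ...)
  resources:
    requests:
      storage: 200Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: content-server
  namespace: media
spec:
  replicas: 3
  selector:
    matchLabels:
      app: content-server
  template:
    metadata:
      labels:
        app: content-server
    spec:
      containers:
        - name: web
          image: nginx:1.27
          volumeMounts:
            - name: content
              mountPath: /usr/share/nginx/html
      volumes:
        - name: content
          persistentVolumeClaim:
            claimName: shared-content   # every replica mounts the same claim
```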
Database on Kubernetes
- Single-primary RWO — Postgres/MySQL with one replica and PVC; backups via VolumeSnapshot or logical dump (pg_dump).
- Managed DB often wins — RDS/Cloud SQL handles patching, failover, and IOPS planning; K8s runs the app tier.
- When in-cluster — Use operators (CloudNativePG, Percona, Strimzi for Kafka) for failover, backup CRDs, and version upgrades.
- Topology — WaitForFirstConsumer provisions each volume in the AZ where its pod is scheduled; pod anti-affinity or topology spread constraints place replicas in different AZs.
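The topology point can be sketched with zone-level anti-affinity on a StatefulSet whose StorageClass uses WaitForFirstConsumer. This only illustrates placement: it runs three independent Postgres instances, not a replicated cluster. The names `pg` and `pg-credentials` are hypothetical.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pg                          # hypothetical
  namespace: payments
spec:
  serviceName: pg
  replicas: 3
  selector:
    matchLabels:
      app: pg
  template:
    metadata:
      labels:
        app: pg
    spec:
      affinity:
        podAntiAffinity:
          # At most one replica per zone; each PVC is then provisioned in that zone.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: pg
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: pg-credentials   # hypothetical Secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted  # WaitForFirstConsumer class
        resources:
          requests:
            storage: 100Gi
```

With a hard anti-affinity rule, a cluster with fewer zones than replicas leaves the extra pods Pending; `preferredDuringSchedulingIgnoredDuringExecution` is the softer alternative.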
Operators
Operators encode day-2 operations: backup schedules, point-in-time recovery, rolling credential rotation, and rebalancing. They reconcile custom resources (e.g. PostgresCluster) into StatefulSets, Services, PVCs, and CronJobs— see Operators & CRDs.
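As one concrete instance, CloudNativePG models a Postgres cluster with a custom resource of kind Cluster (other operators use different kinds, such as the PostgresCluster mentioned above). The sketch below assumes the CloudNativePG operator is already installed; the name and sizes are placeholders.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: payments-db               # hypothetical
  namespace: payments
spec:
  instances: 3                    # one primary plus two replicas, managed by the operator
  storage:
    size: 100Gi
    storageClass: gp3-encrypted   # each instance gets its own RWO PVC from this class
```

The operator reconciles this single object into the pods, PVCs, and Services that would otherwise be hand-written manifests.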
Database in K8s vs managed service: In-cluster gives GitOps uniformity and snapshot CRDs but shifts patching, failover, and capacity planning to your platform team. Managed RDS/Aurora buys operational simplicity at the cost of vendor lock-in and extra network latency between the app and the database. Architects often run stateless workloads on K8s and stateful ones on a managed DB, unless operators and SRE maturity justify full-stack ownership.
Before production cutover, failure-test: delete a worker node, verify StatefulSet pod reschedules and RWO volume reattaches in the same AZ. Simulate AZ outage—regional disks or cross-AZ restore from snapshots should be documented, not discovered during an incident.
OpenShift storage
OpenShift ships opinionated storage operators and default StorageClasses. Platform teams choose between OpenShift Data Foundation (ODF/Rook-Ceph), cloud CSI defaults, NFS provisioners, or Local Storage Operator for bare-metal edge.
ODF / Rook-Ceph
OpenShift Data Foundation (formerly OCS) deploys Rook-managed Ceph via the StorageCluster CR (ocs-storagecluster in openshift-storage namespace). Provides block (RBD), file (CephFS), and object (S3-compatible) from the same cluster—ideal for on-prem and hybrid when cloud block is unavailable.
| StorageClass (typical) | Provisioner | Access | Use case |
|---|---|---|---|
| ocs-storagecluster-ceph-rbd | openshift-storage.rbd.csi.ceph.com | RWO | Databases, StatefulSets (default block on ODF) |
| ocs-storagecluster-cephfs | openshift-storage.cephfs.csi.ceph.com | RWX | Shared content, CI caches, multi-pod read/write |
| openshift-storage.noobaa.io | openshift-storage.noobaa.io/obc | Object | S3-compatible buckets via ObjectBucketClaim |
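Object storage is requested with an ObjectBucketClaim rather than a PVC. A minimal sketch, with a hypothetical claim name and namespace:

```yaml
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: artifacts-bucket          # hypothetical
  namespace: ci                   # hypothetical
spec:
  generateBucketName: artifacts   # prefix; the provisioner appends a unique suffix
  storageClassName: openshift-storage.noobaa.io
```

Once the claim is bound, the provisioner creates a ConfigMap and a Secret with the same name as the claim, holding the bucket endpoint and access keys for the application to consume.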
NFS provisioner
For clusters without ODF, an NFS provisioner (the older nfs-subdir-external-provisioner or the CSI NFS driver) can back dynamic PVCs. Point it at an enterprise NAS export once; the StorageClass then provisions a subdirectory per PVC. Simpler than Ceph, but the NAS head is a single point of failure.
Local Storage Operator
The Local Storage Operator turns raw disks on selected nodes into PVs: an admin declares LocalVolume CRs (explicit device paths) or LocalVolumeSet CRs (filter-based discovery), and the operator creates the matching PVs. These PVs have node affinity, so pods must schedule where the disk lives. High performance for Kafka, etcd, or Ceph OSDs; no cross-node resilience unless the app replicates (e.g. Ceph, Kafka RF3).
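The node-affinity constraint is easiest to see in a plain local PV, which is roughly the kind of object the operator generates. The sketch uses only core Kubernetes fields; the class name, device path, and hostname are invented.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme                           # hypothetical
provisioner: kubernetes.io/no-provisioner    # static PVs only, no dynamic provisioning
volumeBindingMode: WaitForFirstConsumer      # bind only once the pod's node is known
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-worker-1                    # hypothetical
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme
  local:
    path: /mnt/local-storage/local-nvme/nvme0n1   # hypothetical path on the node
  nodeAffinity:                              # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["worker-1"]           # hypothetical node name
```

Any pod that claims this PV can only ever run on worker-1, so losing that node means losing access to the data until the node returns.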
Platform defaults
Managed OpenShift on AWS/GCP/Azure installs cloud CSI drivers as default StorageClasses. Bare-metal IPI/UPI installs typically come up with no default StorageClass until an admin installs ODF or another provisioner and marks a class as default. Always inspect oc get sc before deploying stateful workloads; assumptions from vanilla K8s docs may not match the cluster.
# OpenShift storage inspection
oc get storagecluster -n openshift-storage
oc get storageclass
oc describe storageclass ocs-storagecluster-ceph-rbd
# ODF health
oc get cephcluster -n openshift-storage
oc adm top pvc -n payments
# Local Storage Operator
oc get localvolumeset -n openshift-local-storage
oc get pv -l localvolume=
ODF installs via OperatorHub / OLM as ocs-operator. The StorageCluster CR drives node labels for storage roles (mon, osd, mgr). On compact clusters (3 nodes), all roles colocate—plan CPU/RAM overhead. Use ocs-storagecluster-ceph-rbd for databases and ocs-storagecluster-cephfs only when RWX is required.
"How does storage differ on OpenShift vs vanilla K8s?" — OpenShift bundles ODF, Local Storage Operator, default SCC-aware security contexts, and platform monitoring for Ceph health. oc adds adm top pvc and integrated OperatorHub for storage operators. Core PV/PVC/CSI concepts are identical.