Cheat Sheets
Three exhaustive quick references by persona: daily kubectl/oc
workflows, platform operations, and cluster architecture decisions.
K8s 1.28+
OCP 4.16+
CKA/CKAD/CKS
Developer
Context, namespace & discovery
Task | Command | Notes
List contexts | kubectl config get-contexts | * = current
Switch context | kubectl config use-context prod-east | Cluster + user + namespace
Current context | kubectl config current-context | -
Set default namespace | kubectl config set-context --current --namespace=dev | Avoid -n on every command
List all namespaces | kubectl get ns | oc get projects on OCP
Explain API field | kubectl explain pod.spec.containers.livenessProbe | Recursive: --recursive
API resources | kubectl api-resources | Short names: po, deploy, svc
Watch resource | kubectl get pods -w | --watch-only skips initial list
kubectl get — common flags
Flag | Purpose | Example
-n | Namespace | kubectl get pods -n prod
-A / --all-namespaces | All namespaces | kubectl get pods -A
-l | Label selector | -l app=web,env=prod
-o wide | Extra columns (node, IP) | kubectl get pods -o wide
-o yaml / -o json | Full manifest | Pipe to grep / save template
-o name | Resource name only | For loops: for p in $(kubectl get pods -o name); do …
--sort-by | Sort output | --sort-by=.metadata.creationTimestamp
--field-selector | Filter by field | status.phase=Pending
--show-labels | Print labels column | Debug selector mismatches
-w | Watch stream | Live deploy monitoring
Apply, create & imperative shortcuts
Command | Purpose | Notes
kubectl apply -f manifest.yaml | Declarative create/update | Preferred for GitOps
kubectl apply -f dir/ | Apply all YAML in directory | Recursive with -R
kubectl apply -k overlays/dev | Kustomize overlay | Built-in Kustomize
kubectl apply -f manifest.yaml --dry-run=client -o yaml | Client-side dry run | Prints the resulting object without persisting it
kubectl apply -f manifest.yaml --server-side | Server-side apply | Field ownership tracking
kubectl diff -f manifest.yaml | Preview changes | CI gate before apply
kubectl create deployment web --image=nginx:1.25 | Imperative deploy | Quick test; export to YAML after
kubectl expose deployment web --port=80 | Create Service | ClusterIP by default
kubectl scale deployment web --replicas=5 | Scale replicas | HPA overrides manual scale
kubectl delete -f manifest.yaml | Delete by manifest | --ignore-not-found in scripts
kubectl replace --force -f pod.yaml | Delete + recreate | Disruptive; avoid in prod
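# Export running deployment to YAML (strip cluster metadata; kubectl neat is a krew plugin)
kubectl get deploy web -o yaml | kubectl neat > web.yaml
# Generate YAML without creating (exam pattern)
kubectl create deployment web --image=nginx --dry-run=client -o yaml > deploy.yaml
# OpenShift (oc) variant
oc new-app --name=web nginx:1.25
oc expose service/web # creates the Route queried on the next line
oc get route web -o jsonpath='{.spec.host}'
oc status
oc logs -f deployment/web # oc new-app creates a Deployment, not a DeploymentConfig, on OCP 4.16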
Pod lifecycle, phases & conditions
Status / Phase | Meaning | First diagnostic step
Pending | Not yet scheduled or image pulling | kubectl describe pod → Events; check scheduler/taints/PVC
ContainerCreating | Scheduled; pulling image / mounting volumes | Events; describe volume mounts
Running | At least one container running | Check readiness if not receiving traffic
Succeeded | All containers exited 0 (Job) | Expected for batch; check Job status
Failed | Container exited non-zero | kubectl logs --previous
CrashLoopBackOff | Container crash → backoff retry | logs --previous; check exit code, probes, config
ImagePullBackOff | Cannot pull image | Verify tag, registry auth, imagePullSecrets
ErrImagePull | Immediate pull failure | Same as ImagePullBackOff
OOMKilled | Memory limit exceeded | Raise limits.memory or fix leak
Evicted | Node pressure eviction | kubectl describe node; disk/memory pressure
Terminating | Deletion in progress | Stuck? Check finalizers, PDB, grace period
Condition | True means
PodScheduled | Assigned to a node
Initialized | Init containers completed
ContainersReady | All containers passed readiness
Ready | Pod eligible for Service endpoints
Probes — liveness, readiness, startup
Probe | Purpose | Failure action | Typical settings
readinessProbe | Ready for traffic? | Removed from Service endpoints | periodSeconds: 10, failureThreshold: 3
livenessProbe | Process alive? | Container restart | Less aggressive than readiness; avoid DB checks
startupProbe | Slow-start apps | Blocks liveness until success | failureThreshold: 30 for JVM warm-up
readinessProbe:
  httpGet: { path: /health/ready, port: 8080, scheme: HTTP }
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
livenessProbe:
  httpGet: { path: /health/live, port: 8080 }
  periodSeconds: 20
  failureThreshold: 3
startupProbe:
  httpGet: { path: /health/ready, port: 8080 }
  periodSeconds: 5
  failureThreshold: 30 # 150s startup budget
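The 150s figure in the startupProbe comment is periodSeconds multiplied by failureThreshold. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and the result is an approximation that ignores timeoutSeconds and probe scheduling jitter.

```python
def startup_budget_seconds(period_seconds: int,
                           failure_threshold: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container gets to pass its startup probe.

    Hypothetical helper: budget = initial delay + period * allowed failures.
    """
    return initial_delay_seconds + period_seconds * failure_threshold


# Values from the startupProbe above: periodSeconds 5, failureThreshold 30.
print(startup_budget_seconds(period_seconds=5, failure_threshold=30))
```

Raising failureThreshold rather than periodSeconds extends the budget while still detecting a successful start within one short period.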
Resources, QoS & limits
QoS class | Condition | Eviction order
Guaranteed | requests == limits for all containers (cpu+memory) | Last evicted
Burstable | requests set, limits differ or partial | Middle
BestEffort | No requests or limits | First evicted
Field | Used for | Production guidance
requests.cpu | Scheduling + HPA % | Set based on p95 usage
limits.cpu | CPU throttling cap | Omit or set high for latency-sensitive apps
requests.memory | Scheduling | Set == limits.memory on stateful apps (Guaranteed QoS also needs cpu requests == limits)
limits.memory | Hard cap → OOMKill | Always set; prevents node-wide OOM
resources:
  requests: { cpu: 250m, memory: 512Mi }
  limits: { cpu: "1", memory: 512Mi } # memory request == limit; pod QoS is still Burstable because cpu request != limit
kubectl top pod web-abc -n prod
kubectl describe pod web-abc | grep -A5 "Limits\|Requests\|QoS"
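The QoS table above can be expressed as a small classifier. This is a sketch in Python, not the kubelet's implementation: the function name and input shape are hypothetical, and quantities are compared as literal strings, so "1" and "1000m" would be treated as different.

```python
from typing import Dict, List

# Each container: {"requests": {...}, "limits": {...}}; both keys optional.
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for res in ("cpu", "memory"):
            limit = limits.get(res)
            # An unset request defaults to the limit.
            request = requests.get(res, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# The resources block above: memory matches, cpu does not.
pod = [{"requests": {"cpu": "250m", "memory": "512Mi"},
        "limits": {"cpu": "1", "memory": "512Mi"}}]
print(qos_class(pod))
```

The class actually assigned to a running pod can be checked in the QoS line of kubectl describe pod.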
Logs, exec, debug & copy
Command | Purpose
kubectl logs -f deploy/web -c app --tail=200 | Stream logs from deployment container
kubectl logs pod/web-abc --previous | Logs from crashed container instance
kubectl logs -l app=web --all-containers --prefix | All pods matching label
kubectl logs pod/web-abc --since=10m --timestamps | Time-bounded logs
kubectl exec -it pod/web-abc -c app -- sh | Interactive shell
kubectl exec deploy/web -- curl -s localhost:8080/health | One-off command
kubectl debug pod/web-abc -it --image=busybox --target=app | Ephemeral debug container (K8s 1.23+)
kubectl debug node/node-1 -it --image=ubuntu | Node shell via debug pod in the host namespaces (node filesystem at /host)
kubectl cp web-abc:/var/log/app.log ./app.log | Copy files from pod (source is [namespace/]pod:path, so no pod/ prefix; needs tar in the container)
kubectl attach pod/web-abc -c app -i -t | Attach to running process stdin
kubectl get events -n dev --sort-by=.lastTimestamp | Recent events in the namespace
# Debug toolbox pod (curl, dig, tcpdump, iperf)
kubectl run netshoot --rm -it --restart=Never \
  --image=nicolaka/netshoot -- bash
# Test service DNS from inside cluster
kubectl run tmp --rm -it --restart=Never --image=busybox -- \
  wget -qO- http://web.dev.svc.cluster.local/health
# OpenShift (oc) variant
oc logs -f deployment/web # use dc/web only for a legacy DeploymentConfig
oc rsh pod/web-abc
oc exec pod/web-abc -- curl -s localhost:8080
oc debug pod/web-abc -it --image=registry.redhat.io/ubi9/ubi-minimal
Port-forward, proxy & local access
Command | Maps | Use when
kubectl port-forward svc/web 8080:80 | Service → local | Test app without Ingress
kubectl port-forward pod/web-abc 8080:8080 | Pod → local | Debug specific pod instance
kubectl port-forward deploy/web 8080:8080 | Deployment (one pod) | Forwards to a single selected pod; not load-balanced
kubectl port-forward -n kube-system svc/prometheus 9090:9090 | Cross-namespace | Local Grafana/Prometheus access
kubectl proxy --port=8001 | API server proxy | localhost:8001/api/v1/namespaces
DNS & service discovery
Pattern | Resolves to
<service> | Same namespace Service ClusterIP
<service>.<namespace> | Cross-namespace Service
<service>.<namespace>.svc.cluster.local | FQDN (always works)
<pod-ip-dashed>.<namespace>.pod.cluster.local | Pod IP directly (dots in the IP replaced by dashes)
<statefulset-0>.<headless-svc>.<ns>.svc.cluster.local | StatefulSet pod DNS
# Inside pod — check DNS
cat /etc/resolv.conf # nameserver = CoreDNS (10.96.0.10 typical)
nslookup web
dig +short web.dev.svc.cluster.local
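The DNS patterns above are plain string templates. The Python sketch below builds them; the helper names are hypothetical, and the cluster domain is assumed to be cluster.local as in the table.

```python
def service_fqdn(service: str, namespace: str,
                 cluster_domain: str = "cluster.local") -> str:
    """<service>.<namespace>.svc.<cluster-domain>"""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def pod_dns_name(pod_ip: str, namespace: str,
                 cluster_domain: str = "cluster.local") -> str:
    """<pod-ip-dashed>.<namespace>.pod.<cluster-domain>"""
    return f"{pod_ip.replace('.', '-')}.{namespace}.pod.{cluster_domain}"


def statefulset_pod_fqdn(statefulset: str, ordinal: int, headless_service: str,
                         namespace: str,
                         cluster_domain: str = "cluster.local") -> str:
    """<statefulset>-<ordinal>.<headless-svc>.<namespace>.svc.<cluster-domain>"""
    return (f"{statefulset}-{ordinal}.{headless_service}."
            f"{namespace}.svc.{cluster_domain}")


print(service_fqdn("web", "dev"))
print(pod_dns_name("10.244.1.7", "dev"))  # example IP, chosen for illustration
print(statefulset_pod_fqdn("db", 0, "db-headless", "prod"))
```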
Deployment + Service YAML (production baseline)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels: { app: web, app.kubernetes.io/name: web }
spec:
  replicas: 3
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxUnavailable: 0, maxSurge: 1 }
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      serviceAccountName: web
      securityContext:
        runAsNonRoot: true
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: app
          image: nginx:1.25-alpine@sha256:…
          ports: [{ name: http, containerPort: 8080 }]
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: 500m, memory: 256Mi }
          readinessProbe:
            httpGet: { path: /health/ready, port: http }
            periodSeconds: 10
          livenessProbe:
            httpGet: { path: /health/live, port: http }
            periodSeconds: 20
          securityContext:
            allowPrivilegeEscalation: false
            capabilities: { drop: [ALL] }
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP
  selector: { app: web }
  ports:
    - name: http
      port: 80
      targetPort: http
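The rollingUpdate settings in the Deployment above (maxUnavailable: 0, maxSurge: 1) bound how many pods exist during a rollout. A Python sketch of those bounds follows, assuming absolute integers rather than percentages; the function name is hypothetical.

```python
from typing import Dict


def rollout_bounds(replicas: int, max_surge: int,
                   max_unavailable: int) -> Dict[str, int]:
    """Pod-count bounds during a RollingUpdate (integer settings only)."""
    return {
        # Upper bound on old + new pods that may exist at once.
        "max_total_pods": replicas + max_surge,
        # Lower bound on pods that must stay available.
        "min_available_pods": replicas - max_unavailable,
    }


# replicas: 3, maxSurge: 1, maxUnavailable: 0 from the manifest above.
print(rollout_bounds(replicas=3, max_surge=1, max_unavailable=0))
```

With maxUnavailable at 0, capacity never drops below the replica count, at the cost of needing room for one extra pod while the rollout runs.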
ConfigMap, Secret & Ingress YAML
# ConfigMap — file + key
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  app.properties: |
    server.port=8080
    feature.flags=beta
  LOG_LEVEL: info
---
# Secret — Opaque (base64 values; enable encryption at rest in prod)
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
stringData: # prefer stringData over data in git templates
  DB_PASSWORD: changeme
---
# Mount in pod
envFrom:
  - configMapRef: { name: app-config }
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef: { name: app-secrets, key: DB_PASSWORD }
volumeMounts:
  - name: config-vol
    mountPath: /etc/app
    readOnly: true
volumes:
  - name: config-vol
    configMap:
      name: app-config
---
# Ingress (requires ingress controller)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls: [{ hosts: [app.example.com], secretName: app-tls }]
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: web, port: { number: 80 } }
Job, CronJob & HPA snippets
# Job — run to completion
apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-db
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: myapp:migrate-v3
          command: ["./migrate.sh"]
---
# CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: myapp:report
---
# HPA — requires metrics-server + CPU requests set
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
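For the HPA above, the replica count follows the documented formula desired = ceil(current × currentMetric / targetMetric), clamped to minReplicas and maxReplicas. The Python sketch below is a simplification: the function name is hypothetical, a 10% tolerance band is assumed, and stabilization windows, unready pods, and missing metrics are ignored.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int,
                     max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA scaling decision for a single utilization metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 3 replicas at 140% of requested CPU against the 70% target above.
print(desired_replicas(3, 140, 70, min_replicas=2, max_replicas=10))
```

Utilization is measured against requests.cpu, which is why the HPA comment above requires CPU requests to be set.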
Rollout commands (developer view)
kubectl rollout status deployment/web
kubectl rollout history deployment/web
kubectl rollout undo deployment/web
kubectl rollout undo deployment/web --to-revision=3
kubectl rollout pause deployment/web # batch manifest changes
kubectl rollout resume deployment/web
kubectl set image deployment/web app=myreg/web:v2.4.1
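kubectl set env deployment/web FEATURE_X=true
# OpenShift (oc) variant; dc/web is a DeploymentConfig (deprecated since OCP 4.14), use deployment/web for a Deployment
oc rollout status dc/web
oc rollout history dc/web
oc set env dc/web FEATURE_X=true
oc tag myapp:v2.4.1 web:latest # ImageStream trigger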
💡 Pro Tip
kubectl explain + --dry-run=client -o yaml covers most CKAD exam YAML generation—memorize probe fields and resource units (100m, 128Mi).
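The resource units mentioned in the tip (100m, 128Mi) convert to plain numbers as follows. This Python sketch covers only a common subset of Kubernetes quantity suffixes; the function names are hypothetical.

```python
# Binary (power-of-two) and decimal suffixes; a subset of what Kubernetes accepts.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_to_millicores(quantity: str) -> int:
    """'100m' -> 100, '1' -> 1000, '0.5' -> 500."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def memory_to_bytes(quantity: str) -> int:
    """'128Mi' -> 134217728, '512M' -> 512000000, '1024' -> 1024."""
    for suffixes in (_BINARY, _DECIMAL):
        for suffix, factor in suffixes.items():
            if quantity.endswith(suffix):
                return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


print(cpu_to_millicores("100m"), memory_to_bytes("128Mi"))
```

Mi is 2^20 bytes while M is 10^6 bytes, so 128Mi is about 5% larger than 128M.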
DevOps / Platform Cheat Sheet
RBAC, Helm, Kustomize, GitOps, node ops, networking, observability. Chapters: RBAC, Helm, GitOps, Production Ops.
RBAC — audit & grant commands
Command | Purpose
kubectl auth can-i create pods | Check your permissions
kubectl auth can-i create pods --as=system:serviceaccount:ns:sa | Check SA permissions
kubectl auth can-i --list --as=alice@corp.com | List all allowed verbs
kubectl auth can-i get secrets -n prod --as=alice --as-group=dev-team | Check group access (--as-group must be combined with --as)
kubectl create rolebinding dev-edit --clusterrole=edit --user=alice -n dev | Grant namespace edit
kubectl create clusterrolebinding alice-admin --clusterrole=cluster-admin --user=alice | Cluster admin (avoid)
kubectl get role,rolebinding,clusterrole,clusterrolebinding -A | Audit bindings
# Minimal Role + RoleBinding
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n dev
kubectl create rolebinding read-pods --role=pod-reader \
  --serviceaccount=dev:ci-runner -n dev
# OpenShift (oc) variant
oc adm policy add-role-to-user edit alice -n dev
oc adm policy add-role-to-group cluster-reader ops-team
oc adm policy who-can get secrets -n prod
oc adm policy add-scc-to-user anyuid -z legacy-sa -n legacy
oc adm policy add-scc-to-user privileged -z operator-sa -n operators # -z (service account) belongs to add-scc-to-user
Pod Security Admission (PSA) namespace labels
# Enforce restricted profile on namespace
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
kubectl label namespace prod \
  pod-security.kubernetes.io/enforce=restricted --overwrite
NetworkPolicy patterns
# Default deny all ingress in namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes: [Ingress]
---
# Allow ingress from same namespace + from ingress-nginx
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-allow
spec:
  podSelector:
    matchLabels: { app: web }
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector: {}
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: ingress-nginx }
      ports: [{ protocol: TCP, port: 8080 }]
  egress:
    - to:
        - podSelector:
            matchLabels: { app: postgres }
      ports: [{ protocol: TCP, port: 5432 }]
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels: { k8s-app: kube-dns }
      ports: [{ protocol: UDP, port: 53 }, { protocol: TCP, port: 53 }] # DNS falls back to TCP for large responses
Helm — install, upgrade, lifecycle
Command | Purpose
helm repo add bitnami https://charts.bitnami.com/bitnami | Add chart repository
helm repo update | Refresh chart index
helm search repo redis | Find charts
helm show values bitnami/redis | Default values
helm install redis bitnami/redis -f values.yaml -n cache --create-namespace | Install release
helm upgrade redis bitnami/redis -f values.yaml --atomic --timeout 10m | Upgrade with rollback on fail
helm rollback redis 2 | Rollback to revision
helm history redis | Revision list
helm get values redis -a | Deployed values
helm template myapp ./chart -f prod.yaml | Render without install
helm uninstall redis -n cache | Remove release
helm lint ./chart | Validate chart
helm plugin install https://github.com/databus23/helm-diff | Diff before upgrade
# Production upgrade pattern
helm diff upgrade redis bitnami/redis -f values-prod.yaml -n cache
helm upgrade redis bitnami/redis -f values-prod.yaml -n cache \
--atomic --cleanup-on-fail --timeout 15m --wait
# OCI registry charts
helm install myapp oci://registry.example.com/charts/myapp --version 1.2.3
Kustomize — base, overlays, patches
# Directory layout
# base/kustomization.yaml deployment.yaml service.yaml
# overlays/dev/kustomization.yaml
# overlays/prod/kustomization.yaml
# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources: [deployment.yaml, service.yaml]
commonLabels: { app.kubernetes.io/managed-by: kustomize }
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources: [../../base]
namespace: prod
namePrefix: prod-
replicas: [{ name: web, count: 5 }]
images:
  - name: myapp
    newName: registry.example.com/myapp
    newTag: v2.4.1
patches:
  - path: resources-patch.yaml
  - target: { kind: Deployment, name: web }
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 1Gi
configMapGenerator:
  - name: app-config
    literals: [ENV=prod]
secretGenerator:
  - name: app-secrets
    envs: [.env.prod]
kubectl apply -k overlays/prod
kustomize build overlays/prod | kubectl diff -f -
ArgoCD & Flux GitOps
Tool | Command | Purpose
ArgoCD | argocd app list | List applications
ArgoCD | argocd app sync myapp --prune | Sync + delete orphans
ArgoCD | argocd app diff myapp | Live vs git diff
ArgoCD | argocd app rollback myapp | Rollback deployment history
ArgoCD | argocd app set myapp --sync-policy automated --self-heal | Enable auto-sync
Flux | flux get kustomizations -A | Reconciliation status
Flux | flux reconcile source git my-repo | Force git pull
Flux | flux reconcile kustomization apps --with-source | Force full sync
Flux | flux suspend kustomization apps | Pause reconciliation
# ArgoCD Application CR (minimal)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/gitops.git
    path: apps/web/overlays/prod
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [CreateNamespace=true]
Node management — cordon, drain, uncordon
# Safe node maintenance sequence
kubectl cordon node-1 # no new pods
kubectl drain node-1 \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60 \
--timeout=15m
kubectl uncordon node-1
# Check node conditions
kubectl describe node node-1 | grep -A5 Conditions
kubectl get nodes -o 'custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].type,TAINTS:.spec.taints'
# Taint node for dedicated workload
kubectl taint nodes gpu-1 nvidia.com/gpu=true:NoSchedule
kubectl taint nodes gpu-1 nvidia.com/gpu:NoSchedule- # remove (trailing dash)
# OpenShift (oc) variant
oc adm cordon node-1
oc adm drain node-1 --ignore-daemonsets --delete-emptydir-data --force
oc adm uncordon node-1
oc get machineconfigpool # OCP node update pools
Rollouts, PDB & HPA (platform)
# PodDisruptionBudget: create before draining; kubectl drain evicts pods only while the budget allows it
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels: { app: web }
kubectl get pdb -A
kubectl rollout status deployment/web -w --timeout=5m
# Force restart all pods (new spec, same image)
kubectl rollout restart deployment/web
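For the web-pdb budget above, the number of voluntary evictions a drain may perform at once is the healthy pod count minus minAvailable, floored at zero. A Python sketch of that arithmetic follows, assuming an integer minAvailable (percentages are not handled); the function name is hypothetical.

```python
def disruptions_allowed(current_healthy: int, min_available: int) -> int:
    """Voluntary disruptions a PDB permits right now (integer minAvailable)."""
    return max(0, current_healthy - min_available)


# 3 healthy web pods with minAvailable: 2 -> one pod may be evicted at a time.
print(disruptions_allowed(current_healthy=3, min_available=2))
# Only 2 healthy pods -> drain has to wait until another pod becomes Ready.
print(disruptions_allowed(current_healthy=2, min_available=2))
```

The live value appears in the ALLOWED DISRUPTIONS column of kubectl get pdb.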
Storage — PVC, StorageClass, snapshots
kubectl get sc
kubectl get pvc,pv -A
kubectl describe pvc data-web-0 -n prod
# Expand PVC (StorageClass must allowVolumeExpansion: true)
kubectl patch pvc data-web-0 -n prod \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# VolumeSnapshot
kubectl get volumesnapshot -A
kubectl create -f snapshot.yaml
# OpenShift (oc) variant
oc get storageclass
oc describe pvc data-web-0
Event debugging & cluster triage
Command | Finds
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | Recent warnings cluster-wide
kubectl get pods -A --field-selector status.phase=Pending | Stuck scheduling
kubectl get pods -A | grep -E 'Error|CrashLoop|Evicted|OOM' | Unhealthy pods
kubectl top nodes | Node CPU/memory (metrics-server)
kubectl top pods -A --sort-by=memory | head -20 | Memory hogs
kubectl get endpoints,endpointslice -n prod | Empty endpoints = selector mismatch
kubectl describe quota,limitrange -n prod | Namespace resource caps
kubectl get --raw '/readyz?verbose' | API server health (quoted so the shell does not glob the ?)
kubectl get --raw /metrics | head | Raw API metrics (if enabled)
Tekton & OpenShift builds
# Tekton (generic K8s)
kubectl get task,pipeline,pipelinerun -n ci
tkn pipeline start build-deploy -p git-url=https://github.com/org/app -w name=shared-workspace,claimName=build-ws
tkn pipelinerun logs --last -f -n ci
# OpenShift builds (oc)
oc start-build web --from-dir=. --follow
oc logs -f bc/web
oc get builds
oc new-build nodejs~https://github.com/org/app.git
oc import-image nodejs:18 --confirm
cert-manager & External Secrets
# cert-manager Certificate
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls
spec:
  secretName: app-tls
  issuerRef: { name: letsencrypt-prod, kind: ClusterIssuer }
  dnsNames: [app.example.com]
kubectl get certificate,certificaterequest,clusterissuer -A
kubectl describe certificate app-tls -n prod
# External Secrets Operator
kubectl get externalsecret,secretstore -A
⚠️ Pitfall
Never run kubectl apply from CI to production—push to GitOps repo instead. Drift + no audit trail + race conditions with concurrent pipelines.
etcd — health, backup & restore
Operation | Command | Notes
Cluster health | etcdctl endpoint health --cluster | All members must be healthy
Leader check | etcdctl endpoint status -w table | Watch RAFT INDEX lag
Member list | etcdctl member list -w table | Odd count: 3 or 5
Snapshot | etcdctl snapshot save /backup/etcd.db | Stop writes not required (consistent)
Verify snapshot | etcdctl snapshot status /backup/etcd.db -w table | Check hash + revision
Defrag | etcdctl defrag --cluster | Reclaim space; brief latency spike
Alarm list | etcdctl alarm list | NOSPACE = quota exceeded
# Snapshot (on control plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F-%H%M).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Restore procedure (disaster — simplified)
# 1. Stop kube-apiserver on all control plane nodes
# 2. ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd.db \
# --data-dir=/var/lib/etcd-restored --name=etcd-0 \
# --initial-cluster=etcd-0=https://10.0.0.1:2380 \
# --initial-advertise-peer-urls=https://10.0.0.1:2380
# 3. Update etcd manifest data-dir; restart etcd → apiserver
# OpenShift: run the bundled backup script on one control plane node
oc debug node/<control-plane-node> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup
Scheduling decision matrix
Requirement | Primitive | Effect | Avoid
Dedicated GPU/spot nodes | Taint + Toleration | Repel/allow specific nodes | Hardcoded nodeName
Run on SSD nodes only | nodeSelector or nodeAffinity | Hard/soft label match | Taints for simple labels
Co-locate app + sidecar cache | podAffinity | Same node/zone | Shared emptyDir across nodes
One replica per node (HA) | podAntiAffinity or topologySpread | Spread across nodes | replicas > node count
Even spread across AZs | topologySpreadConstraints | maxSkew across zones | Manual per-zone deploys
Priority during eviction | priorityClassName | High priority survives pressure | BestEffort for critical apps
Preemption | PriorityClass value | Higher evicts lower | Same priority for all
# Taint GPU node
kubectl taint nodes gpu-1 nvidia.com/gpu=present:NoSchedule
# Toleration in pod spec
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
# Topology spread across zones
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: { app: web }
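With maxSkew: 1 and DoNotSchedule, a new pod fits a zone only if that zone's matching-pod count after placement exceeds the smallest zone count by at most maxSkew. The Python sketch below illustrates that check; the function name is hypothetical, and it assumes every eligible zone is listed in the input, including zones with zero matching pods.

```python
from typing import Dict


def can_schedule(zone_counts: Dict[str, int], candidate_zone: str,
                 max_skew: int = 1) -> bool:
    """True if placing one more matching pod in candidate_zone keeps skew <= max_skew."""
    global_min = min(zone_counts.values())
    skew_after = zone_counts[candidate_zone] + 1 - global_min
    return skew_after <= max_skew


# Matching app=web pods per zone (illustrative numbers).
counts = {"zone-a": 2, "zone-b": 1, "zone-c": 1}
print(can_schedule(counts, "zone-a"))  # skew would become 2
print(can_schedule(counts, "zone-b"))  # skew would become 1
```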
Taints, effects & use cases
Effect | New pods | Existing pods | Use case
NoSchedule | Blocked unless tolerated | Untouched | GPU, dedicated pools
PreferNoSchedule | Soft avoid | Untouched | Prefer on-demand over spot
NoExecute | Blocked | Evicted unless tolerated | Node maintenance, NotReady
Built-in taints: node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute (pods receive a default 300s toleration for both, so eviction starts after 5 min), node.kubernetes.io/disk-pressure:NoSchedule.
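Whether a toleration matches a taint comes down to three comparisons: key, operator/value, and effect. The Python sketch below follows the matching rules from the Kubernetes documentation; the function name and dict-based input are hypothetical, and tolerationSeconds is ignored.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """True if the toleration matches the taint (key, operator/value, effect)."""
    # An empty toleration effect matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if not key:
        # An empty key is only meaningful with Exists and matches every taint key.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True  # value is ignored
    return toleration.get("value", "") == taint.get("value", "")


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Equal",
                  "value": "present", "effect": "NoSchedule"}
print(tolerates(gpu_toleration, gpu_taint))
```

A value mismatch is a common cause of Pending pods on tainted nodes, for example a taint created with value true and a toleration written with value present.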
Storage — access modes & class comparison
Access mode | Abbrev | Meaning | Typical backend
ReadWriteOnce | RWO | One node read-write | EBS, Azure Disk, Ceph RBD
ReadOnlyMany | ROX | Many nodes read-only | NFS export, Config volumes
ReadWriteMany | RWX | Many nodes read-write | EFS, CephFS, Azure Files, NFS
ReadWriteOncePod | RWOP | Single pod exclusive (1.22+) | Block volumes needing true exclusive
Backend | Provisioner | Access | ReclaimPolicy (for databases) | Binding mode
AWS EBS gp3 | ebs.csi.aws.com | RWO | Retain | WaitForFirstConsumer
GCE PD | pd.csi.storage.gke.io | RWO | Retain | WaitForFirstConsumer
Azure Disk | disk.csi.azure.com | RWO | Retain | WaitForFirstConsumer
AWS EFS | efs.csi.aws.com | RWX | Retain | Immediate
Rook-Ceph RBD | rook-ceph.rbd.csi.ceph.com | RWO | Retain | WaitForFirstConsumer
OCP ODF RBD | openshift-storage.rbd.csi.ceph.com (StorageClass ocs-storagecluster-ceph-rbd) | RWO | Retain | WaitForFirstConsumer
OCP ODF CephFS | openshift-storage.cephfs.csi.ceph.com (StorageClass ocs-storagecluster-cephfs) | RWX | Retain | Immediate
Longhorn | driver.longhorn.io | RWO/RWX | Retain | Immediate
OpenShift SCC reference
SCC | runAsUser | Capabilities | When to grant
restricted-v2 | Random non-root | Dropped | Default; no action needed
nonroot-v2 | Specific non-root UID | Dropped | App requires fixed non-root UID
anyuid | Any UID including root | Dropped | Legacy containers as root
privileged | Any | All | Operators/system only; never apps
hostnetwork-v2 | Non-root | Host network | DaemonSets needing hostNetwork
hostmount-anyuid | Any | Host path mounts | Legacy hostPath requirements
# Diagnose SCC failure
oc describe pod failing-pod | tail -20
oc get scc
oc adm policy who-can use scc privileged
# Grant SCC to ServiceAccount (minimal privilege)
oc adm policy add-scc-to-user nonroot-v2 -z myapp-sa -n prod
oc adm policy add-scc-to-user anyuid -z legacy-sa -n legacy
Cluster upgrade checklist
# | Step | Command / action
1 | Check deprecated APIs | pluto detect-files -d manifests/
2 | Verify PDBs exist | kubectl get pdb -A
3 | Confirm node capacity | Enough spare nodes for rolling drain
4 | Backup etcd | etcdctl snapshot save, or cluster-backup.sh on an OCP control plane node
5 | Velero backup | velero backup create pre-upgrade-$(date +%F)
6 | Upgrade control plane | kubeadm / managed service / oc adm upgrade
7 | Upgrade workers | One node at a time; drain → upgrade → uncordon
8 | Verify operators | oc get clusteroperators / CRD health
9 | Smoke test | kubectl get --raw '/readyz?verbose'
10 | Validate workloads | Critical app health checks + synthetic probes
# OpenShift upgrade
oc adm upgrade
oc adm upgrade --to=4.16.8
oc adm upgrade --to-latest=true
oc get clusterversion
oc get mcp -w # MachineConfigPool progress
oc get co -w # ClusterOperators
# K8s version policy: stay within N-2 of latest minor
Velero — install, backup, restore, schedule
# Install (AWS example)
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.9.0 \
--bucket k8s-backups-prod \
--secret-file ./credentials-velero \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--use-volume-snapshots=true \
--use-node-agent=false
# Backup
velero backup create daily-$(date +%F) \
--include-namespaces prod,staging \
--exclude-resources events,events.events.k8s.io \
--ttl 720h0m0s
velero backup describe daily-2026-06-05 --details
velero backup logs daily-2026-06-05
# Schedule
velero schedule create nightly \
--schedule="0 2 * * *" \
--include-namespaces prod \
--ttl 168h
# Restore
velero restore create restore-$(date +%F) \
--from-backup daily-2026-06-05 \
--include-namespaces prod
velero restore describe restore-2026-06-05
kubectl get restores.velero.io -n velero
Capacity planning & control plane sizing
Cluster scale | Control plane (each) | etcd | Notes
< 100 nodes | 2 CPU / 8 GB | 2 CPU / 8 GB SSD | 3 CP nodes for HA
100–500 nodes | 4 CPU / 16 GB | 4 CPU / 8 GB NVMe | Dedicated etcd disk <10ms latency
500–3000 nodes | 8 CPU / 32 GB | 8 CPU / 16 GB NVMe | Watch API 429 rate limits
3000+ nodes | 16+ CPU / 64 GB | 8 CPU / 32 GB NVMe | Consider etcd defrag schedule
Rule | Guidance
Node memory headroom | Sum of pod memory requests + 1–2 GB OS overhead per node
Cluster autoscaler buffer | 10–20% spare capacity for burst scheduling
etcd quota | Default 2 GB; monitor etcd_mvcc_db_total_size_in_bytes
Max pods per node | Default 110; reduce if dense memory workloads
API server throttling | 429 responses → increase priority/fairness or scale CP
Production failure matrix
Symptom | Root causes | Fix commands
ImagePullBackOff | Wrong tag, private registry, missing pull secret | describe pod; create docker-registry secret
CrashLoopBackOff | App error, bad CMD, missing config, probe too aggressive | logs --previous; relax startupProbe
Pending | Insufficient CPU/mem, taints, affinity, unbound PVC | describe pod; get nodes; get pvc
OOMKilled | Memory limit too low | Raise limit; profile heap; VPA recommendation
Evicted | Node disk/memory pressure | Clean images; add nodes; fix disk usage
Service unreachable | Selector mismatch, empty endpoints, NetworkPolicy | get endpoints; test from netshoot pod
SCC violation (OCP) | Root user, missing capability, hostPath | oc describe pod; grant correct SCC to SA
etcd NOSPACE | Quota exceeded | Defrag; compact; raise quota; restore from snapshot
Multi-cluster & service mesh decisions
Pattern | Tool | When
Hub-spoke fleet management | Red Hat ACM + Hive | Policy, apps, upgrades across 10+ clusters
Hosted control planes | HyperShift / ROSA HCP | Fast cheap cluster provisioning
Declarative cluster lifecycle | Cluster API (CAPI) | GitOps-native cluster create/destroy
App delivery multi-cluster | ArgoCD ApplicationSet | Same app to clusters by label
mTLS + traffic management | Istio / OpenShift Service Mesh | Microservices observability + zero-trust
Lightweight mTLS | Linkerd | Simpler mesh, lower overhead
Disaster recovery — RTO/RPO targets
Strategy | RPO | RTO | Complexity
etcd snapshot only | Hours (backup interval) | Hours | Low
Velero + volume snapshots | Minutes–hours | Hours | Medium
Active-passive cluster | Near-zero (async replication) | Minutes | High
Active-active multi-region | Near-zero | Minutes | Very high
oc adm — platform administration
oc adm top nodes
oc adm top pods -A
oc adm groups sync --sync-config=ldap-sync.yaml --confirm
oc adm node-logs --role=master --path=openshift-apiserver/
oc adm must-gather # support bundle
oc adm release info 4.16.8 # release metadata
oc adm upgrade --allow-explicit-upgrade --to=4.16.8
oc adm pod-network join-projects --to=<project> --selector env=prod # join pod networks (OpenShift SDN multitenant mode only)
oc adm policy scc-subject-review -z myapp-sa -n prod -f pod.yaml # which SCC admits this pod spec for the SA
oc get clusteroperators # platform health
oc get co -o 'custom-columns=NAME:.metadata.name,AVAILABLE:.status.conditions[?(@.type=="Available")].status'
oc get nodes -o wide
oc debug node/worker-1 -- chroot /host bash # node shell (OCP 4)
GitOps & platform architecture checklist
# | Decision | Recommendation
1 | Package manager | Helm for third-party; Kustomize for in-house apps
2 | GitOps engine | ArgoCD (UI + ApplicationSet) or Flux (modular)
3 | Secrets in git | Sealed Secrets, External Secrets, or SOPS; never plain
4 | Ingress | Gateway API (future); Ingress + cert-manager today
5 | Network policy | Default-deny + explicit allow per app
6 | Pod security | PSA restricted (K8s) + SCC restricted-v2 (OCP)
7 | Observability | Prometheus + Grafana + Loki + OTel (or OCP built-in monitoring)
8 | Backup | etcd snapshot + Velero for namespace/PV DR
9 | Multi-tenancy | Namespace isolation + RBAC + NetworkPolicy + quotas
10 | Cluster upgrades | N-2 policy; PDB; blue-green cluster for zero-downtime migrations
🎯 Interview Tip
System design whiteboard: draw control plane (API → etcd), explain scheduler filter/score/bind, mention PDB + topology spread for HA, etcd backup for RPO, and GitOps for drift control. For OCP, add SCC + Routes + Operators.
⚖️ Trade-off
Retain reclaim policy protects database PVs but requires manual cleanup — automate with Velero TTL policies or a PV janitor CronJob.