Cheat Sheets

Three exhaustive quick references by persona: daily kubectl/oc workflows, platform operations, and cluster architecture decisions.

Personas: developer, devops, architect. Covers K8s 1.28+, OCP 4.16+, CKA/CKAD/CKS.

Developer Cheat Sheet

Daily commands, pod lifecycle, YAML templates, debugging, networking. Chapters: Workloads, Configuration, Networking.

Context, namespace & discovery

| Task | Command | Notes |
|---|---|---|
| List contexts | `kubectl config get-contexts` | `*` marks the current context |
| Switch context | `kubectl config use-context prod-east` | Context = cluster + user + namespace |
| Current context | `kubectl config current-context` | |
| Set default namespace | `kubectl config set-context --current --namespace=dev` | Avoids `-n` on every command |
| List all namespaces | `kubectl get ns` | `oc get projects` on OCP |
| Explain API field | `kubectl explain pod.spec.containers.livenessProbe` | Add `--recursive` for the full field tree |
| API resources | `kubectl api-resources` | Short names: po, deploy, svc |
| Watch resource | `kubectl get pods -w` | `--watch-only` skips initial list |

kubectl get — common flags

| Flag | Purpose | Example |
|---|---|---|
| `-n` | Namespace | `kubectl get pods -n prod` |
| `-A` / `--all-namespaces` | All namespaces | `kubectl get pods -A` |
| `-l` | Label selector | `-l app=web,env=prod` |
| `-o wide` | Extra columns (node, IP) | `kubectl get pods -o wide` |
| `-o yaml` / `-o json` | Full manifest | Pipe to grep / save template |
| `-o name` | Resource name only | For loops: `for p in $(kubectl get pods -o name); do …` |
| `--sort-by` | Sort output | `--sort-by=.metadata.creationTimestamp` |
| `--field-selector` | Filter by field | `status.phase=Pending` |
| `--show-labels` | Print labels column | Debug selector mismatches |
| `-w` | Watch stream | Live deploy monitoring |

Apply, create & imperative shortcuts

| Command | Purpose | Notes |
|---|---|---|
| `kubectl apply -f manifest.yaml` | Declarative create/update | Preferred for GitOps |
| `kubectl apply -f dir/` | Apply all YAML in directory | Add `-R` to recurse into subdirectories |
| `kubectl apply -k overlays/dev` | Kustomize overlay | Built-in Kustomize |
| `kubectl apply -f manifest.yaml --dry-run=client -o yaml` | Client-side dry run | Prints the manifest without sending it to the server |
| `kubectl apply --server-side -f manifest.yaml` | Server-side apply | Field ownership tracking |
| `kubectl diff -f manifest.yaml` | Preview changes | CI gate before apply |
| `kubectl create deployment web --image=nginx:1.25` | Imperative deploy | Quick test; export to YAML after |
| `kubectl expose deployment web --port=80` | Create Service | ClusterIP by default |
| `kubectl scale deployment web --replicas=5` | Scale replicas | An HPA, if present, overrides manual scale |
| `kubectl delete -f manifest.yaml` | Delete by manifest | Add `--ignore-not-found` in scripts |
| `kubectl replace --force -f pod.yaml` | Delete + recreate | Disruptive; avoid in prod |

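# Export running deployment to YAML (strip cluster metadata)
# kubectl neat is a third-party plugin (installable via krew)
kubectl get deploy web -o yaml | kubectl neat > web.yaml

# Generate YAML without creating (exam pattern)
kubectl create deployment web --image=nginx --dry-run=client -o yaml > deploy.yaml

# OpenShift (oc) equivalents
# oc new-app creates a Deployment by default; use dc/web only for legacy DeploymentConfigs
oc new-app --name=web nginx:1.25
oc expose service/web
oc get route web -o jsonpath='{.spec.host}'
oc status
oc logs -f deployment/web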

Pod lifecycle, phases & conditions

| Status / Phase | Meaning | First diagnostic step |
|---|---|---|
| Pending | Not yet scheduled, or images still pulling | `kubectl describe pod` → Events; check scheduler/taints/PVC |
| ContainerCreating | Scheduled; pulling image / mounting volumes | Events; describe volume mounts |
| Running | At least one container running | Check readiness if not receiving traffic |
| Succeeded | All containers exited 0 (Job) | Expected for batch; check Job status |
| Failed | Container exited non-zero | `kubectl logs --previous` |
| CrashLoopBackOff | Container crash → backoff retry | `logs --previous`; check exit code, probes, config |
| ImagePullBackOff | Cannot pull image | Verify tag, registry auth, imagePullSecrets |
| ErrImagePull | Immediate pull failure | Same as ImagePullBackOff |
| OOMKilled | Memory limit exceeded | Raise `limits.memory` or fix leak |
| Evicted | Node pressure eviction | `kubectl describe node`; disk/memory pressure |
| Terminating | Deletion in progress | Stuck? Check finalizers, PDB, grace period |

| Condition | True means |
|---|---|
| PodScheduled | Assigned to a node |
| Initialized | Init containers completed |
| ContainersReady | All containers passed readiness |
| Ready | Pod eligible for Service endpoints |

Probes — liveness, readiness, startup

| Probe | Purpose | Failure action | Typical settings |
|---|---|---|---|
| readinessProbe | Ready for traffic? | Removed from Service endpoints | periodSeconds: 10, failureThreshold: 3 |
| livenessProbe | Process alive? | Container restart | Less aggressive than readiness; avoid DB checks |
| startupProbe | Slow-start apps | Blocks liveness until success | failureThreshold: 30 for JVM warm-up |

readinessProbe:
  httpGet: { path: /health/ready, port: 8080, scheme: HTTP }
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
livenessProbe:
  httpGet: { path: /health/live, port: 8080 }
  periodSeconds: 20
  failureThreshold: 3
startupProbe:
  httpGet: { path: /health/ready, port: 8080 }
  periodSeconds: 5
  failureThreshold: 30   # 150s startup budget
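
The "150s startup budget" comment is periodSeconds multiplied by failureThreshold. A minimal sketch of that arithmetic follows; `probe_budget` is a hypothetical helper (not a kubectl feature), and it ignores timeoutSeconds, so treat the result as an approximation.

```shell
# probe_budget PERIOD_SECONDS FAILURE_THRESHOLD [INITIAL_DELAY_SECONDS]
# Approximate worst-case seconds before the probe is declared failed:
# initialDelaySeconds + periodSeconds * failureThreshold.
probe_budget() {
  period=$1
  threshold=$2
  initial_delay=${3:-0}
  echo $(( initial_delay + period * threshold ))
}

probe_budget 5 30      # startupProbe above: periodSeconds 5, failureThreshold 30
probe_budget 10 3 5    # readinessProbe above, counted from container start
```

If the app can take longer than the printed budget to boot, raise failureThreshold on the startupProbe rather than loosening the livenessProbe.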

Resources, QoS & limits

| QoS class | Condition | Eviction order |
|---|---|---|
| Guaranteed | requests == limits for all containers (cpu + memory) | Last evicted |
| Burstable | requests set, limits differ or partial | Middle |
| BestEffort | No requests or limits | First evicted |

| Field | Used for | Production guidance |
|---|---|---|
| `requests.cpu` | Scheduling + HPA % | Set based on p95 usage |
| `limits.cpu` | CPU throttling cap | Omit or set high for latency-sensitive apps |
| `requests.memory` | Scheduling | Set == limits for Guaranteed QoS on stateful |
| `limits.memory` | Hard cap → OOMKill | Always set; prevents node-wide OOM |

resources:
  requests: { cpu: 250m, memory: 512Mi }
  limits:   { cpu: "1", memory: 512Mi }   # memory request == limit, but the pod is still Burstable because cpu differs

kubectl top pod web-abc -n prod
kubectl describe pod web-abc | grep -A5 "Limits\|Requests\|QoS"
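
The QoS rules in the table can be checked offline. The sketch below is a hypothetical helper for a single-container pod; it compares values as strings, so it assumes both sides use the same notation (500m vs 500m, not 0.5 vs 500m).

```shell
# qos_class REQ_CPU LIM_CPU REQ_MEM LIM_MEM   (pass "" for an unset field)
qos_class() {
  req_cpu=$1; lim_cpu=$2; req_mem=$3; lim_mem=$4
  # Nothing set at all: BestEffort.
  if [ -z "$req_cpu$lim_cpu$req_mem$lim_mem" ]; then
    echo BestEffort
    return
  fi
  # An unset request defaults to the limit when a limit is present.
  req_cpu=${req_cpu:-$lim_cpu}
  req_mem=${req_mem:-$lim_mem}
  if [ -n "$lim_cpu" ] && [ -n "$lim_mem" ] &&
     [ "$req_cpu" = "$lim_cpu" ] && [ "$req_mem" = "$lim_mem" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 250m 1 512Mi 512Mi       # the resources snippet above
qos_class 500m 500m 512Mi 512Mi    # requests == limits for cpu and memory
qos_class "" "" "" ""              # nothing set
```

On a live cluster the authoritative answer is the pod's `.status.qosClass` field.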

Logs, exec, debug & copy

| Command | Purpose |
|---|---|
| `kubectl logs -f deploy/web -c app --tail=200` | Stream logs from deployment container |
| `kubectl logs pod/web-abc --previous` | Logs from crashed container instance |
| `kubectl logs -l app=web --all-containers --prefix` | All pods matching label |
| `kubectl logs pod/web-abc --since=10m --timestamps` | Time-bounded logs |
| `kubectl exec -it pod/web-abc -c app -- sh` | Interactive shell |
| `kubectl exec deploy/web -- curl -s localhost:8080/health` | One-off command |
| `kubectl debug pod/web-abc -it --image=busybox --target=app` | Ephemeral debug container (K8s 1.23+) |
| `kubectl debug node/node-1 -it --image=ubuntu` | Node shell via privileged debug pod |
| `kubectl cp web-abc:/var/log/app.log ./app.log` | Copy files from pod (syntax is `[namespace/]pod:path`; needs `tar` in the container) |
| `kubectl attach pod/web-abc -c app -i -t` | Attach to running process stdin |
| `kubectl get events -n dev --sort-by=.lastTimestamp` | Recent events in the namespace |

# Debug toolbox pod (curl, dig, tcpdump, iperf)
kubectl run netshoot --rm -it --restart=Never \
  --image=nicolaka/netshoot -- bash

# Test service DNS from inside cluster
kubectl run tmp --rm -it --restart=Never --image=busybox -- \
  wget -qO- http://web.dev.svc.cluster.local/health

# OpenShift (oc) equivalents
oc logs -f deployment/web
oc rsh pod/web-abc
oc exec pod/web-abc -- curl -s localhost:8080
oc debug pod/web-abc --image=registry.redhat.io/ubi9/ubi-minimal   # interactive shell by default

Port-forward, proxy & local access

| Command | Maps | Use when |
|---|---|---|
| `kubectl port-forward svc/web 8080:80` | Service → local | Test app without Ingress (forwards to one backing pod) |
| `kubectl port-forward pod/web-abc 8080:8080` | Pod → local | Debug specific pod instance |
| `kubectl port-forward deploy/web 8080:8080` | Deployment → local | Any instance will do; kubectl picks one pod and does not load-balance |
| `kubectl port-forward -n kube-system svc/prometheus 9090:9090` | Cross-namespace | Local Grafana/Prometheus access |
| `kubectl proxy --port=8001` | API server proxy | `localhost:8001/api/v1/namespaces` |

DNS & service discovery

| Pattern | Resolves to |
|---|---|
| `<service>` | Same namespace Service ClusterIP |
| `<service>.<namespace>` | Cross-namespace Service |
| `<service>.<namespace>.svc.cluster.local` | FQDN (always works) |
| `<pod-ip-dashed>.<namespace>.pod.cluster.local` | Pod IP directly (pod A record) |
| `<statefulset>-0.<headless-svc>.<ns>.svc.cluster.local` | StatefulSet pod DNS (requires a headless Service) |

# Inside pod — check DNS
cat /etc/resolv.conf          # nameserver = CoreDNS (10.96.0.10 typical)
nslookup web
dig +short web.dev.svc.cluster.local

Deployment + Service YAML (production baseline)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels: { app: web, app.kubernetes.io/name: web }
spec:
  replicas: 3
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxUnavailable: 0, maxSurge: 1 }
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      serviceAccountName: web
      securityContext:
        runAsNonRoot: true
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: app
          image: nginx:1.25-alpine@sha256:…
          ports: [{ name: http, containerPort: 8080 }]
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits:   { cpu: 500m, memory: 256Mi }
          readinessProbe:
            httpGet: { path: /health/ready, port: http }
            periodSeconds: 10
          livenessProbe:
            httpGet: { path: /health/live, port: http }
            periodSeconds: 20
          securityContext:
            allowPrivilegeEscalation: false
            capabilities: { drop: [ALL] }
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP
  selector: { app: web }
  ports:
    - name: http
      port: 80
      targetPort: http

ConfigMap, Secret & Ingress YAML

# ConfigMap — file + key
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  app.properties: |
    server.port=8080
    feature.flags=beta
  LOG_LEVEL: info
---
# Secret — Opaque (base64 values; enable encryption at rest in prod)
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
stringData:                    # prefer stringData over data in git templates
  DB_PASSWORD: changeme
---
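# Pod spec fragment, not a standalone manifest: envFrom, env and volumeMounts go under a container; volumes go under the pod spec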
envFrom:
  - configMapRef: { name: app-config }
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef: { name: app-secrets, key: DB_PASSWORD }
volumeMounts:
  - name: config-vol
    mountPath: /etc/app
    readOnly: true
volumes:
  - name: config-vol
    configMap:
      name: app-config
---
# Ingress (requires ingress controller)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls: [{ hosts: [app.example.com], secretName: app-tls }]
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: web, port: { number: 80 } }

Job, CronJob & HPA snippets

# Job — run to completion
apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-db
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: myapp:migrate-v3
          command: ["./migrate.sh"]
---
# CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: myapp:report
---
# HPA — requires metrics-server + CPU requests set
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
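
For capacity estimates it helps to know how the HPA turns utilization into a replica count: desired = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to minReplicas and maxReplicas. The sketch below is a hypothetical helper that reproduces only that core formula; it leaves out the controller's tolerance band and stabilization windows.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT MIN MAX
hpa_desired() {
  current=$1; util=$2; target=$3; min=$4; max=$5
  # Integer ceiling of current * util / target.
  desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

hpa_desired 4 90 70 2 10    # 4 pods at 90% against the 70% target above
hpa_desired 4 20 70 2 10    # low load: floor is minReplicas
hpa_desired 8 200 70 2 10   # spike: capped at maxReplicas
```

Utilization is measured against `requests.cpu`, which is why the HPA needs CPU requests set on every container.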

Rollout commands (developer view)

kubectl rollout status deployment/web
kubectl rollout history deployment/web
kubectl rollout undo deployment/web
kubectl rollout undo deployment/web --to-revision=3
kubectl rollout pause deployment/web    # batch manifest changes
kubectl rollout resume deployment/web
kubectl set image deployment/web app=myreg/web:v2.4.1
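kubectl set env deployment/web FEATURE_X=true

# OpenShift (oc) equivalents (use dc/web only for legacy DeploymentConfigs)
oc rollout status deployment/web
oc rollout history deployment/web
oc set env deployment/web FEATURE_X=true
oc tag myapp:v2.4.1 web:latest          # ImageStream trigger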
💡 Pro Tip

kubectl explain + --dry-run=client -o yaml covers most CKAD exam YAML generation—memorize probe fields and resource units (100m, 128Mi).
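
A quick way to drill the resource units is to convert them by hand. The converters below are a hypothetical sketch: they handle whole cores and millicores for CPU, and the Ki/Mi/Gi (binary) and M/G (decimal) suffixes for memory, not every quantity form Kubernetes accepts.

```shell
# cpu_to_millicores QUANTITY   e.g. 100m or 2 (fractional cores like 0.5 are not handled)
cpu_to_millicores() {
  case $1 in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# mem_to_bytes QUANTITY   e.g. 128Mi, 1Gi, 500M
mem_to_bytes() {
  case $1 in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *M)  echo $(( ${1%M} * 1000 * 1000 )) ;;
    *G)  echo $(( ${1%G} * 1000 * 1000 * 1000 )) ;;
    *)   echo "$1" ;;   # already plain bytes
  esac
}

cpu_to_millicores 100m
cpu_to_millicores 2
mem_to_bytes 128Mi
mem_to_bytes 128M
```

The last two lines show the common exam trap: 128Mi and 128M differ by about 5%.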

DevOps / Platform Cheat Sheet

RBAC, Helm, Kustomize, GitOps, node ops, networking, observability. Chapters: RBAC, Helm, GitOps, Production Ops.

RBAC — audit & grant commands

| Command | Purpose |
|---|---|
| `kubectl auth can-i create pods` | Check your permissions |
| `kubectl auth can-i create pods --as=system:serviceaccount:ns:sa` | Check SA permissions |
| `kubectl auth can-i --list --as=alice@corp.com` | List all allowed verbs |
| `kubectl auth can-i get secrets -n prod --as=alice --as-group=dev-team` | Check group access (`--as-group` must be combined with `--as`) |
| `kubectl create rolebinding dev-edit --clusterrole=edit --user=alice -n dev` | Grant namespace edit |
| `kubectl create clusterrolebinding alice-admin --clusterrole=cluster-admin --user=alice` | Cluster admin (avoid) |
| `kubectl get role,rolebinding,clusterrole,clusterrolebinding -A` | Audit bindings |

# Minimal Role + RoleBinding
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n dev
kubectl create rolebinding read-pods --role=pod-reader \
  --serviceaccount=dev:ci-runner -n dev

# OpenShift (oc) equivalents
oc adm policy add-role-to-user edit alice -n dev
oc adm policy add-cluster-role-to-group cluster-reader ops-team
oc adm policy who-can get secrets -n prod
oc adm policy add-scc-to-user anyuid -z legacy-sa -n legacy
oc adm policy add-scc-to-user privileged -z operator-sa -n operators   # -z (ServiceAccount) only works with add-scc-to-user

Pod Security Admission (PSA) namespace labels

# Enforce restricted profile on namespace
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

kubectl label namespace prod \
  pod-security.kubernetes.io/enforce=restricted --overwrite

NetworkPolicy patterns

# Default deny all ingress in namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes: [Ingress]
---
# Allow ingress from same namespace + from ingress-nginx; restrict egress to postgres and cluster DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-allow
spec:
  podSelector:
    matchLabels: { app: web }
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector: {}
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: ingress-nginx }
      ports: [{ protocol: TCP, port: 8080 }]
  egress:
    - to:
        - podSelector:
            matchLabels: { app: postgres }
      ports: [{ protocol: TCP, port: 5432 }]
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels: { k8s-app: kube-dns }
      ports: [{ protocol: UDP, port: 53 }, { protocol: TCP, port: 53 }]   # DNS falls back to TCP for large responses

Helm — install, upgrade, lifecycle

| Command | Purpose |
|---|---|
| `helm repo add bitnami https://charts.bitnami.com/bitnami` | Add chart repository |
| `helm repo update` | Refresh chart index |
| `helm search repo redis` | Find charts |
| `helm show values bitnami/redis` | Default values |
| `helm install redis bitnami/redis -f values.yaml -n cache --create-namespace` | Install release |
| `helm upgrade redis bitnami/redis -f values.yaml -n cache --atomic --timeout 10m` | Upgrade with rollback on fail |
| `helm rollback redis 2 -n cache` | Rollback to revision |
| `helm history redis -n cache` | Revision list |
| `helm get values redis -a -n cache` | Deployed values (`-a` includes chart defaults) |
| `helm template myapp ./chart -f prod.yaml` | Render without install |
| `helm uninstall redis -n cache` | Remove release |
| `helm lint ./chart` | Validate chart |
| `helm plugin install https://github.com/databus23/helm-diff` | Diff before upgrade |

# Production upgrade pattern
helm diff upgrade redis bitnami/redis -f values-prod.yaml -n cache
helm upgrade redis bitnami/redis -f values-prod.yaml -n cache \
  --atomic --cleanup-on-fail --timeout 15m --wait

# OCI registry charts
helm install myapp oci://registry.example.com/charts/myapp --version 1.2.3

Kustomize — base, overlays, patches

# Directory layout
# base/kustomization.yaml  deployment.yaml  service.yaml
# overlays/dev/kustomization.yaml
# overlays/prod/kustomization.yaml

# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources: [deployment.yaml, service.yaml]
commonLabels: { app.kubernetes.io/managed-by: kustomize }

# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources: [../../base]
namespace: prod
namePrefix: prod-
replicas: [{ name: web, count: 5 }]
images:
  - name: myapp
    newName: registry.example.com/myapp
    newTag: v2.4.1
patches:
  - path: resources-patch.yaml
  - target: { kind: Deployment, name: web }
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 1Gi
configMapGenerator:
  - name: app-config
    literals: [ENV=prod]
secretGenerator:
  - name: app-secrets
    envs: [.env.prod]

kubectl apply -k overlays/prod
kustomize build overlays/prod | kubectl diff -f -

ArgoCD & Flux GitOps

| Tool | Command | Purpose |
|---|---|---|
| ArgoCD | `argocd app list` | List applications |
| ArgoCD | `argocd app sync myapp --prune` | Sync + delete orphans |
| ArgoCD | `argocd app diff myapp` | Live vs git diff |
| ArgoCD | `argocd app rollback myapp` | Roll back to a previous entry in the deployment history |
| ArgoCD | `argocd app set myapp --sync-policy automated --self-heal` | Enable auto-sync |
| Flux | `flux get kustomizations -A` | Reconciliation status |
| Flux | `flux reconcile source git my-repo` | Force git pull |
| Flux | `flux reconcile kustomization apps --with-source` | Force full sync |
| Flux | `flux suspend kustomization apps` | Pause reconciliation |

# ArgoCD Application CR (minimal)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/gitops.git
    path: apps/web/overlays/prod
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [CreateNamespace=true]

Node management — cordon, drain, uncordon

# Safe node maintenance sequence
kubectl cordon node-1                          # no new pods
kubectl drain node-1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=15m
kubectl uncordon node-1

# Check node conditions
kubectl describe node node-1 | grep -A5 Conditions
kubectl get nodes -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[-1].type,TAINTS:.spec.taints'   # quoted so the shell does not glob the brackets

# Taint node for dedicated workload
kubectl taint nodes gpu-1 nvidia.com/gpu=true:NoSchedule
kubectl taint nodes gpu-1 nvidia.com/gpu:NoSchedule-   # remove (trailing dash)

# OpenShift (oc) equivalents
oc adm cordon node-1
oc adm drain node-1 --ignore-daemonsets --delete-emptydir-data --force
oc adm uncordon node-1
oc get machineconfigpool                          # OCP node update pools

Rollouts, PDB & HPA (platform)

# PodDisruptionBudget: honored by kubectl drain; create one before node maintenance
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels: { app: web }

kubectl get pdb -A
kubectl rollout status deployment/web -w --timeout=5m

# Force restart all pods (same spec and image; adds a restartedAt annotation to the pod template)
kubectl rollout restart deployment/web
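
Whether a drain can proceed depends on how many voluntary disruptions the PDB currently allows: healthy pods minus minAvailable, never below zero. A minimal sketch of that calculation follows; `pdb_allowed` is a hypothetical helper, and on a live cluster the same figure appears in the ALLOWED DISRUPTIONS column of `kubectl get pdb`.

```shell
# pdb_allowed HEALTHY_PODS MIN_AVAILABLE
pdb_allowed() {
  allowed=$(( $1 - $2 ))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}

pdb_allowed 3 2   # 3 ready replicas with minAvailable: 2
pdb_allowed 2 2   # scaled down to 2: drain blocks until another pod is ready elsewhere
```

A result of 0 is the usual reason `kubectl drain` hangs until its `--timeout` expires.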

Storage — PVC, StorageClass, snapshots

kubectl get sc
kubectl get pvc,pv -A
kubectl describe pvc data-web-0 -n prod

# Expand PVC (StorageClass must allowVolumeExpansion: true)
kubectl patch pvc data-web-0 -n prod \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

# VolumeSnapshot
kubectl get volumesnapshot -A
kubectl create -f snapshot.yaml

# OpenShift (oc) equivalents
oc get storageclass
oc describe pvc data-web-0

Event debugging & cluster triage

kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp   # recent warnings cluster-wide
kubectl get pods -A --field-selector status.phase=Pending                     # stuck scheduling
kubectl get pods -A | grep -E 'Error|CrashLoop|Evicted|OOM'                   # unhealthy pods
kubectl top nodes                                                             # node CPU/memory (metrics-server)
kubectl top pods -A --sort-by=memory | head -20                               # memory hogs
kubectl get endpoints,endpointslice -n prod                                   # empty endpoints = selector mismatch
kubectl describe quota,limitrange -n prod                                     # namespace resource caps
kubectl get --raw '/readyz?verbose'                                           # API server health (quoted: ? is a shell glob)
kubectl get --raw /metrics | head                                             # raw API metrics (if enabled)

Tekton & OpenShift builds

# Tekton (generic K8s)
kubectl get task,pipeline,pipelinerun -n ci
tkn pipeline start build-deploy -p git-url=https://github.com/org/app -w name=shared-workspace,claimName=build-ws
tkn pipelinerun logs -f -n ci

# OpenShift builds (oc)
oc start-build web --from-dir=. --follow
oc logs -f bc/web
oc get builds
oc new-build nodejs~https://github.com/org/app.git
oc import-image nodejs:18 --confirm

cert-manager & External Secrets

# cert-manager Certificate
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls
spec:
  secretName: app-tls
  issuerRef: { name: letsencrypt-prod, kind: ClusterIssuer }
  dnsNames: [app.example.com]

kubectl get certificate,certificaterequest,clusterissuer -A
kubectl describe certificate app-tls -n prod

# External Secrets Operator
kubectl get externalsecret,secretstore -A
⚠️ Pitfall

Never run kubectl apply from CI against production; push to the GitOps repo instead. Direct applies cause drift, leave no audit trail, and race with concurrent pipelines.

Architect / Platform Cheat Sheet

etcd, scheduling, storage, security, upgrades, DR, multi-cluster. Chapters: Architecture, Scheduling, Multi-Cluster, Production Ops.

etcd — health, backup & restore

| Operation | Command | Notes |
|---|---|---|
| Cluster health | `etcdctl endpoint health --cluster` | All members must be healthy |
| Leader check | `etcdctl endpoint status -w table` | Watch RAFT INDEX lag |
| Member list | `etcdctl member list -w table` | Odd count: 3 or 5 |
| Snapshot | `etcdctl snapshot save /backup/etcd.db` | No need to stop writes; the snapshot is consistent |
| Verify snapshot | `etcdctl snapshot status /backup/etcd.db -w table` | Check hash + revision; etcd 3.5 deprecates this in favor of `etcdutl snapshot status` |
| Defrag | `etcdctl defrag --cluster` | Reclaim space; brief latency spike |
| Alarm list | `etcdctl alarm list` | NOSPACE = quota exceeded |

# Snapshot (on control plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F-%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Restore procedure (disaster — simplified)
# 1. Stop kube-apiserver on all control plane nodes
# 2. ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd.db \
#      --data-dir=/var/lib/etcd-restored --name=etcd-0 \
#      --initial-cluster=etcd-0=https://10.0.0.1:2380 \
#      --initial-advertise-peer-urls=https://10.0.0.1:2380
# 3. Update etcd manifest data-dir; restart etcd → apiserver

# OpenShift: run the bundled backup script on one control plane node
oc debug --as-root node/<control-plane-node> -- chroot /host \
  /usr/local/bin/cluster-backup.sh /home/core/assets/backup

Scheduling decision matrix

| Requirement | Primitive | Effect | Avoid |
|---|---|---|---|
| Dedicated GPU/spot nodes | Taint + Toleration | Repel/allow specific nodes | Hardcoded nodeName |
| Run on SSD nodes only | nodeSelector or nodeAffinity | Hard/soft label match | Taints for simple labels |
| Co-locate app + sidecar cache | podAffinity | Same node/zone | Shared emptyDir across nodes |
| One replica per node (HA) | podAntiAffinity or topologySpread | Spread across nodes | replicas > node count |
| Even spread across AZs | topologySpreadConstraints | maxSkew across zones | Manual per-zone deploys |
| Priority during eviction | priorityClassName | High priority survives pressure | BestEffort for critical apps |
| Preemption | PriorityClass value | Higher evicts lower | Same priority for all |

# Taint GPU node
kubectl taint nodes gpu-1 nvidia.com/gpu=present:NoSchedule

# Toleration in pod spec
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule

# Topology spread across zones
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: { app: web }
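
To reason about `maxSkew` without a cluster: skew is the gap between the fullest and emptiest zone, and with `DoNotSchedule` a zone is eligible only if (pods already there + 1) minus the global minimum stays within maxSkew. The helpers below are a hypothetical sketch of that rule; they take per-zone counts of matching pods as plain integers and ignore node taints, affinity, and minDomains.

```shell
# zone_skew COUNT...   prints max(count) - min(count)
zone_skew() {
  min=$1; max=$1
  for n in "$@"; do
    if [ "$n" -lt "$min" ]; then min=$n; fi
    if [ "$n" -gt "$max" ]; then max=$n; fi
  done
  echo $(( max - min ))
}

# fits_zone MAX_SKEW CANDIDATE_ZONE_COUNT ALL_ZONE_COUNTS...
fits_zone() {
  max_skew=$1; candidate=$2; shift 2
  lowest=$1
  for n in "$@"; do
    if [ "$n" -lt "$lowest" ]; then lowest=$n; fi
  done
  if [ $(( candidate + 1 - lowest )) -le "$max_skew" ]; then
    echo schedulable
  else
    echo unschedulable
  fi
}

zone_skew 3 2 2        # current skew across three zones
fits_zone 1 3 3 2 2    # add a pod to the zone that already has 3
fits_zone 1 2 3 2 2    # add a pod to a zone that has 2
```

With `whenUnsatisfiable: ScheduleAnyway` the same skew only lowers a node's score instead of filtering it out.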

Taints, effects & use cases

| Effect | New pods | Existing pods | Use case |
|---|---|---|---|
| NoSchedule | Blocked unless tolerated | Untouched | GPU, dedicated pools |
| PreferNoSchedule | Soft avoid | Untouched | Prefer on-demand over spot |
| NoExecute | Blocked | Evicted unless tolerated | Node maintenance, NotReady |

Built-in taints: node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute (pods receive a default 300 s toleration for both, so eviction starts after about 5 min), node.kubernetes.io/disk-pressure:NoSchedule.

Storage — access modes & class comparison

| Access mode | Abbrev | Meaning | Typical backend |
|---|---|---|---|
| ReadWriteOnce | RWO | One node read-write | EBS, Azure Disk, Ceph RBD |
| ReadOnlyMany | ROX | Many nodes read-only | NFS export, Config volumes |
| ReadWriteMany | RWX | Many nodes read-write | EFS, CephFS, Azure Files, NFS |
| ReadWriteOncePod | RWOP | Single pod exclusive (1.22+) | Block volumes needing true exclusive |

| Backend | Provisioner | Access | ReclaimPolicy (DB) | Binding mode |
|---|---|---|---|---|
| AWS EBS gp3 | ebs.csi.aws.com | RWO | Retain | WaitForFirstConsumer |
| GCE PD | pd.csi.storage.gke.io | RWO | Retain | WaitForFirstConsumer |
| Azure Disk | disk.csi.azure.com | RWO | Retain | WaitForFirstConsumer |
| AWS EFS | efs.csi.aws.com | RWX | Retain | Immediate |
| Rook-Ceph RBD | rook-ceph.rbd.csi.ceph.com | RWO | Retain | WaitForFirstConsumer |
| OCP ODF RBD (StorageClass ocs-storagecluster-ceph-rbd) | openshift-storage.rbd.csi.ceph.com | RWO | Retain | WaitForFirstConsumer |
| OCP ODF CephFS (StorageClass ocs-storagecluster-cephfs) | openshift-storage.cephfs.csi.ceph.com | RWX | Retain | Immediate |
| Longhorn | driver.longhorn.io | RWO/RWX | Retain | Immediate |

OpenShift SCC reference

| SCC | runAsUser | Capabilities | When to grant |
|---|---|---|---|
| restricted-v2 | Random non-root | Dropped | Default; no action needed |
| nonroot-v2 | Specific non-root UID | Dropped | App requires fixed non-root UID |
| anyuid | Any UID including root | Dropped | Legacy containers as root |
| privileged | Any | All | Operators/system only; never apps |
| hostnetwork-v2 | Non-root | Host network | DaemonSets needing hostNetwork |
| hostmount-anyuid | Any | Host path mounts | Legacy hostPath requirements |

# Diagnose SCC failure
oc describe pod failing-pod | tail -20
oc get scc
oc adm policy who-can use scc privileged

# Grant SCC to ServiceAccount (minimal privilege)
oc adm policy add-scc-to-user nonroot-v2 -z myapp-sa -n prod
oc adm policy add-scc-to-user anyuid -z legacy-sa -n legacy

Cluster upgrade checklist

| # | Step | Command / action |
|---|---|---|
| 1 | Check deprecated APIs | `pluto detect-files -d manifests/` |
| 2 | Verify PDBs exist | `kubectl get pdb -A` |
| 3 | Confirm node capacity | Enough spare nodes for rolling drain |
| 4 | Backup etcd | `etcdctl snapshot save`, or `cluster-backup.sh` on an OCP control plane node |
| 5 | Velero backup | `velero backup create pre-upgrade-$(date +%F)` |
| 6 | Upgrade control plane | kubeadm / managed service / `oc adm upgrade` |
| 7 | Upgrade workers | One node at a time; drain → upgrade → uncordon |
| 8 | Verify operators | `oc get clusteroperators` / CRD health |
| 9 | Smoke test | `kubectl get --raw '/readyz?verbose'` |
| 10 | Validate workloads | Critical app health checks + synthetic probes |

# OpenShift upgrade
oc adm upgrade
oc adm upgrade --to=4.16.8
oc adm upgrade --to-latest=true
oc get clusterversion
oc get mcp -w                         # MachineConfigPool progress
oc get co -w                          # ClusterOperators

# K8s version policy: stay within N-2 of latest minor

Velero — install, backup, restore, schedule

# Install (AWS example)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket k8s-backups-prod \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --use-volume-snapshots=true \
  --use-node-agent=false

# Backup
velero backup create daily-$(date +%F) \
  --include-namespaces prod,staging \
  --exclude-resources events,events.events.k8s.io \
  --ttl 720h0m0s

velero backup describe daily-2026-06-05 --details
velero backup logs daily-2026-06-05

# Schedule
velero schedule create nightly \
  --schedule="0 2 * * *" \
  --include-namespaces prod \
  --ttl 168h

# Restore
velero restore create restore-$(date +%F) \
  --from-backup daily-2026-06-05 \
  --include-namespaces prod

velero restore describe restore-2026-06-05
kubectl get restores.velero.io -n velero

Capacity planning & control plane sizing

| Cluster scale | Control plane (each) | etcd | Notes |
|---|---|---|---|
| < 100 nodes | 2 CPU / 8 GB | 2 CPU / 8 GB SSD | 3 CP nodes for HA |
| 100–500 nodes | 4 CPU / 16 GB | 4 CPU / 8 GB NVMe | Dedicated etcd disk, <10 ms latency |
| 500–3000 nodes | 8 CPU / 32 GB | 8 CPU / 16 GB NVMe | Watch API 429 rate limits |
| 3000+ nodes | 16+ CPU / 64 GB | 8 CPU / 32 GB NVMe | Consider etcd defrag schedule |

| Rule | Guidance |
|---|---|
| Node memory headroom | Sum of pod memory requests + 1–2 GB OS overhead per node |
| Cluster autoscaler buffer | 10–20% spare capacity for burst scheduling |
| etcd quota | Default 2 GB; monitor `etcd_mvcc_db_total_size_in_bytes` |
| Max pods per node | Default 110; reduce if dense memory workloads |
| API server throttling | 429 responses → increase priority/fairness or scale CP |
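
The etcd quota rule is easy to turn into an alert threshold. The sketch below is a hypothetical helper that converts a sampled `etcd_mvcc_db_total_size_in_bytes` value into a percentage of the quota; it assumes the default 2 GiB quota unless a second argument is given, and the 80% warning level in the usage note is an illustrative choice, not an etcd default.

```shell
# etcd_quota_pct DB_SIZE_BYTES [QUOTA_BYTES]   prints integer percent of quota in use
etcd_quota_pct() {
  quota=${2:-2147483648}   # 2 GiB default quota
  echo $(( $1 * 100 / quota ))
}

etcd_quota_pct 1610612736               # 1.5 GiB database against the default quota
etcd_quota_pct 1610612736 8589934592    # same database after raising the quota to 8 GiB
```

A reasonable pattern is to warn at around 80% and schedule compaction plus defrag well before the NOSPACE alarm fires.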

Production failure matrix

| Symptom | Root causes | Fix commands |
|---|---|---|
| ImagePullBackOff | Wrong tag, private registry, missing pull secret | describe pod; create docker-registry secret |
| CrashLoopBackOff | App error, bad CMD, missing config, probe too aggressive | logs --previous; relax startupProbe |
| Pending | Insufficient CPU/mem, taints, affinity, unbound PVC | describe pod; get nodes; get pvc |
| OOMKilled | Memory limit too low | Raise limit; profile heap; VPA recommendation |
| Evicted | Node disk/memory pressure | Clean images; add nodes; fix disk usage |
| Service unreachable | Selector mismatch, empty endpoints, NetworkPolicy | get endpoints; test from netshoot pod |
| SCC violation (OCP) | Root user, missing capability, hostPath | oc describe pod; grant correct SCC to SA |
| etcd NOSPACE | Quota exceeded | Compact, then defrag; `etcdctl alarm disarm`; raise quota if needed; restore from snapshot as a last resort |

Multi-cluster & service mesh decisions

| Pattern | Tool | When |
|---|---|---|
| Hub-spoke fleet management | Red Hat ACM + Hive | Policy, apps, upgrades across 10+ clusters |
| Hosted control planes | HyperShift / ROSA HCP | Fast, cheap cluster provisioning |
| Declarative cluster lifecycle | Cluster API (CAPI) | GitOps-native cluster create/destroy |
| App delivery multi-cluster | ArgoCD ApplicationSet | Same app to clusters by label |
| mTLS + traffic management | Istio / OpenShift Service Mesh | Microservices observability + zero-trust |
| Lightweight mTLS | Linkerd | Simpler mesh, lower overhead |

Disaster recovery — RTO/RPO targets

| Strategy | RPO | RTO | Complexity |
|---|---|---|---|
| etcd snapshot only | Hours (backup interval) | Hours | Low |
| Velero + volume snapshots | Minutes–hours | Hours | Medium |
| Active-passive cluster | Near-zero (async replication) | Minutes | High |
| Active-active multi-region | Near-zero | Minutes | Very high |

oc adm — platform administration

oc adm top nodes
oc adm top pods -A
oc adm groups sync --sync-config=ldap-sync.yaml --confirm
oc adm node-logs --role=master --path=openshift-apiserver/
oc adm must-gather                                    # support bundle
oc adm release info 4.16.8                          # release metadata
oc adm upgrade --allow-explicit-upgrade --to=4.16.8
oc adm pod-network join-projects --to=<project> --selector env=prod  # join network (OpenShift SDN multitenant mode only)
oc adm policy scc-subject-review -z myapp-sa -n prod -f pod.yaml  # which SCC admits this pod spec
oc get clusteroperators                             # platform health
oc get co -o custom-columns='NAME:.metadata.name,AVAILABLE:.status.conditions[?(@.type=="Available")].status'
oc get nodes -o wide
oc debug node/worker-1 -- chroot /host bash         # node shell (OCP 4)

GitOps & platform architecture checklist

| # | Decision | Recommendation |
|---|---|---|
| 1 | Package manager | Helm for third-party; Kustomize for in-house apps |
| 2 | GitOps engine | ArgoCD (UI + ApplicationSet) or Flux (modular) |
| 3 | Secrets in git | Sealed Secrets, External Secrets, or SOPS; never plain text |
| 4 | Ingress | Gateway API (future); Ingress + cert-manager today |
| 5 | Network policy | Default-deny + explicit allow per app |
| 6 | Pod security | PSA restricted (K8s) + SCC restricted-v2 (OCP) |
| 7 | Observability | Prometheus + Grafana + Loki + OTel (or OCP built-in monitoring) |
| 8 | Backup | etcd snapshot + Velero for namespace/PV DR |
| 9 | Multi-tenancy | Namespace isolation + RBAC + NetworkPolicy + quotas |
| 10 | Cluster upgrades | N-2 policy; PDB; blue-green cluster for zero-downtime migrations |

🎯 Interview Tip

System design whiteboard: draw control plane (API → etcd), explain scheduler filter/score/bind, mention PDB + topology spread for HA, etcd backup for RPO, and GitOps for drift control. For OCP, add SCC + Routes + Operators.

⚖️ Trade-off

Retain reclaim policy protects database PVs but requires manual cleanup — automate with Velero TTL policies or a PV janitor CronJob.