Cheat Sheets

Three exhaustive quick references by persona: daily kubectl/oc workflows, platform operations, and cluster architecture decisions.

Versions and exams covered: K8s 1.28+, OCP 4.16+, CKA/CKAD/CKS

Developer Cheat Sheet

Daily commands, pod lifecycle, YAML templates, debugging, networking. Chapters: Workloads, Configuration, Networking.

Context, namespace & discovery

Task	Command	Notes
List contexts	`kubectl config get-contexts`	`*` = current
Switch context	`kubectl config use-context prod-east`	Cluster + user + namespace
Current context	`kubectl config current-context`	—
Set default namespace	`kubectl config set-context --current --namespace=dev`	Avoid `-n` on every command
List all namespaces	`kubectl get ns`	`oc get projects` on OCP
Explain API field	`kubectl explain pod.spec.containers.livenessProbe`	Recursive: `--recursive`
API resources	`kubectl api-resources`	Short names: `po`, `deploy`, `svc`
Watch resource	`kubectl get pods -w`	`--watch-only` skips initial list

kubectl get — common flags

Flag	Purpose	Example
`-n`	Namespace	`kubectl get pods -n prod`
`-A` / `--all-namespaces`	All namespaces	`kubectl get pods -A`
`-l`	Label selector	`-l app=web,env=prod`
`-o wide`	Extra columns (node, IP)	`kubectl get pods -o wide`
`-o yaml` / `-o json`	Full manifest	Pipe to `grep` / save template
`-o name`	Resource name only	For loops: `for p in $(kubectl get pods -o name); do …`
`--sort-by`	Sort output	`--sort-by=.metadata.creationTimestamp`
`--field-selector`	Filter by field	`status.phase=Pending`
`--show-labels`	Print labels column	Debug selector mismatches
`-w`	Watch stream	Live deploy monitoring

Apply, create & imperative shortcuts

Command	Purpose	Notes
`kubectl apply -f manifest.yaml`	Declarative create/update	Preferred for GitOps
`kubectl apply -f dir/`	Apply all YAML in directory	Recursive with `-R`
`kubectl apply -k overlays/dev`	Kustomize overlay	Built-in Kustomize
`kubectl apply -f manifest.yaml --dry-run=client -o yaml`	Client-side dry run	Prints the resulting object; nothing is persisted
`kubectl apply --server-side`	Server-side apply	Field ownership tracking
`kubectl diff -f manifest.yaml`	Preview changes	CI gate before apply
`kubectl create deployment web --image=nginx:1.25`	Imperative deploy	Quick test; export to YAML after
`kubectl expose deployment web --port=80`	Create Service	ClusterIP by default
`kubectl scale deployment web --replicas=5`	Scale replicas	HPA overrides manual scale
`kubectl delete -f manifest.yaml`	Delete by manifest	`--ignore-not-found` in scripts
`kubectl replace --force -f pod.yaml`	Delete + recreate	Disruptive; avoid in prod

# Export running deployment to YAML (strip cluster metadata; requires the kubectl-neat plugin)
kubectl get deploy web -o yaml | kubectl neat > web.yaml

# Generate YAML without creating (exam pattern)
kubectl create deployment web --image=nginx --dry-run=client -o yaml > deploy.yaml

# OpenShift
oc new-app --name=web nginx:1.25
oc expose svc/web                     # Route for the Service created by new-app
oc get route web -o jsonpath='{.spec.host}'
oc status
oc logs -f deploy/web                 # new-app creates a Deployment on OCP 4.5+

Pod lifecycle, phases & conditions

Status / Phase	Meaning	First diagnostic step
`Pending`	Not yet scheduled or image pulling	`kubectl describe pod` → Events; check scheduler/taints/PVC
`ContainerCreating`	Scheduled; pulling image / mounting volumes	Events; `describe` volume mounts
`Running`	At least one container running	Check readiness if not receiving traffic
`Succeeded`	All containers exited 0 (Job)	Expected for batch; check Job status
`Failed`	Container exited non-zero	`kubectl logs --previous`
`CrashLoopBackOff`	Container crash → backoff retry	`logs --previous`; check exit code, probes, config
`ImagePullBackOff`	Cannot pull image	Verify tag, registry auth, imagePullSecrets
`ErrImagePull`	Immediate pull failure	Same as ImagePullBackOff
`OOMKilled`	Memory limit exceeded	Raise `limits.memory` or fix leak
`Evicted`	Node pressure eviction	`kubectl describe node`; disk/memory pressure
`Terminating`	Deletion in progress	Stuck? Check finalizers, PDB, grace period

Condition	True means
`PodScheduled`	Assigned to a node
`Initialized`	Init containers completed
`ContainersReady`	All containers passed readiness
`Ready`	Pod eligible for Service endpoints

Probes — liveness, readiness, startup

Probe	Purpose	Failure action	Typical settings
readinessProbe	Ready for traffic?	Removed from Service endpoints	`periodSeconds: 10`, `failureThreshold: 3`
livenessProbe	Process alive?	Container restart	Less aggressive than readiness; avoid DB checks
startupProbe	Slow-start apps	Blocks liveness until success	`failureThreshold: 30` for JVM warm-up

readinessProbe:
  httpGet: { path: /health/ready, port: 8080, scheme: HTTP }
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
livenessProbe:
  httpGet: { path: /health/live, port: 8080 }
  periodSeconds: 20
  failureThreshold: 3
startupProbe:
  httpGet: { path: /health/ready, port: 8080 }
  periodSeconds: 5
  failureThreshold: 30   # 150s startup budget
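
The `150s startup budget` comment follows from `periodSeconds × failureThreshold`. A minimal shell sketch of that arithmetic, where `probe_budget` is a hypothetical helper (not a kubectl feature) and the optional third argument stands in for `initialDelaySeconds`:

```shell
# probe_budget PERIOD_SECONDS FAILURE_THRESHOLD [INITIAL_DELAY_SECONDS]
# Prints the approximate seconds a container gets before the probe is
# considered failed (timeouts per attempt are ignored in this sketch).
probe_budget() {
  local period=$1 threshold=$2 delay=${3:-0}
  echo $(( delay + period * threshold ))
}

probe_budget 5 30    # startupProbe values from the YAML: prints 150
probe_budget 20 3    # livenessProbe values from the YAML: prints 60
```

Raising `failureThreshold` rather than `periodSeconds` usually keeps detection responsive once the app is up while still granting a long start window.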

Resources, QoS & limits

QoS class	Condition	Eviction order
Guaranteed	requests == limits for all containers (cpu+memory)	Last evicted
Burstable	requests set, limits differ or partial	Middle
BestEffort	No requests or limits	First evicted

Field	Used for	Production guidance
`requests.cpu`	Scheduling + HPA %	Set based on p95 usage
`limits.cpu`	CPU throttling cap	Omit or set high for latency-sensitive apps
`requests.memory`	Scheduling	Set == limits for Guaranteed QoS on stateful
`limits.memory`	Hard cap → OOMKill	Always set; prevents node-wide OOM

resources:
  requests: { cpu: 250m, memory: 512Mi }
  limits:   { cpu: "1", memory: 512Mi }   # memory request == limit, but CPU differs, so the pod is Burstable

kubectl top pod web-abc -n prod
kubectl describe pod web-abc | grep -A5 "Limits\|Requests\|QoS"
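
The QoS table can be read as a small decision function. The sketch below covers a single container only and assumes the values are already normalized (CPU in millicores, memory in MiB), with an empty string meaning "unset"; `qos_class` is a hypothetical helper name.

```shell
# qos_class CPU_REQ CPU_LIM MEM_REQ MEM_LIM   (pass "" for an unset value)
qos_class() {
  local cpu_req=$1 cpu_lim=$2 mem_req=$3 mem_lim=$4
  # An omitted request defaults to the limit when a limit is set.
  if [ -z "$cpu_req" ]; then cpu_req=$cpu_lim; fi
  if [ -z "$mem_req" ]; then mem_req=$mem_lim; fi

  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo BestEffort
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] \
       && [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 250 1000 512 512   # the resources example: prints Burstable
qos_class 500 500 256 256    # prints Guaranteed
```

For multi-container pods the same check must hold for every container before the pod is classed Guaranteed.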

Logs, exec, debug & copy

Command	Purpose
`kubectl logs -f deploy/web -c app --tail=200`	Stream logs from deployment container
`kubectl logs pod/web-abc --previous`	Logs from crashed container instance
`kubectl logs -l app=web --all-containers --prefix`	All pods matching label
`kubectl logs pod/web-abc --since=10m --timestamps`	Time-bounded logs
`kubectl exec -it pod/web-abc -c app -- sh`	Interactive shell
`kubectl exec deploy/web -- curl -s localhost:8080/health`	One-off command
`kubectl debug pod/web-abc -it --image=busybox --target=app`	Ephemeral debug container (K8s 1.23+)
`kubectl debug node/node-1 -it --image=ubuntu`	Node shell via privileged debug pod
`kubectl cp web-abc:/var/log/app.log ./app.log`	Copy files from pod (syntax is `[namespace/]pod:path`; needs `tar` in the container)
`kubectl attach pod/web-abc -c app -i -t`	Attach to running process stdin
`kubectl get events -n dev --sort-by=.lastTimestamp`	Recent events in the `dev` namespace

# Debug toolbox pod (curl, dig, tcpdump, iperf)
kubectl run netshoot --rm -it --restart=Never \
  --image=nicolaka/netshoot -- bash

# Test service DNS from inside cluster
kubectl run tmp --rm -it --restart=Never --image=busybox -- \
  wget -qO- http://web.dev.svc.cluster.local/health

# OpenShift
oc logs -f dc/web
oc rsh pod/web-abc
oc exec pod/web-abc -- curl -s localhost:8080
oc debug pod/web-abc -it --image=registry.redhat.io/ubi9/ubi-minimal

Port-forward, proxy & local access

Command	Maps	Use when
`kubectl port-forward svc/web 8080:80`	Service → local	Test app without Ingress
`kubectl port-forward pod/web-abc 8080:8080`	Pod → local	Debug specific pod instance
`kubectl port-forward deploy/web 8080:8080`	Deployment (one pod) → local	Any replica will do; traffic is not load-balanced
`kubectl port-forward -n kube-system svc/prometheus 9090:9090`	Cross-namespace	Local Grafana/Prometheus access
`kubectl proxy --port=8001`	API server proxy	`localhost:8001/api/v1/namespaces`

DNS & service discovery

Pattern	Resolves to
`<service>`	Same namespace Service ClusterIP
`<service>.<namespace>`	Cross-namespace Service
`<service>.<namespace>.svc.cluster.local`	FQDN (always works)
`<pod-ip-dashed>.<namespace>.pod.cluster.local`	Pod A record (pod IP with dots replaced by dashes)
`<statefulset-0>.<headless-svc>.<ns>.svc.cluster.local`	StatefulSet pod DNS

# Inside pod — check DNS
cat /etc/resolv.conf          # nameserver = CoreDNS (10.96.0.10 typical)
nslookup web
dig +short web.dev.svc.cluster.local
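
The naming patterns in the table are mechanical, so they can be generated rather than typed. A small sketch, assuming the default `cluster.local` cluster domain; `svc_fqdn` and `pod_dns` are hypothetical helper names.

```shell
# svc_fqdn SERVICE NAMESPACE [CLUSTER_DOMAIN]
svc_fqdn() {
  local svc=$1 ns=$2 domain=${3:-cluster.local}
  echo "${svc}.${ns}.svc.${domain}"
}

# pod_dns POD_IP NAMESPACE [CLUSTER_DOMAIN]  (dots in the IP become dashes)
pod_dns() {
  local ip=$1 ns=$2 domain=${3:-cluster.local}
  echo "$(echo "$ip" | tr '.' '-').${ns}.pod.${domain}"
}

svc_fqdn web dev          # prints web.dev.svc.cluster.local
pod_dns 10.244.1.7 dev    # prints 10-244-1-7.dev.pod.cluster.local
```

Inside a pod, the output can be passed straight to `dig +short` or `nslookup` to confirm the record resolves.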

Deployment + Service YAML (production baseline)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels: { app: web, app.kubernetes.io/name: web }
spec:
  replicas: 3
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxUnavailable: 0, maxSurge: 1 }
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      serviceAccountName: web
      securityContext:
        runAsNonRoot: true
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: app
          image: nginx:1.25-alpine@sha256:…   # placeholder; runAsNonRoot needs an image that runs as non-root
          ports: [{ name: http, containerPort: 8080 }]
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits:   { cpu: 500m, memory: 256Mi }
          readinessProbe:
            httpGet: { path: /health/ready, port: http }
            periodSeconds: 10
          livenessProbe:
            httpGet: { path: /health/live, port: http }
            periodSeconds: 20
          securityContext:
            allowPrivilegeEscalation: false
            capabilities: { drop: [ALL] }
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP
  selector: { app: web }
  ports:
    - name: http
      port: 80
      targetPort: http
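
The `maxUnavailable: 0, maxSurge: 1` strategy in the Deployment bounds how many pods exist during a rollout. A rough sketch of those bounds for absolute values only (percentage forms are not handled here); `rollout_bounds` is a hypothetical helper name.

```shell
# rollout_bounds REPLICAS MAX_SURGE MAX_UNAVAILABLE
# Prints the minimum available pods and the maximum total pods mid-rollout.
rollout_bounds() {
  local replicas=$1 max_surge=$2 max_unavailable=$3
  echo "min_available=$(( replicas - max_unavailable )) max_total=$(( replicas + max_surge ))"
}

rollout_bounds 3 1 0    # the baseline above: min_available=3 max_total=4
```

With these settings the cluster needs headroom for one extra pod, which is worth checking against namespace quotas before the first rollout.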

ConfigMap, Secret & Ingress YAML

# ConfigMap — file + key
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  app.properties: |
    server.port=8080
    feature.flags=beta
  LOG_LEVEL: info
---
# Secret — Opaque (base64 values; enable encryption at rest in prod)
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
stringData:                    # prefer stringData over data in git templates
  DB_PASSWORD: changeme
---
# Consume in pod (fragment: envFrom, env and volumeMounts belong to a container; volumes to the pod spec)
envFrom:
  - configMapRef: { name: app-config }
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef: { name: app-secrets, key: DB_PASSWORD }
volumeMounts:
  - name: config-vol
    mountPath: /etc/app
    readOnly: true
volumes:
  - name: config-vol
    configMap:
      name: app-config
---
# Ingress (requires ingress controller)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls: [{ hosts: [app.example.com], secretName: app-tls }]
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: web, port: { number: 80 } }

Job, CronJob & HPA snippets

# Job — run to completion
apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-db
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: myapp:migrate-v3
          command: ["./migrate.sh"]
---
# CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: myapp:report
---
# HPA — requires metrics-server + CPU requests set
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
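
The HPA scales using `desired = ceil(currentReplicas × currentUtilization / targetUtilization)`, clamped to `minReplicas` and `maxReplicas`. The sketch below reproduces only that core formula with integer math; it deliberately ignores the controller's tolerance band and stabilization window, and `hpa_desired` is a hypothetical helper name.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT MIN MAX
hpa_desired() {
  local current=$1 util=$2 target=$3 min=$4 max=$5
  # Integer ceiling of current * util / target
  local desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

hpa_desired 3 90 70 2 10     # 3 pods at 90% vs a 70% target: prints 4
hpa_desired 3 350 70 2 10    # would need 15, capped at maxReplicas: prints 10
```

Because utilization is measured against `requests.cpu`, the same load yields different replica counts if requests are changed.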

Rollout commands (developer view)

kubectl rollout status deployment/web
kubectl rollout history deployment/web
kubectl rollout undo deployment/web
kubectl rollout undo deployment/web --to-revision=3
kubectl rollout pause deployment/web    # batch manifest changes
kubectl rollout resume deployment/web
kubectl set image deployment/web app=myreg/web:v2.4.1
kubectl set env deployment/web FEATURE_X=true

# OpenShift
oc rollout status dc/web
oc rollout history dc/web
oc set env dc/web FEATURE_X=true
oc tag myapp:v2.4.1 web:latest          # ImageStream trigger

💡 Pro Tip

kubectl explain + --dry-run=client -o yaml covers most CKAD exam YAML generation—memorize probe fields and resource units (100m, 128Mi).
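
For the resource units mentioned in the tip, a quick conversion sketch can help when checking arithmetic by hand. It assumes whole-core or `m`-suffixed CPU values and `Ki`/`Mi`/`Gi` memory suffixes only (decimal cores such as `0.5` and SI suffixes are out of scope); `cpu_to_milli` and `mem_to_bytes` are hypothetical helper names.

```shell
# cpu_to_milli VALUE   e.g. 100m -> 100, 2 -> 2000
cpu_to_milli() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# mem_to_bytes VALUE   e.g. 128Mi -> 134217728
mem_to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;   # plain bytes
  esac
}

cpu_to_milli 100m     # prints 100
mem_to_bytes 128Mi    # prints 134217728
```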

DevOps / Platform Cheat Sheet

RBAC, Helm, Kustomize, GitOps, node ops, networking, observability. Chapters: RBAC, Helm, GitOps, Production Ops.

RBAC — audit & grant commands

Command	Purpose
`kubectl auth can-i create pods`	Check your permissions
`kubectl auth can-i create pods --as=system:serviceaccount:ns:sa`	Check SA permissions
`kubectl auth can-i --list --as=alice@corp.com`	List all allowed verbs
`kubectl auth can-i get secrets -n prod --as=alice --as-group=dev-team`	Check group access (`--as-group` requires `--as`)
`kubectl create rolebinding dev-edit --clusterrole=edit --user=alice -n dev`	Grant namespace edit
`kubectl create clusterrolebinding alice-admin --clusterrole=cluster-admin --user=alice`	Cluster admin (avoid)
`kubectl get role,rolebinding,clusterrole,clusterrolebinding -A`	Audit bindings

# Minimal Role + RoleBinding
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n dev
kubectl create rolebinding read-pods --role=pod-reader \
  --serviceaccount=dev:ci-runner -n dev

# OpenShift
oc adm policy add-role-to-user edit alice -n dev
oc adm policy add-cluster-role-to-group cluster-reader ops-team
oc adm policy who-can get secrets -n prod
oc adm policy add-scc-to-user anyuid -z legacy-sa -n legacy
oc adm policy add-scc-to-user privileged -z operator-sa -n operators

Pod Security Admission (PSA) namespace labels

# Enforce restricted profile on namespace
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

kubectl label namespace prod \
  pod-security.kubernetes.io/enforce=restricted --overwrite

NetworkPolicy patterns

# Default deny all ingress in namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes: [Ingress]
---
# Allow ingress from same namespace + from ingress-nginx
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-allow
spec:
  podSelector:
    matchLabels: { app: web }
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector: {}
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: ingress-nginx }
      ports: [{ protocol: TCP, port: 8080 }]
  egress:
    - to:
        - podSelector:
            matchLabels: { app: postgres }
      ports: [{ protocol: TCP, port: 5432 }]
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels: { k8s-app: kube-dns }
      ports: [{ protocol: UDP, port: 53 }, { protocol: TCP, port: 53 }]

Helm — install, upgrade, lifecycle

Command	Purpose
`helm repo add bitnami https://charts.bitnami.com/bitnami`	Add chart repository
`helm repo update`	Refresh chart index
`helm search repo redis`	Find charts
`helm show values bitnami/redis`	Default values
`helm install redis bitnami/redis -f values.yaml -n cache --create-namespace`	Install release
`helm upgrade redis bitnami/redis -f values.yaml --atomic --timeout 10m`	Upgrade with rollback on fail
`helm rollback redis 2`	Rollback to revision
`helm history redis`	Revision list
`helm get values redis -a`	Deployed values
`helm template myapp ./chart -f prod.yaml`	Render without install
`helm uninstall redis -n cache`	Remove release
`helm lint ./chart`	Validate chart
`helm plugin install https://github.com/databus23/helm-diff`	Diff before upgrade

# Production upgrade pattern
helm diff upgrade redis bitnami/redis -f values-prod.yaml -n cache
helm upgrade redis bitnami/redis -f values-prod.yaml -n cache \
  --atomic --cleanup-on-fail --timeout 15m --wait

# OCI registry charts
helm install myapp oci://registry.example.com/charts/myapp --version 1.2.3

Kustomize — base, overlays, patches

# Directory layout
# base/kustomization.yaml  deployment.yaml  service.yaml
# overlays/dev/kustomization.yaml
# overlays/prod/kustomization.yaml

# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources: [deployment.yaml, service.yaml]
commonLabels: { app.kubernetes.io/managed-by: kustomize }

# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources: [../../base]
namespace: prod
namePrefix: prod-
replicas: [{ name: web, count: 5 }]
images:
  - name: myapp
    newName: registry.example.com/myapp
    newTag: v2.4.1
patches:
  - path: resources-patch.yaml
  - target: { kind: Deployment, name: web }
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 1Gi
configMapGenerator:
  - name: app-config
    literals: [ENV=prod]
secretGenerator:
  - name: app-secrets
    envs: [.env.prod]

kubectl apply -k overlays/prod
kustomize build overlays/prod | kubectl diff -f -

ArgoCD & Flux GitOps

Tool	Command	Purpose
ArgoCD	`argocd app list`	List applications
ArgoCD	`argocd app sync myapp --prune`	Sync + delete orphans
ArgoCD	`argocd app diff myapp`	Live vs git diff
ArgoCD	`argocd app rollback myapp`	Rollback deployment history
ArgoCD	`argocd app set myapp --sync-policy automated --self-heal`	Enable auto-sync
Flux	`flux get kustomizations -A`	Reconciliation status
Flux	`flux reconcile source git my-repo`	Force git pull
Flux	`flux reconcile kustomization apps --with-source`	Force full sync
Flux	`flux suspend kustomization apps`	Pause reconciliation

# ArgoCD Application CR (minimal)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/gitops.git
    path: apps/web/overlays/prod
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [CreateNamespace=true]

Node management — cordon, drain, uncordon

# Safe node maintenance sequence
kubectl cordon node-1                          # no new pods
kubectl drain node-1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=15m
kubectl uncordon node-1

# Check node conditions
kubectl describe node node-1 | grep -A5 Conditions
kubectl get nodes -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[-1].type,TAINTS:.spec.taints'

# Taint node for dedicated workload
kubectl taint nodes gpu-1 nvidia.com/gpu=true:NoSchedule
kubectl taint nodes gpu-1 nvidia.com/gpu:NoSchedule-   # remove (note the trailing dash)

# OpenShift
oc adm cordon node-1
oc adm drain node-1 --ignore-daemonsets --delete-emptydir-data --force
oc adm uncordon node-1
oc get machineconfigpool                          # OCP node update pools

Rollouts, PDB & HPA (platform)

# PodDisruptionBudget: drain honors it, so define one before node maintenance
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels: { app: web }

kubectl get pdb -A
kubectl rollout status deployment/web -w --timeout=5m

# Force restart all pods (same spec and image; stamps a restartedAt annotation)
kubectl rollout restart deployment/web
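
Whether a drain can proceed depends on how many voluntary disruptions the PDB currently allows, which for `minAvailable` is the number of healthy pods minus `minAvailable`, floored at zero. A minimal sketch for the absolute-number form only; `pdb_allowed` is a hypothetical helper name.

```shell
# pdb_allowed HEALTHY_PODS MIN_AVAILABLE
pdb_allowed() {
  local healthy=$1 min_available=$2
  local allowed=$(( healthy - min_available ))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}

pdb_allowed 3 2    # 3 healthy replicas, minAvailable 2: prints 1
pdb_allowed 2 2    # prints 0, so a drain would block on this PDB
```

The live value is shown in the ALLOWED DISRUPTIONS column of `kubectl get pdb`.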

Storage — PVC, StorageClass, snapshots

kubectl get sc
kubectl get pvc,pv -A
kubectl describe pvc data-web-0 -n prod

# Expand PVC (StorageClass must allowVolumeExpansion: true)
kubectl patch pvc data-web-0 -n prod \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

# VolumeSnapshot
kubectl get volumesnapshot -A
kubectl create -f snapshot.yaml

# OpenShift
oc get storageclass
oc describe pvc data-web-0

Event debugging & cluster triage

Command	Finds
`kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp`	Recent warnings cluster-wide
`kubectl get pods -A --field-selector status.phase=Pending`	Stuck scheduling
`kubectl get pods -A | grep -E 'Error|CrashLoop|Evicted|OOM'`	Unhealthy pods
`kubectl top nodes`	Node CPU/memory (metrics-server)
`kubectl top pods -A --sort-by=memory | head -20`	Memory hogs
`kubectl get endpoints,endpointslice -n prod`	Empty endpoints = selector mismatch
`kubectl describe quota,limitrange -n prod`	Namespace resource caps
`kubectl get --raw '/readyz?verbose'`	API server health
`kubectl get --raw /metrics | head`	Raw API metrics (if enabled)

Tekton & OpenShift builds

# Tekton (generic K8s)
kubectl get task,pipeline,pipelinerun -n ci
tkn pipeline start build-deploy -p git-url=https://github.com/org/app -w name=shared-workspace,claimName=build-ws
tkn pipelinerun logs -f -n ci

# OpenShift builds
oc start-build web --from-dir=. --follow
oc logs -f bc/web
oc get builds
oc new-build nodejs~https://github.com/org/app.git
oc import-image nodejs:18 --confirm

cert-manager & External Secrets

# cert-manager Certificate
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls
spec:
  secretName: app-tls
  issuerRef: { name: letsencrypt-prod, kind: ClusterIssuer }
  dnsNames: [app.example.com]

kubectl get certificate,certificaterequest,clusterissuer -A
kubectl describe certificate app-tls -n prod

# External Secrets Operator
kubectl get externalsecret,secretstore -A

⚠️ Pitfall

Never run kubectl apply from CI against production; push to the GitOps repo instead. Direct applies cause drift, leave no audit trail, and race with concurrent pipelines.

Architect / Platform Cheat Sheet

etcd, scheduling, storage, security, upgrades, DR, multi-cluster. Chapters: Architecture, Scheduling, Multi-Cluster, Production Ops.

etcd — health, backup & restore

Operation	Command	Notes
Cluster health	`etcdctl endpoint health --cluster`	All members must be healthy
Leader check	`etcdctl endpoint status -w table`	Watch `RAFT INDEX` lag
Member list	`etcdctl member list -w table`	Odd count: 3 or 5
Snapshot	`etcdctl snapshot save /backup/etcd.db`	No need to stop writes; the snapshot is consistent
Verify snapshot	`etcdctl snapshot status /backup/etcd.db -w table`	Check hash + revision
Defrag	`etcdctl defrag --cluster`	Reclaim space; brief latency spike
Alarm list	`etcdctl alarm list`	`NOSPACE` = quota exceeded
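
The "odd count: 3 or 5" guidance comes from Raft quorum arithmetic: quorum is `floor(n/2) + 1`, and the cluster tolerates `n - quorum` failed members. A short sketch that makes the point that a fourth member adds no fault tolerance; `etcd_quorum` and `etcd_tolerance` are hypothetical helper names.

```shell
# etcd_quorum MEMBERS     -> members needed to commit a write
etcd_quorum() { echo $(( $1 / 2 + 1 )); }

# etcd_tolerance MEMBERS  -> members that can fail without losing quorum
etcd_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerates=$(etcd_tolerance "$n")"
done
# members=3 quorum=2 tolerates=1
# members=4 quorum=3 tolerates=1
# members=5 quorum=3 tolerates=2
```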

# Snapshot (on control plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F-%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Restore procedure (disaster — simplified)
# 1. Stop kube-apiserver on all control plane nodes
# 2. ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd.db \
#      --data-dir=/var/lib/etcd-restored --name=etcd-0 \
#      --initial-cluster=etcd-0=https://10.0.0.1:2380 \
#      --initial-advertise-peer-urls=https://10.0.0.1:2380
# 3. Update etcd manifest data-dir; restart etcd → apiserver

# OpenShift (runs the backup script shipped on control plane nodes)
oc debug node/<control-plane-node> -- chroot /host \
  /usr/local/bin/cluster-backup.sh /home/core/assets/backup

Scheduling decision matrix

Requirement	Primitive	Effect	Avoid
Dedicated GPU/spot nodes	Taint + Toleration	Repel/allow specific nodes	Hardcoded nodeName
Run on SSD nodes only	nodeSelector or nodeAffinity	Hard/soft label match	Taints for simple labels
Co-locate app + sidecar cache	podAffinity	Same node/zone	Shared emptyDir across nodes
One replica per node (HA)	podAntiAffinity or topologySpread	Spread across nodes	replicas > node count
Even spread across AZs	topologySpreadConstraints	maxSkew across zones	Manual per-zone deploys
Priority during eviction	priorityClassName	High priority survives pressure	BestEffort for critical apps
Preemption	PriorityClass value	Higher evicts lower	Same priority for all

# Taint GPU node
kubectl taint nodes gpu-1 nvidia.com/gpu=present:NoSchedule

# Toleration in pod spec
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule

# Topology spread across zones
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: { app: web }
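
With `whenUnsatisfiable: DoNotSchedule`, a zone is rejected when placing the pod there would push the skew above `maxSkew`, where skew is the matching-pod count in that zone (after placement) minus the current minimum across zones. A simplified sketch of that check, ignoring node filters and `minDomains`; `spread_allows` is a hypothetical helper name.

```shell
# spread_allows MAX_SKEW PODS_IN_TARGET_ZONE MIN_PODS_IN_ANY_ZONE
spread_allows() {
  local max_skew=$1 target_count=$2 min_count=$3
  if [ $(( target_count + 1 - min_count )) -le "$max_skew" ]; then
    echo yes
  else
    echo no
  fi
}

# Zones currently hold a=2, b=1 matching pods; maxSkew is 1.
spread_allows 1 2 1    # place in zone a: prints no  (skew would be 2)
spread_allows 1 1 1    # place in zone b: prints yes (skew would be 1)
```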

Taints, effects & use cases

Effect	New pods	Existing pods	Use case
`NoSchedule`	Blocked unless tolerated	Untouched	GPU, dedicated pools
`PreferNoSchedule`	Soft avoid	Untouched	Prefer on-demand over spot
`NoExecute`	Blocked	Evicted unless tolerated	Node maintenance, NotReady

Built-in taints: node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute (pods get a default 300s toleration for both, so eviction starts after 5 min), node.kubernetes.io/disk-pressure:NoSchedule.

Storage — access modes & class comparison

Access mode	Abbrev	Meaning	Typical backend
ReadWriteOnce	RWO	One node read-write	EBS, Azure Disk, Ceph RBD
ReadOnlyMany	ROX	Many nodes read-only	NFS export, Config volumes
ReadWriteMany	RWX	Many nodes read-write	EFS, CephFS, Azure Files, NFS
ReadWriteOncePod	RWOP	Single pod exclusive (alpha in 1.22, GA in 1.29)	Block volumes needing true exclusive access

Backend	Provisioner	Access	ReclaimPolicy (DB)	Binding mode
AWS EBS gp3	ebs.csi.aws.com	RWO	Retain	WaitForFirstConsumer
GCE PD	pd.csi.storage.gke.io	RWO	Retain	WaitForFirstConsumer
Azure Disk	disk.csi.azure.com	RWO	Retain	WaitForFirstConsumer
AWS EFS	efs.csi.aws.com	RWX	Retain	Immediate
Rook-Ceph RBD	rook-ceph.rbd.csi.ceph.com	RWO	Retain	WaitForFirstConsumer
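OCP ODF RBD	openshift-storage.rbd.csi.ceph.com (StorageClass `ocs-storagecluster-ceph-rbd`)	RWO	Retain	WaitForFirstConsumer
OCP ODF CephFS	openshift-storage.cephfs.csi.ceph.com (StorageClass `ocs-storagecluster-cephfs`)	RWX	Retain	Immediate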
Longhorn	driver.longhorn.io	RWO/RWX	Retain	Immediate

OpenShift SCC reference

SCC	runAsUser	Capabilities	When to grant
`restricted-v2`	Random non-root	Dropped	Default — no action needed
`nonroot-v2`	Specific non-root UID	Dropped	App requires fixed non-root UID
`anyuid`	Any UID including root	Default set (only MKNOD dropped)	Legacy containers as root
`privileged`	Any	All	Operators/system only — never apps
`hostnetwork-v2`	Non-root	Host network	DaemonSets needing hostNetwork
`hostmount-anyuid`	Any	Host path mounts	Legacy hostPath requirements

# Diagnose SCC failure
oc describe pod failing-pod | tail -20
oc get scc
oc adm policy who-can use scc privileged

# Grant SCC to ServiceAccount (minimal privilege)
oc adm policy add-scc-to-user nonroot-v2 -z myapp-sa -n prod
oc adm policy add-scc-to-user anyuid -z legacy-sa -n legacy

Cluster upgrade checklist

#	Step	Command / action
1	Check deprecated APIs	`pluto detect-files -d manifests/`
2	Verify PDBs exist	`kubectl get pdb -A`
3	Confirm node capacity	Enough spare nodes for rolling drain
4	Backup etcd	`etcdctl snapshot save` or `cluster-backup.sh` on an OCP control plane node
5	Velero backup	`velero backup create pre-upgrade-$(date +%F)`
6	Upgrade control plane	kubeadm / managed service / `oc adm upgrade`
7	Upgrade workers	One node at a time; drain → upgrade → uncordon
8	Verify operators	`oc get clusteroperators` / CRD health
9	Smoke test	`kubectl get --raw '/readyz?verbose'`
10	Validate workloads	Critical app health checks + synthetic probes

# OpenShift upgrade
oc adm upgrade
oc adm upgrade --to=4.16.8
oc adm upgrade --to-latest=true
oc get clusterversion
oc get mcp -w                         # MachineConfigPool progress
oc get co -w                          # ClusterOperators

# K8s version policy: stay within N-2 of latest minor

Velero — install, backup, restore, schedule

# Install (AWS example)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket k8s-backups-prod \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --use-volume-snapshots=true \
  --use-node-agent=false

# Backup
velero backup create daily-$(date +%F) \
  --include-namespaces prod,staging \
  --exclude-resources events,events.events.k8s.io \
  --ttl 720h0m0s

velero backup describe daily-2026-06-05 --details
velero backup logs daily-2026-06-05

# Schedule
velero schedule create nightly \
  --schedule="0 2 * * *" \
  --include-namespaces prod \
  --ttl 168h

# Restore
velero restore create restore-$(date +%F) \
  --from-backup daily-2026-06-05 \
  --include-namespaces prod

velero restore describe restore-2026-06-05
kubectl get restores.velero.io -n velero
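
The `--ttl` values above are retention periods written as Go-style durations: `720h0m0s` is 30 days and `168h` is 7 days. A tiny sketch for deriving the value from a retention period in days; `ttl_from_days` is a hypothetical helper name.

```shell
# ttl_from_days DAYS -> duration string suitable for --ttl
ttl_from_days() { echo "$(( $1 * 24 ))h0m0s"; }

ttl_from_days 30    # prints 720h0m0s
ttl_from_days 7     # prints 168h0m0s
```

It can be used inline, for example `--ttl "$(ttl_from_days 30)"`.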

Capacity planning & control plane sizing

Cluster scale	Control plane (each)	etcd	Notes
< 100 nodes	2 CPU / 8 GB	2 CPU / 8 GB SSD	3 CP nodes for HA
100–500 nodes	4 CPU / 16 GB	4 CPU / 8 GB NVMe	Dedicated etcd disk <10ms latency
500–3000 nodes	8 CPU / 32 GB	8 CPU / 16 GB NVMe	Watch API 429 rate limits
3000+ nodes	16+ CPU / 64 GB	8 CPU / 32 GB NVMe	Consider etcd defrag schedule

Rule	Guidance
Node memory headroom	Sum of pod memory requests + 1–2 GB OS overhead per node
Cluster autoscaler buffer	10–20% spare capacity for burst scheduling
etcd quota	Default 2 GB; monitor `etcd_mvcc_db_total_size_in_bytes`
Max pods per node	Default 110; reduce if dense memory workloads
API server throttling	429 responses → increase priority/fairness or scale CP
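
The memory headroom and max-pods rules combine into a quick density estimate: pods that fit by memory requests after subtracting OS overhead, capped at the per-node pod limit. A rough sketch with all sizes in MiB; it ignores CPU, kubelet reservations and eviction thresholds, and `pods_per_node` is a hypothetical helper name.

```shell
# pods_per_node NODE_MIB OS_OVERHEAD_MIB POD_REQUEST_MIB [MAX_PODS]
pods_per_node() {
  local node_mib=$1 overhead_mib=$2 pod_request_mib=$3 max_pods=${4:-110}
  local fit=$(( (node_mib - overhead_mib) / pod_request_mib ))
  if [ "$fit" -gt "$max_pods" ]; then fit=$max_pods; fi
  echo "$fit"
}

pods_per_node 65536 2048 512     # 64 GiB node, 512Mi pods: prints 110 (pod cap)
pods_per_node 65536 2048 1024    # 64 GiB node, 1Gi pods: prints 62
```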

Production failure matrix

Symptom	Root causes	Fix commands
ImagePullBackOff	Wrong tag, private registry, missing pull secret	`describe pod`; create `docker-registry` secret
CrashLoopBackOff	App error, bad CMD, missing config, probe too aggressive	`logs --previous`; relax startupProbe
Pending	Insufficient CPU/mem, taints, affinity, unbound PVC	`describe pod`; `get nodes`; `get pvc`
OOMKilled	Memory limit too low	Raise limit; profile heap; VPA recommendation
Evicted	Node disk/memory pressure	Clean images; add nodes; fix disk usage
Service unreachable	Selector mismatch, empty endpoints, NetworkPolicy	`get endpoints`; test from netshoot pod
SCC violation (OCP)	Root user, missing capability, hostPath	`oc describe pod`; grant correct SCC to SA
etcd NOSPACE	Quota exceeded	Compact old revisions; defrag; `etcdctl alarm disarm`; raise quota if needed; restore from snapshot as a last resort

Multi-cluster & service mesh decisions

Pattern	Tool	When
Hub-spoke fleet management	Red Hat ACM + Hive	Policy, apps, upgrades across 10+ clusters
Hosted control planes	HyperShift / ROSA HCP	Fast cheap cluster provisioning
Declarative cluster lifecycle	Cluster API (CAPI)	GitOps-native cluster create/destroy
App delivery multi-cluster	ArgoCD ApplicationSet	Same app to clusters by label
mTLS + traffic management	Istio / OpenShift Service Mesh	Microservices observability + zero-trust
Lightweight mTLS	Linkerd	Simpler mesh, lower overhead

Disaster recovery — RTO/RPO targets

Strategy	RPO	RTO	Complexity
etcd snapshot only	Hours (backup interval)	Hours	Low
Velero + volume snapshots	Minutes–hours	Hours	Medium
Active-passive cluster	Near-zero (async replication)	Minutes	High
Active-active multi-region	Near-zero	Minutes	Very high

oc adm — platform administration

oc adm top nodes
oc adm top pods -A
oc adm groups sync --sync-config=ldap-sync.yaml --confirm
oc adm node-logs --role=master --path=openshift-apiserver/
oc adm must-gather                                    # support bundle
oc adm release info 4.16.8                          # release metadata
oc adm upgrade --allow-explicit-upgrade --to=4.16.8
oc adm pod-network join-projects --to=<project> --selector env=prod  # join pod networks (OpenShift SDN multitenant only)
oc adm policy scc-subject-review -z myapp-sa -n prod -f pod.yaml  # which SCC would admit this pod spec
oc get clusteroperators                             # platform health
oc get co -o custom-columns='NAME:.metadata.name,AVAILABLE:.status.conditions[?(@.type=="Available")].status'
oc get nodes -o wide
oc debug node/worker-1 -- chroot /host bash         # node shell (OCP 4)

GitOps & platform architecture checklist

#	Decision	Recommendation
1	Package manager	Helm for third-party; Kustomize for in-house apps
2	GitOps engine	ArgoCD (UI + ApplicationSet) or Flux (modular)
3	Secrets in git	Sealed Secrets, External Secrets, or SOPS — never plain
4	Ingress	Gateway API (future); Ingress + cert-manager today
5	Network policy	Default-deny + explicit allow per app
6	Pod security	PSA restricted (K8s) + SCC restricted-v2 (OCP)
7	Observability	Prometheus + Grafana + Loki + OTel (or OCP built-in monitoring)
8	Backup	etcd snapshot + Velero for namespace/PV DR
9	Multi-tenancy	Namespace isolation + RBAC + NetworkPolicy + quotas
10	Cluster upgrades	N-2 policy; PDB; blue-green cluster for zero-downtime migrations

🎯 Interview Tip

System design whiteboard: draw control plane (API → etcd), explain scheduler filter/score/bind, mention PDB + topology spread for HA, etcd backup for RPO, and GitOps for drift control. For OCP, add SCC + Routes + Operators.

⚖️ Trade-off

Retain reclaim policy protects database PVs but requires manual cleanup — automate with Velero TTL policies or a PV janitor CronJob.