Production Operations & Cluster Management
Running Kubernetes in production is less about writing YAML and more about keeping the control plane healthy, recovering from disasters, right-sizing spend, and debugging the seven pod states that wake you at 3 AM. This chapter covers upgrade cadence (K8s N-2 support), etcd and Velero DR, cost controls, interactive troubleshooting for common failures, and capacity planning for API server, etcd, and worker nodes—including OpenShift-specific oc adm workflows.
Cluster Upgrade Strategy
Kubernetes releases three minor versions per year. The upstream support window is N-2: if the latest is 1.31, 1.29 is the oldest still receiving patch fixes. Plan upgrades before you fall off that cliff—security CVEs do not wait for your change window.
Kubernetes N-2 support policy
- N — current minor (e.g. 1.31): full patch support
- N-1 — previous minor: patch support
- N-2 — oldest supported; end-of-life after next release ships
- Upgrade one minor per hop; skipping minors is unsupported for control plane components (1.28 → 1.30 requires 1.29 as an intermediate step)
- API removals follow deprecation warnings—run pluto detect against your manifests before upgrading
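The support window and the hop-by-hop path are simple arithmetic on minor numbers. The sketch below is only an illustration: the function names are hypothetical, and it assumes versions are compared by minor number alone (1.31 becomes 31).

```python
def supported_minors(latest_minor: int, window: int = 3) -> list[int]:
    """Minors still receiving patches: N, N-1, N-2 for the default window."""
    return [latest_minor - i for i in range(window)]


def upgrade_path(current_minor: int, target_minor: int) -> list[int]:
    """Minor versions to step through, one hop at a time."""
    if target_minor < current_minor:
        raise ValueError("downgrades are out of scope for this sketch")
    return list(range(current_minor + 1, target_minor + 1))


print(supported_minors(31))  # [31, 30, 29]
print(upgrade_path(28, 30))  # [29, 30]
```

A cluster whose minor is missing from `supported_minors(latest)` has already fallen out of the window, and the length of `upgrade_path` is the number of maintenance windows it takes to catch up.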
Control plane first, then workers
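The control plane must always be at least as new as the nodes: the API server is built to serve older kubelets, but a kubelet must never be newer than the API server it talks to. Standard order: etcd → API server / controllers → kubelet/kube-proxy on workers. During the upgrade the API server may run ahead of the kubelets within the documented version skew policy (up to three minors as of Kubernetes 1.28, two before that); never the reverse.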
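A pre-flight script can encode the skew rule as a single comparison. This is a minimal sketch with a hypothetical function name. The allowed lag is left as a required parameter because it depends on the Kubernetes release and on what your distribution supports.

```python
def kubelet_skew_ok(apiserver_minor: int, kubelet_minor: int, max_lag: int) -> bool:
    """True when the kubelet is no newer than the API server and lags by at most max_lag minors."""
    lag = apiserver_minor - kubelet_minor
    return 0 <= lag <= max_lag


# API server already on 1.31, workers still on 1.30: allowed with max_lag=1.
print(kubelet_skew_ok(31, 30, max_lag=1))  # True
# Kubelet ahead of the API server is never allowed.
print(kubelet_skew_ok(30, 31, max_lag=1))  # False
```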
flowchart LR
P0["Pre-flight\npluto + PDB + capacity"]
P1["etcd members\nrolling restart"]
P2["Control plane\nAPI / scheduler / CM"]
P3["Worker nodes\ncordon → drain → upgrade"]
P4["Post-verify\nAPI health + workloads"]
P0 --> P1 --> P2 --> P3 --> P4
OpenShift: oc adm upgrade — CVO + MCO
OpenShift automates upgrades via the Cluster Version Operator (CVO) for the control plane and the Machine Config Operator (MCO) for node OS + kubelet. You declare a target channel (stable-4.16); CVO orchestrates operators; MCO reboots nodes in sequence.
- oc adm upgrade — show available updates and history
- oc adm upgrade --to=4.16.12 — pin target release
- Monitor clusterversion and machineconfigpool Degraded conditions
- Paused pools (spec.paused: true) block worker upgrades—intentional for canary pools
In-place vs blue-green cluster upgrade
| Strategy | How it works | When to use |
|---|---|---|
| In-place | Upgrade existing control plane and roll workers one-by-one (cordon/drain) | Default for most clusters; lower cost; requires PDB headroom and maintenance window |
| Blue-green | Provision new cluster at target version; migrate workloads via GitOps / Velero / traffic cutover | Large version jumps, risky CRD migrations, or zero-downtime mandate when in-place skew is unsafe |
Pre-upgrade checklist
| Check | Tool / command | Pass criteria |
|---|---|---|
| Deprecated APIs | pluto detect-files -d manifests/ | Zero removed API versions in target release |
| PDB capacity | kubectl get pdb -A | Each PDB allows at least one disruption; no minAvailable: 100% blocking drains |
| Surge capacity | Cluster autoscaler max nodes + spare capacity | Can drain one node per pool without pending pods |
| etcd health | etcdctl endpoint health | All members healthy; DB size < 75% quota |
| Backup | etcd snapshot + Velero full backup | Restore drill completed in last 90 days |
| Addon compatibility | CSI, CNI, ingress, monitoring operator versions | Vendor matrix confirms target K8s/OCP version |
$ pluto detect-files -d k8s/ --target-versions k8s=v1.31.0
$ kubectl get pdb -A -o wide
$ kubectl cordon worker-3
$ kubectl drain worker-3 --ignore-daemonsets --delete-emptydir-data --grace-period=120
$ # upgrade kubelet on node, then:
$ kubectl uncordon worker-3
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion

$ oc adm upgrade
Cluster version is 4.15.20
Updates available: 4.15.21, 4.16.12
$ oc adm upgrade --to=4.16.12 --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | grep -A2 conditions
$ oc get mcp -o wide
NAME     CONFIG              UPDATED   UPDATING   DEGRADED
worker   rendered-worker-…   True      False      False
$ oc adm drain worker-3 --force --ignore-daemonsets --delete-emptydir-data --grace-period=120
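The PDB check from the checklist can be automated by reading `status.disruptionsAllowed` from the JSON that `kubectl get pdb -A -o json` returns. The sketch below works on an already-parsed dict. The function name and the sample objects (including `orders-db`) are made up for illustration.

```python
def blocking_pdbs(pdb_list: dict) -> list[str]:
    """Return namespace/name for every PDB that currently allows zero disruptions."""
    blocked = []
    for item in pdb_list.get("items", []):
        allowed = item.get("status", {}).get("disruptionsAllowed", 0)
        if allowed < 1:
            meta = item["metadata"]
            blocked.append(f'{meta["namespace"]}/{meta["name"]}')
    return blocked


# Hypothetical, trimmed-down sample of `kubectl get pdb -A -o json` output.
sample = {
    "items": [
        {"metadata": {"namespace": "team-payments", "name": "payments-api"},
         "status": {"disruptionsAllowed": 1}},
        {"metadata": {"namespace": "team-orders", "name": "orders-db"},
         "status": {"disruptionsAllowed": 0}},
    ]
}
print(blocking_pdbs(sample))  # ['team-orders/orders-db']
```

Any PDB in the result would make `kubectl drain` wait or time out on the node that hosts its pods, so it is worth resolving before the maintenance window starts.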
Draining control plane nodes on self-managed clusters requires etcd member replacement knowledge—never drain all masters simultaneously. On OpenShift, masters are cordoned automatically; do not manually drain them.
The CVO updates cluster operators in a defined order. If an OLM-managed operator is pinned to a CSV that is incompatible with the target release, the upgrade can be blocked or stall partway. Check oc get clusteroperators for the first Degraded component before blaming MCO.
In-place vs blue-green: In-place is cheaper and simpler but couples blast radius to one etcd. Blue-green doubles infra cost temporarily but gives instant rollback (flip DNS back) and clean API migrations.
Disaster Recovery
Kubernetes state lives in etcd (cluster objects) and external systems (PV data, container images, secrets in Vault). DR strategy must cover both: restore the API objects and reattach durable data. Define RPO/RTO before picking tools.
RTO vs RPO
| Metric | Definition | Typical target |
|---|---|---|
| RPO (Recovery Point Objective) | Maximum acceptable data loss window | etcd hourly snapshots → up to 1h object loss; Velero every 6h → workload config loss bounded by schedule |
| RTO (Recovery Time Objective) | Maximum acceptable downtime to restore service | Single-region etcd restore: 30–90 min; full cluster rebuild + GitOps: 2–4h; multi-region active-active: minutes (traffic shift) |
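An RPO only holds if the newest backup is younger than the objective. A monitoring job could check that with a comparison like the one below. This is a sketch with a hypothetical function name; it assumes you can already obtain the timestamp of the latest successful backup.

```python
from datetime import datetime, timedelta


def rpo_breached(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True when the newest backup is older than the recovery point objective."""
    return now - last_backup > rpo


now = datetime(2026, 6, 5, 4, 30)
# Hourly etcd snapshots with a 1h RPO: a snapshot from 03:00 is 90 minutes old.
print(rpo_breached(datetime(2026, 6, 5, 3, 0), now, timedelta(hours=1)))  # True
```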
etcd snapshot: save procedure
etcd snapshots capture a point-in-time copy of all Kubernetes objects. Run on a healthy member; store off-cluster (S3, GCS) with encryption at rest. Automate via CronJob on control plane or host cron.
$ ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
Snapshot saved at /backup/etcd-20260605-0300.db
$ etcdctl snapshot status /backup/etcd-20260605-0300.db --write-out=table
$ aws s3 cp /backup/etcd-20260605-0300.db s3://dr-bucket/etcd/

$ oc adm etcd-snapshot-backup --name=pre-upgrade-$(date +%Y%m%d)
Snapshot saved to /home/core/assets/backup/pre-upgrade-20260605
$ oc debug node/master-0 -- chroot /host /usr/local/bin/cluster-backup.sh /var/log/etcd-backup
$ # copy snapshot off-node before disaster strikes
$ scp core@master-0:/home/core/assets/backup/*.tar.gz dr-jump:/backups/
etcd snapshot: restore procedure
Restore is destructive—stop API server, restore snapshot into a fresh etcd data dir, rebuild member list if IPs changed, restart etcd then control plane. Practice on a staging cluster quarterly.
$ # stop kube-apiserver and etcd on all control plane nodes first
$ ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-20260605-0300.db \
    --data-dir=/var/lib/etcd-restored \
    --name=etcd-0 \
    --initial-cluster=etcd-0=https://10.0.0.5:2380 \
    --initial-advertise-peer-urls=https://10.0.0.5:2380 \
    --initial-cluster-token=etcd-cluster-1
$ # point etcd static pod at /var/lib/etcd-restored; start etcd → apiserver → controllers
$ kubectl get nodes   # nodes may show NotReady until kubelets re-register (expected)

$ # OCP etcd restore is documented per version; typically rebuild cluster from backup
$ oc adm etcd-recovery --help
$ # for full cluster loss: install new OCP + restore etcd snapshot per RH docs
Velero backup and restore
Velero backs up Kubernetes resources (and optionally PV snapshots via CSI) to object storage. Use for namespace-level DR, migration, and complementing etcd (which does not include PV bytes on its own).
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-full
  namespace: velero
spec:
  schedule: "0 2 * * *" # 02:00 UTC daily
  template:
    ttl: 720h0m0s # retain 30 days
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - velero
    includeClusterResources: true
    snapshotVolumes: true
    storageLocation: default
    volumeSnapshotLocations:
      - default
    hooks:
      resources:
        - name: pre-backup-hook
          includedNamespaces:
            - team-payments
          pre:
            - exec:
                container: payments-api
                command:
                  - /bin/sh
                  - -c
                  - "curl -sf localhost:8080/admin/flush-cache || true"
                onError: Continue
                timeout: 30s
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: pre-upgrade
  namespace: velero
spec:
  schedule: "@every 168h" # weekly; also trigger manually before upgrades
  template:
    ttl: 2160h0m0s # 90 days, long retention for rollback
    includedNamespaces:
      - "*"
    snapshotVolumes: true
    labelSelector:
      matchLabels:
        backup-tier: critical
$ velero backup create pre-upgrade-$(date +%Y%m%d) --include-namespaces '*' --snapshot-volumes
$ velero backup describe pre-upgrade-20260605 --details
$ velero schedule create nightly --schedule="0 2 * * *" --ttl 720h0m0s
$ velero restore create --from-backup pre-upgrade-20260605 --include-namespaces team-payments,team-orders
$ velero restore describe restore-20260605-143022

$ oc get backup -n openshift-adp
$ oc get backupstoragelocation -n openshift-adp
$ # OCP Data Protection uses OADP operator (Velero-based)
$ velero backup create ocp-critical --include-cluster-resources=true
Multi-region active-active complexity
Running the same workload in two regions simultaneously sounds like instant DR—but stateful tiers explode complexity:
- etcd is not multi-region — each cluster has its own control plane; no shared Kubernetes API across regions
- Data consistency — databases need active-active replication (CockroachDB, Aurora Global) or accept async lag
- Config drift — GitOps (Argo CD ApplicationSet) must keep both clusters identical; ACM policies enforce guardrails
- Traffic steering — global DNS/LB health checks; session affinity breaks without shared Redis
- Blast radius — a bad CRD upgrade takes down both if deployed simultaneously—use canary clusters or staged waves
etcd snapshots are consistent at the Raft log level—they include all API objects at one revision. They do not include container images in registry or bytes inside EBS volumes. Velero volume snapshots delegate to CSI/cloud APIs; restore requires matching StorageClass and AZ topology.
Most teams run active-passive for the data plane (primary DB + async replica) and active-active only for stateless tiers behind a global LB. Full active-active is reserved for read-heavy, partition-tolerant workloads—not general microservices.
Encrypt etcd snapshots and Velero backups at rest (SSE-KMS, CMEK). Snapshots contain all Secrets in plaintext unless encryption at rest is enabled on the API server (EncryptionConfiguration), which encrypts Secrets before they are written to etcd. Restrict S3 bucket IAM to break-glass roles only.
Cost Management
Kubernetes makes scaling easy—which makes overspending easy. Production cost control combines observability (who uses what), automation (right-size requests), and infrastructure choices (spot, pool topology).
Right-sizing with VPA
The Vertical Pod Autoscaler recommends or mutates CPU/memory requests based on historical usage. Start in updateMode: "Off" (recommendations only) before enabling auto mode on stateless Deployments.
- VPA conflicts with HPA on the same CPU metric—use VPA for memory, HPA for custom metrics, or separate workloads
- Eviction-based updates cause brief pod restarts—pair with PDB and surge capacity
- Check recommendations: kubectl describe vpa payments-api-vpa
Spot / preemptible nodes
Spot instances cut compute cost by 60–90% but can be reclaimed at short notice: two minutes on AWS, less on some other clouds. Use for:
- Fault-tolerant batch Jobs and CI runners
- Stateless web tiers with ≥3 replicas and PodDisruptionBudget
- Dedicated node pools with taints (spot=true:NoSchedule); never mix with StatefulSets without topology awareness
Cluster autoscaler
Scales node groups when pods are unschedulable due to insufficient CPU/memory. Tune:
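- --scale-down-utilization-threshold: default 0.5; nodes whose requested resources fall below this fraction become scale-down candidates, so lower it (not raise it) to avoid aggressive scale-down thrash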
- --skip-nodes-with-local-storage — protect nodes with emptyDir you care about
- Min/max node group sizes — cap blast radius and monthly bill ceiling
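The effect of the utilization threshold is easiest to see as a comparison. The sketch below is a simplification with a hypothetical function name: it assumes utilization is the larger of the CPU and memory request ratios, and it ignores the other conditions the autoscaler applies (whether the pods can move, PDBs, local storage, and so on).

```python
def is_scale_down_candidate(
    cpu_requested: float,
    cpu_allocatable: float,
    mem_requested: float,
    mem_allocatable: float,
    threshold: float = 0.5,
) -> bool:
    """True when requested/allocatable falls below the threshold for both CPU and memory."""
    utilization = max(cpu_requested / cpu_allocatable, mem_requested / mem_allocatable)
    return utilization < threshold


# 1 of 4 cores and 4 of 16 GiB requested: 25% utilized, so a candidate at 0.5.
print(is_scale_down_candidate(1.0, 4.0, 4.0, 16.0))  # True
# 3 of 4 cores requested: 75% utilized, kept at 0.5 but a candidate at 0.8.
print(is_scale_down_candidate(3.0, 4.0, 4.0, 16.0))  # False
print(is_scale_down_candidate(3.0, 4.0, 4.0, 16.0, threshold=0.8))  # True
```

The last two calls show the direction of the knob: a higher threshold makes more nodes eligible for removal, a lower one makes scale-down more conservative.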
Node pool strategy
| Pool | Instance type | Workload fit |
|---|---|---|
| baseline | On-demand, balanced (m6i.2xlarge) | Production APIs, databases, monitoring |
| burst | Spot, same arch as baseline | Horizontally scaled stateless services |
| compute | CPU-optimized (c6i) | Transcoding, ML inference batch |
| memory | Memory-optimized (r6i) | Caching, JVM heaps, in-memory analytics |
| system | Small on-demand, tainted | Ingress, cluster-autoscaler, monitoring agents only |
OpenShift cost management
- OpenShift Cost Management (based on the upstream Koku project): cost per namespace, label, node pool
- Chargeback reports via Prometheus metrics + cloud billing integration
- MachineSets map to AWS/GCP machine types—right-size per pool in MachineSet CR
- Cluster autoscaling via Cluster Autoscaler Operator on MachineAutoscaler CRs
- Optimize OCP footprint: reduce over-provisioned requests; use ResourceQuotas to cap namespace totals and LimitRanges to bound per-container requests and limits
$ kubectl top pods -A --sort-by=memory | head -20
$ kubectl get pods -A -o json | jq '[.items[] | {ns:.metadata.namespace, pod:.metadata.name, cpu:(.spec.containers|map(.resources.requests.cpu//"0")|join(","))}]'
$ kubectl describe vpa -n team-payments
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory,TAINTS:.spec.taints

$ oc adm top pods -A --sort-by=memory | head -20
$ oc get machineset -n openshift-machine-api
$ oc get machineautoscaler -n openshift-machine-api
$ # Cost Management UI: OpenShift Console → Observability → Cost Management
Set namespace ResourceQuota and LimitRange defaults so new Deployments cannot request 8 CPU by copy-paste. Example: default request 100m CPU / 128Mi per container, max 2 CPU / 4Gi per pod.
Compare kubectl top usage against requests weekly. Pods using 50m CPU but requesting 2 cores are burning scheduler slots and node capacity—VPA Off mode surfaces these without risk.
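The weekly usage-versus-request comparison can be scripted once CPU quantities are normalized to cores. This is a rough sketch: the function names and the 20% cutoff are assumptions, and it handles only the plain and millicore (`m`) forms that `kubectl top` and pod specs normally show.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "50m", "2" or "0.5" to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def overprovisioned(usage: str, request: str, min_ratio: float = 0.2) -> bool:
    """True when observed usage is below min_ratio of the requested CPU."""
    requested = parse_cpu(request)
    return requested > 0 and parse_cpu(usage) / requested < min_ratio


# The example from the text: 50m used against a 2-core request.
print(overprovisioned("50m", "2"))   # True
print(overprovisioned("900m", "1"))  # False
```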
Common Production Issues & Debugging
Pod status strings are symptoms, not diagnoses. Use a consistent triage order: get pods → describe → logs --previous → get events. The table below maps each status to its likely cause and fix; the subsections after it go deeper on each one.
Pod status troubleshooting matrix
| Status / signal | Likely cause | First commands | Fix direction |
|---|---|---|---|
| ImagePullBackOff / ErrImagePull | Wrong image tag, missing pull secret, registry auth expired, rate limit, private registry DNS | describe pod, get events | Fix image name; add imagePullSecrets; verify SA; test crictl pull on node |
| CrashLoopBackOff | App exits non-zero: config error, missing env/secret, failed probe, dependency down | logs --previous, describe pod | Fix app config; relax startup probe; check command/args; verify mounted secrets exist |
| Pending | Insufficient CPU/memory, no matching node selector/taints, PVC unbound, affinity rules, quota exceeded | describe pod (Events), get pvc | Right-size requests; add nodes; fix PVC/storage class; adjust affinity or taints/tolerations |
| OOMKilled (exit 137) | Container exceeded memory limit; JVM/Node heap too large for limit | describe pod, top pod | Raise memory limit + request; fix leak; use VPA; align JVM -Xmx to ~75% of limit |
| Evicted | Node DiskPressure, MemoryPressure, PIDPressure; pod exceeded ephemeral storage limit | describe node, get pods -A --field-selector status.phase=Failed | Free disk on node; prune images; raise ephemeral-storage limit; drain problematic node |
| Service not reachable | Wrong selector, pods not Ready, NetworkPolicy block, missing Endpoints, kube-proxy/CNI issue, wrong port | get endpoints, get netpol, port-forward test | Align Service selector with pod labels; fix readiness probe; open NetworkPolicy ingress |
| SCC violation (OpenShift) | Pod requests capabilities or UID forbidden by assigned Security Context Constraint | describe pod, get scc | Use restricted-v2 compatible securityContext; create custom SCC + RoleBinding if justified |
ImagePullBackOff
The kubelet cannot pull the container image. The backoff timer increases exponentially between retries. Check Events on the Pod first—the message distinguishes not found from unauthorized.
- Cause: Typo in image tag; image deleted from registry; imagePullSecrets missing or wrong; dockerconfig expired; registry throttling
- Fix: Verify image exists with docker pull or crictl pull; attach secret to Pod or ServiceAccount; on OCP use oc secrets link
:latest tags that worked yesterday may have been overwritten or removed from registry. Pin immutable digests or semver tags in production manifests.
CrashLoopBackOff
Container starts, crashes, restarts. Kubernetes backs off restart attempts. The answer is almost always in kubectl logs <pod> --previous—not the current (empty) log stream.
- Cause: Application panic/exception; wrong entrypoint; missing ConfigMap key; liveness probe killing slow-start JVM; file permission on mounted volume
- Fix: Fix root exception; add startupProbe with long failureThreshold; run kubectl debug with netshoot to test dependencies
CrashLoopBackOff is a kubelet-level status, not a container exit code. The container may show Terminated with Reason: Error and exit code 1—the BackOff is the delay before the next restart attempt.
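To get a feel for how long a crashing pod waits between attempts, the delay sequence can be generated directly. The sketch assumes the kubelet's documented defaults, which the text above does not spell out: a 10-second initial delay that doubles on each restart and is capped at five minutes. The function name is hypothetical.

```python
def crashloop_delays(restarts: int, base: int = 10, cap: int = 300) -> list[int]:
    """Seconds waited before each of the first `restarts` restart attempts."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


print(crashloop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

After the fifth restart the pod retries only every five minutes, which is why a fix can appear to do nothing for a while; deleting the pod resets the timer.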
Pending pods
Scheduler could not place the pod. The describe Events section names the exact filter that failed: 0/12 nodes available: 3 Insufficient memory, 2 node(s) didn't match Pod's node affinity.
- Cause: Resource requests exceed free capacity; PVC Pending; nodeSelector too narrow; taints without tolerations; ResourceQuota exceeded
- Fix: Lower requests or add nodes via cluster autoscaler; fix StorageClass/PVC; relax affinity; check kubectl describe quota
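When many pods are Pending, tallying the scheduler's reasons is faster than reading each event. The sketch below parses the message format quoted above. It is an assumption that real messages always follow this shape (newer schedulers append extra text such as preemption details), so treat it as a starting point.

```python
import re


def parse_unschedulable(message: str) -> dict[str, int]:
    """Map each scheduler filter reason to the number of nodes it rejected."""
    _, _, reasons = message.partition(": ")
    counts: dict[str, int] = {}
    for part in reasons.rstrip(".").split(", "):
        match = re.match(r"(\d+) (.+)", part)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


msg = "0/12 nodes available: 3 Insufficient memory, 2 node(s) didn't match Pod's node affinity."
print(parse_unschedulable(msg))
```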
OOMKilled
The Linux OOM killer terminated the container because it exceeded its cgroup memory limit. If no limit is set, the container is killed only when the node itself runs short of memory. Exit code 137 (128 + SIGKILL 9).
- Cause: Memory limit too low vs working set; memory leak; JVM heap larger than container limit; sidecar memory not accounted
- Fix: Increase limits; profile heap; set resources.requests.memory close to steady-state for scheduling; use VPA recommendations
Java services commonly OOMKill after deploy when -Xmx equals the limit—off-heap metaspace, thread stacks, and native buffers need headroom. Rule of thumb: -Xmx = 70–75% of container memory limit.
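Both pieces of arithmetic in this section, decoding the exit code and deriving a heap size from the limit, fit in a few lines. The sketch uses hypothetical function names, a deliberately small signal table, and the 75% rule of thumb from above as its default.

```python
# Only the signals most often seen in container exits; extend as needed.
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Explain a container exit code; codes above 128 mean 128 + signal number."""
    if code > 128:
        sig = code - 128
        return f"killed by {SIGNAL_NAMES.get(sig, f'signal {sig}')}"
    return "exited normally" if code == 0 else f"application error (exit {code})"


def suggested_xmx_mib(limit_mib: int, fraction: float = 0.75) -> int:
    """Heap size in MiB that leaves headroom for metaspace, stacks and native buffers."""
    return int(limit_mib * fraction)


print(describe_exit_code(137))   # killed by SIGKILL
print(suggested_xmx_mib(4096))   # 3072, i.e. -Xmx3072m for a 4Gi limit
```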
Evicted pods
kubelet evicts pods when a node hits DiskPressure, MemoryPressure, or PIDPressure to protect the node. Evicted pods remain visible with status.reason: Evicted.
- Cause: Node disk full (container logs, unused images); memory overcommit; too many processes; exceeded ephemeral-storage limit
- Fix: Prune images (crictl rmi --prune); rotate logs; add disk; set PriorityClass so critical pods evict last; delete evicted pod objects so controllers recreate
Service not reachable
Traffic to ClusterIP or Route fails while pods appear Running. Work through the path: Service → Endpoints → Pod readiness → NetworkPolicy → CNI/kube-proxy.
- Cause: Service selector mismatch (labels changed in Deployment template); readiness probe failing (pod Running but not Ready); NetworkPolicy denies ingress; targetPort ≠ containerPort; Istio mTLS STRICT without sidecar
- Fix: kubectl get endpoints svc/name—empty means selector mismatch or no Ready pods; port-forward to pod bypasses Service to isolate layer
Run kubectl run tmp --rm -it --image=nicolaka/netshoot -- curl -v http://svc.ns.svc:8080/health from the same namespace to test ClusterIP DNS and connectivity in one step.
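An empty Endpoints list usually comes down to label matching, which is easy to reproduce offline. The sketch below mirrors the equality-based matching a Service selector uses. The function name and the label values are made up for illustration.

```python
def selector_matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """True when every key/value in the Service selector is present on the pod."""
    # A Service without a selector does not select any pods on its own.
    return bool(selector) and all(pod_labels.get(k) == v for k, v in selector.items())


service_selector = {"app": "payments-api"}
# Labels after someone renamed the app label in the Deployment template.
pod_labels = {"app": "payments", "version": "v2"}
print(selector_matches(service_selector, pod_labels))  # False: Endpoints stay empty
```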
SCC violations (OpenShift)
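OpenShift admits each pod under a Security Context Constraint chosen from those its ServiceAccount is allowed to use. The default restricted-v2 forbids privileged mode, hostPath volumes, and running as root (UID 0) or any UID outside the namespace's allocated range.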
- Cause: runAsUser: 0; privileged: true; hostNetwork: true; capabilities.add: ["SYS_ADMIN"] not allowed by SCC
- Fix: Remove forbidden fields; use runAsNonRoot: true and allocated UID range; if truly needed, create minimal custom SCC and bind with oc adm policy add-scc-to-user
Pod creation failure with unable to validate against any security context constraint means no SCC matched. Check oc get scc and oc describe scc restricted-v2 for allowed volumes, UIDs, and capabilities.
$ kubectl get pods -n team-payments -o wide
$ kubectl describe pod payments-api-7d4f8b-abc12 -n team-payments
$ kubectl logs payments-api-7d4f8b-abc12 -n team-payments --previous --tail=100
$ kubectl get events -n team-payments --sort-by='.lastTimestamp' | tail -15
$ kubectl get endpoints payments-api -n team-payments
$ kubectl get netpol -n team-payments
$ kubectl top pod -n team-payments

$ oc get pods -n team-payments -o wide
$ oc describe pod payments-api-7d4f8b-abc12 -n team-payments
$ oc logs payments-api-7d4f8b-abc12 -n team-payments --previous
$ oc get events -n team-payments --sort-by='.lastTimestamp'
$ oc get scc | grep -E 'restricted|anyuid|privileged'
$ oc adm policy who-can use scc restricted-v2
$ oc get route payments-api -n team-payments -o yaml
Walk through ImagePullBackOff vs Pending vs CrashLoopBackOff without pausing: pull happens at kubelet (after schedule), Pending is scheduler, CrashLoop is post-start. Mention you always check Events and --previous logs first.
Capacity Planning
Under-provisioned control planes manifest as API latency and etcd timeouts; over-provisioned workers manifest as cloud invoices. Plan for headroom—especially during node drains, surge deploys, and etcd compaction.
Node sizing
Size worker nodes for your pod density and largest single-pod footprint. Too many small nodes increase kubelet/API overhead; too few large nodes increase blast radius per drain.
- Max pods per node: default 110—requires sufficient ENIs/IPs (AWS) or CNI tuning
- Reserve ~10% node CPU/memory for system daemons (kubelet, CNI, monitoring)
- Keep largest pod request < 50% node allocatable to allow bin-packing of smaller pods
Cluster autoscaler headroom (10–20%)
Maintain spare schedulable capacity so a single node drain or spot interruption does not leave pods Pending. Target 10–20% free allocatable CPU/memory across the worker pool at steady state. Set maxNodes high enough for peak traffic but cap with budgets.
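Whether a given headroom percentage actually absorbs a drain depends on how many nodes share the load. The sketch below checks this in aggregate for one resource. The function names are hypothetical, and it deliberately ignores bin-packing, affinity, and per-pod sizes, so a True result is necessary but not sufficient.

```python
def headroom_fraction(allocatable: list[float], requested: list[float]) -> float:
    """Share of total allocatable capacity that is not requested by any pod."""
    total = sum(allocatable)
    return (total - sum(requested)) / total


def drain_fits(allocatable: list[float], requested: list[float], node_index: int) -> bool:
    """True when one node's requests fit into the free capacity of the remaining nodes."""
    free_elsewhere = sum(
        a - r
        for i, (a, r) in enumerate(zip(allocatable, requested))
        if i != node_index
    )
    return requested[node_index] <= free_elsewhere


cores = [8.0, 8.0, 8.0, 8.0]
print(headroom_fraction(cores, [6.0] * 4), drain_fits(cores, [6.0] * 4, 0))  # 0.25 True
print(headroom_fraction(cores, [7.0] * 4), drain_fits(cores, [7.0] * 4, 0))  # 0.125 False
```

In the second case a four-node pool with 12.5% free capacity cannot absorb a single drain, which suggests treating 10–20% as a floor for larger pools and expecting small pools to need more, or to rely on the autoscaler adding a node first.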
etcd requirements
- Odd member count — 3 for most clusters; 5 for large/multi-AZ production
- Disk — low-latency SSD (NVMe); 8 GB minimum, scale with object count; keep DB < 75% of quota
- Memory — 8 GB baseline; 16–32 GB for >1000 nodes or heavy CRD churn
- Network — <10ms latency between members; never span etcd across regions
- Monitor: etcd_mvcc_db_total_size_in_bytes, etcd_server_has_leader
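The preference for odd member counts follows from quorum arithmetic, and the 75% rule is a single comparison against the configured quota. The sketch below uses hypothetical function names, and the 8 GiB quota in the example is an assumed value, not one taken from the text.

```python
def quorum(members: int) -> int:
    """Members that must agree for etcd to accept writes."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Members that can fail while the cluster keeps quorum."""
    return members - quorum(members)


def db_over_threshold(db_bytes: int, quota_bytes: int, threshold: float = 0.75) -> bool:
    """True when the etcd database has reached the given share of its quota."""
    return db_bytes >= threshold * quota_bytes


for n in (3, 4, 5):
    print(n, quorum(n), fault_tolerance(n))  # 3 2 1 / 4 3 1 / 5 3 2

GIB = 1024 ** 3
print(db_over_threshold(int(6.5 * GIB), 8 * GIB))  # True (assumed 8 GiB quota)
```

A fourth member raises the quorum without tolerating any more failures than three, which is why the count goes from 3 straight to 5.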
Control plane sizing table
| Cluster scale | Nodes | API server | etcd | Controller manager |
|---|---|---|---|---|
| Small | < 50 | 2–4 CPU, 8 GB RAM | 3 × (2 CPU, 8 GB), 20 GB SSD | 2 CPU, 4 GB |
| Medium | 50–250 | 4–8 CPU, 16 GB RAM | 3 × (4 CPU, 16 GB), 50 GB SSD | 4 CPU, 8 GB |
| Large | 250–1000 | 8+ CPU, 32 GB RAM, HA load balancer | 5 × (8 CPU, 32 GB), 100 GB NVMe | 8 CPU, 16 GB |
| Very large | 1000+ | Multiple API server instances behind LB; tune --max-requests-inflight | 5 × (16 CPU, 64 GB); dedicated nodes; defrag schedule | Shard controllers; consider KCM leader election tuning |
API server 429 throttling
When request rate exceeds --max-requests-inflight (default 400) or --max-mutating-requests-inflight (200), the API server returns HTTP 429 Too Many Requests. Clients should retry with exponential backoff.
- Symptoms: Controllers flap; kubectl timeouts; Prometheus scrape failures; CI pipelines fail apply
- Common causes: Runaway controller loops; excessive watch reconnects; unbounded kubectl get pods -A -w; missing API caching in operators
- Mitigations: Raise inflight limits cautiously; add API server replicas (HA); serve reads from the API server watch cache (resourceVersion=0) where slightly stale data is acceptable; paginate LIST calls; apply client-side rate limits; split workload across multiple clusters
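On the client side, handling 429 means retrying with exponential backoff and honoring a Retry-After hint when the server sends one. The sketch below is not tied to any client library: `send` is a hypothetical callable that returns `(status_code, retry_after_seconds_or_None)`, and the base delay, cap, and attempt count are arbitrary example values.

```python
import random
import time


def call_with_retry(send, max_attempts=5, base=0.5, cap=30.0,
                    sleep=time.sleep, rng=random.random):
    """Call send() until it returns a non-429 status or attempts run out."""
    for attempt in range(max_attempts):
        status, retry_after = send()
        if status != 429:
            return status
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Prefer the server's hint; otherwise use full-jitter exponential backoff.
        if retry_after is not None:
            delay = retry_after
        else:
            delay = rng() * min(cap, base * 2 ** attempt)
        sleep(delay)
    return 429


# Demo with canned responses and a recording sleep, so nothing actually waits.
responses = iter([(429, None), (429, 2.0), (200, None)])
waits = []
print(call_with_retry(lambda: next(responses), sleep=waits.append), waits)
```

Injecting `sleep` and `rng` keeps the function testable. The jitter matters in practice because many controllers retrying in lockstep recreate the burst that triggered the throttling.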
$ kubectl top nodes
$ kubectl describe nodes | grep -A5 "Allocated resources"
$ kubectl get --raw /metrics | grep apiserver_request_total | grep 429
$ kubectl get --raw /metrics | grep etcd_mvcc_db_total_size_in_bytes
$ # request count by verb; watch for LIST storms
$ kubectl get --raw /metrics | grep 'apiserver_request_total{verb="LIST"' | head

$ oc adm top nodes
$ oc get nodes -o json | jq '[.items[] | {name:.metadata.name, cpu:.status.allocatable.cpu, mem:.status.allocatable.memory, pods:.status.allocatable.pods}]'
$ oc get clusteroperator etcd -o yaml
$ # OCP monitoring: openshift-etcd-operator alerts on DB size and fsync latency
$ oc exec -n openshift-etcd etcd-master-0 -c etcdctl -- etcdctl endpoint status -w table
flowchart TB
subgraph pool["Worker pool allocatable"]
USED["Scheduled pods\n~80-90%"]
HEAD["Headroom\n10-20% free"]
DRAIN["One node drain\nfits in headroom"]
end
USED --> HEAD
HEAD --> DRAIN
CA["Cluster Autoscaler"] -->|"pods Pending"| HEAD
Large nodes vs many small nodes: Fewer larger nodes reduce per-node overhead but each drain/eviction displaces more pods. Many smaller nodes improve granularity but increase IP consumption and kubelet API chatter. Match node size to your P95 pod footprint plus headroom.