Production Operations & Cluster Management

Cluster Upgrade Strategy

Kubernetes releases three minor versions per year. The upstream support window is N-2: if the latest is 1.31, 1.29 is the oldest still receiving patch fixes. Plan upgrades before you fall off that cliff—security CVEs do not wait for your change window.

Kubernetes N-2 support policy

N — current minor (e.g. 1.31): full patch support
N-1 — previous minor: patch support
N-2 — oldest supported; end-of-life after next release ships
Upgrade one minor per hop; skipping minors is not supported for the control plane (1.28 → 1.30 requires a 1.29 intermediate step)
API removals follow deprecation warnings—run pluto detect against your manifests before upgrading
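
To make hop planning concrete, the helper below lists every minor release between a current and a target version. It is a minimal sketch: upgrade_path is a hypothetical name, and it assumes both versions are written as MAJOR.MINOR and share the same major.

```shell
# upgrade_path FROM TO
# Print each minor release to step through, one hop at a time.
# Assumes MAJOR.MINOR input (no patch component) and an unchanged major.
upgrade_path() {
  major=${1%%.*}
  minor=${1#*.}
  target_minor=${2#*.}
  path="$1"
  while [ "$minor" -lt "$target_minor" ]; do
    minor=$(( minor + 1 ))
    path="$path -> $major.$minor"
  done
  echo "$path"
}

upgrade_path 1.28 1.31   # prints: 1.28 -> 1.29 -> 1.30 -> 1.31
```

Each arrow is a full upgrade cycle (pre-flight, control plane, workers, verify), which is a useful way to estimate how many change windows a lagging cluster needs.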

Control plane first, then workers

A kubelet must never be newer than the API server it talks to, while an API server can still serve older kubelets, so the control plane always moves first. Standard order: etcd → API server / controllers → kubelet/kube-proxy on workers. During the upgrade the API server may run ahead of the kubelets within the documented skew policy (up to three minors since Kubernetes 1.28, two before that); never the reverse.
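
The skew rule can be expressed as a small check. This is a sketch under stated assumptions: skew_ok is a hypothetical helper, versions are passed as bare minor numbers, and the allowed gap is an explicit argument because it depends on the release you run (take it from the published skew policy).

```shell
# skew_ok API_MINOR KUBELET_MINOR MAX_SKEW
# Succeeds when the kubelet is not newer than the API server
# and is at most MAX_SKEW minors older.
skew_ok() {
  gap=$(( $1 - $2 ))
  [ "$gap" -ge 0 ] && [ "$gap" -le "$3" ]
}

skew_ok 31 30 1 && echo "API 1.31 / kubelet 1.30: ok"
skew_ok 30 31 1 || echo "API 1.30 / kubelet 1.31: kubelet is too new"
```

Running a check like this against every node before starting the worker phase catches nodes that were skipped in an earlier upgrade.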

flowchart LR
  P0["Pre-flight\npluto + PDB + capacity"]
  P1["etcd members\nrolling restart"]
  P2["Control plane\nAPI / scheduler / CM"]
  P3["Worker nodes\ncordon → drain → upgrade"]
  P4["Post-verify\nAPI health + workloads"]
  P0 --> P1 --> P2 --> P3 --> P4

OpenShift: oc adm upgrade — CVO + MCO

OpenShift automates upgrades via the Cluster Version Operator (CVO) for the control plane and the Machine Config Operator (MCO) for node OS + kubelet. You declare a target channel (stable-4.16); CVO orchestrates operators; MCO reboots nodes in sequence.

oc adm upgrade — show available updates and history
oc adm upgrade --to=4.16.12 — pin target release
Monitor clusterversion and machineconfigpool Degraded conditions
Paused pools (spec.paused: true) block worker upgrades—intentional for canary pools

In-place vs blue-green cluster upgrade

Strategy	How it works	When to use
In-place	Upgrade existing control plane and roll workers one-by-one (cordon/drain)	Default for most clusters; lower cost; requires PDB headroom and maintenance window
Blue-green	Provision new cluster at target version; migrate workloads via GitOps / Velero / traffic cutover	Large version jumps, risky CRD migrations, or zero-downtime mandate when in-place skew is unsafe

Pre-upgrade checklist

Check	Tool / command	Pass criteria
Deprecated APIs	pluto detect-files -d manifests/	Zero removed API versions in target release
PDB capacity	kubectl get pdb -A	Each PDB allows at least one disruption; no minAvailable: 100% blocking drains
Surge capacity	Cluster autoscaler max nodes + spare capacity	Can drain one node per pool without pending pods
etcd health	etcdctl endpoint health	All members healthy; DB size < 75% quota
Backup	etcd snapshot + Velero full backup	Restore drill completed in last 90 days
Addon compatibility	CSI, CNI, ingress, monitoring operator versions	Vendor matrix confirms target K8s/OCP version
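
The PDB row of the checklist is easy to automate. The sketch below filters the default table output of kubectl get pdb -A for budgets that currently allow zero disruptions; blocked_pdbs is a hypothetical helper, it assumes ALLOWED DISRUPTIONS is the fifth whitespace-separated field of each data row, and the sample rows are made up so the block runs without a cluster.

```shell
# blocked_pdbs
# Read `kubectl get pdb -A` table output on stdin and print namespace/name
# for every PDB whose ALLOWED DISRUPTIONS value (5th field) is 0.
blocked_pdbs() {
  awk 'NR > 1 && $5 == 0 { print $1 "/" $2 }'
}

# Against a live cluster: kubectl get pdb -A | blocked_pdbs
# Hypothetical sample input:
blocked_pdbs <<'EOF'
NAMESPACE       NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
team-payments   payments-api   2               N/A               1                     40d
team-orders     orders-db      100%            N/A               0                     12d
EOF
# prints: team-orders/orders-db
```

Any name this prints will block kubectl drain on the node hosting its pods, so resolve those before opening the maintenance window.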

$ pluto detect-files -d k8s/ --target-versions k8s=v1.31.0
$ kubectl get pdb -A -o wide
$ kubectl cordon worker-3
$ kubectl drain worker-3 --ignore-daemonsets --delete-emptydir-data --grace-period=120
$ # upgrade kubelet on node, then:
$ kubectl uncordon worker-3
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion

$ oc adm upgrade
Cluster version is 4.15.20
Updates available: 4.15.21, 4.16.12
$ oc adm upgrade --to=4.16.12 --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | grep -A2 conditions
$ oc get mcp -o wide
NAME     CONFIG   UPDATED   UPDATING   DEGRADED
worker   rendered-worker-…   True   False   False
$ oc adm drain worker-3 --ignore-daemonsets --force --delete-emptydir-data --grace-period=120

⚠️ Pitfall

Draining control plane nodes on self-managed clusters requires etcd member replacement knowledge—never drain all masters simultaneously. On OpenShift, masters are cordoned automatically; do not manually drain them.

🔴 OpenShift

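CVO rolls out the cluster operators in a defined order. OLM-managed operators are handled separately: one whose installed CSV is incompatible with the target release can block a minor upgrade by reporting Upgradeable=False. If an upgrade stalls partway, check oc get clusteroperators for the first Degraded component before blaming MCO.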

⚖️ Trade-off

In-place vs blue-green: In-place is cheaper and simpler but couples blast radius to one etcd. Blue-green doubles infra cost temporarily but gives instant rollback (flip DNS back) and clean API migrations.

Disaster Recovery

Kubernetes state lives in etcd (cluster objects) and external systems (PV data, container images, secrets in Vault). DR strategy must cover both: restore the API objects and reattach durable data. Define RPO/RTO before picking tools.

RTO vs RPO

Metric	Definition	Typical target
RPO (Recovery Point Objective)	Maximum acceptable data loss window	etcd hourly snapshots → up to 1h object loss; Velero every 6h → workload config loss bounded by schedule
RTO (Recovery Time Objective)	Maximum acceptable downtime to restore service	Single-region etcd restore: 30–90 min; full cluster rebuild + GitOps: 2–4h; multi-region active-active: minutes (traffic shift)

etcd snapshot: save procedure

etcd snapshots capture a point-in-time copy of all Kubernetes objects. Run on a healthy member; store off-cluster (S3, GCS) with encryption at rest. Automate via CronJob on control plane or host cron.

$ ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
Snapshot saved at /backup/etcd-20260605-0300.db
$ etcdctl snapshot status /backup/etcd-20260605-0300.db --write-out=table
$ aws s3 cp /backup/etcd-20260605-0300.db s3://dr-bucket/etcd/

$ # OpenShift: run the bundled backup script on one control plane node
$ oc debug node/master-0 -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup
$ # copy snapshot off-node before disaster strikes
$ scp core@master-0:/home/core/assets/backup/* dr-jump:/backups/
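
A snapshot schedule only bounds RPO while it keeps succeeding, so it helps to alert when the newest snapshot is older than the schedule interval. The check below is a sketch: snapshot_fresh is a hypothetical helper, it assumes snapshots follow the etcd-*.db naming used above, and it relies on find supporting -mmin (GNU and BSD find do). The demo uses a throwaway directory rather than a real backup volume.

```shell
# snapshot_fresh DIR MAX_AGE_MINUTES
# Succeeds when DIR holds at least one etcd-*.db file modified
# within the last MAX_AGE_MINUTES minutes.
snapshot_fresh() {
  find "$1" -name 'etcd-*.db' -mmin "-$2" | grep -q .
}

demo_dir=$(mktemp -d)
snapshot_fresh "$demo_dir" 60 || echo "no snapshot in the last 60 minutes"
touch "$demo_dir/etcd-20260605-0300.db"
snapshot_fresh "$demo_dir" 60 && echo "fresh snapshot found"
rm -rf "$demo_dir"
```

With hourly snapshots, a 60 minute window matches the 1h RPO target; wire the failing branch into whatever alerting the cluster already uses.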

etcd snapshot: restore procedure

Restore is destructive—stop API server, restore snapshot into a fresh etcd data dir, rebuild member list if IPs changed, restart etcd then control plane. Practice on a staging cluster quarterly.

$ # stop kube-apiserver and etcd on all control plane nodes first
$ ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-20260605-0300.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.0.5:2380 \
  --initial-advertise-peer-urls=https://10.0.0.5:2380 \
  --initial-cluster-token=etcd-cluster-1
$ # point etcd static pod at /var/lib/etcd-restored; start etcd → apiserver → controllers
$ kubectl get nodes
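# nodes may show NotReady until kubelets re-register (expected)

$ # OCP etcd restore is documented per version; typically rebuild cluster from backup
$ # on the recovery control plane host, following the Red Hat procedure for your release:
$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/assets/backup
$ # for full cluster loss: install new OCP + restore etcd snapshot per RH docs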

Velero backup and restore

Velero backs up Kubernetes resources (and optionally PV snapshots via CSI) to object storage. Use for namespace-level DR, migration, and complementing etcd (which does not include PV bytes on its own).

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-full
  namespace: velero
spec:
  schedule: "0 2 * * *"          # 02:00 UTC daily
  template:
    ttl: 720h0m0s                # retain 30 days
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - velero
    includeClusterResources: true
    snapshotVolumes: true
    storageLocation: default
    volumeSnapshotLocations:
      - default
    hooks:
      resources:
        - name: pre-backup-hook
          includedNamespaces:
            - team-payments
          pre:
            - exec:
                container: payments-api
                command:
                  - /bin/sh
                  - -c
                  - "curl -sf localhost:8080/admin/flush-cache || true"
                onError: Continue
                timeout: 30s
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: pre-upgrade
  namespace: velero
spec:
  schedule: "@every 168h"        # weekly; also trigger manually before upgrades
  template:
    ttl: 2160h0m0s              # 90 days — long retention for rollback
    includedNamespaces:
      - "*"
    snapshotVolumes: true
    labelSelector:
      matchLabels:
        backup-tier: critical

$ velero backup create pre-upgrade-$(date +%Y%m%d) --include-namespaces '*' --snapshot-volumes
$ velero backup describe pre-upgrade-20260605 --details
$ velero schedule create nightly --schedule="0 2 * * *" --ttl 720h0m0s
$ velero restore create restore-20260605-143022 --from-backup pre-upgrade-20260605 --include-namespaces team-payments,team-orders
$ velero restore describe restore-20260605-143022

$ oc get backup -n openshift-adp
$ oc get backupstoragelocation -n openshift-adp
$ # OCP Data Protection uses OADP operator (Velero-based)
$ velero backup create ocp-critical --include-cluster-resources=true

Multi-region active-active complexity

Running the same workload in two regions simultaneously sounds like instant DR—but stateful tiers explode complexity:

etcd is not multi-region — each cluster has its own control plane; no shared Kubernetes API across regions
Data consistency — databases need active-active replication (CockroachDB, Aurora Global) or accept async lag
Config drift — GitOps (Argo CD ApplicationSet) must keep both clusters identical; ACM policies enforce guardrails
Traffic steering — global DNS/LB health checks; session affinity breaks without shared Redis
Blast radius — a bad CRD upgrade takes down both if deployed simultaneously—use canary clusters or staged waves

🔬 Under the Hood

etcd snapshots are consistent at the Raft log level—they include all API objects at one revision. They do not include container images in registry or bytes inside EBS volumes. Velero volume snapshots delegate to CSI/cloud APIs; restore requires matching StorageClass and AZ topology.

📦 Real World

Most teams run active-passive for the data plane (primary DB + async replica) and active-active only for stateless tiers behind a global LB. Full active-active is reserved for read-heavy, partition-tolerant workloads—not general microservices.

🔒 Security

Encrypt etcd snapshots and Velero backups at rest (SSE-KMS, CMEK). Snapshots contain every Secret unencrypted (only base64-encoded) unless API server encryption at rest (an EncryptionConfiguration) is enabled. Restrict S3 bucket IAM to break-glass roles only.

Cost Management

Kubernetes makes scaling easy—which makes overspending easy. Production cost control combines observability (who uses what), automation (right-size requests), and infrastructure choices (spot, pool topology).

Right-sizing with VPA

The Vertical Pod Autoscaler recommends or mutates CPU/memory requests based on historical usage. Start in updateMode: "Off" (recommendations only) before enabling auto mode on stateless Deployments.

VPA conflicts with HPA on the same CPU metric—use VPA for memory, HPA for custom metrics, or separate workloads
Eviction-based updates cause brief pod restarts—pair with PDB and surge capacity
Check recommendations: kubectl describe vpa payments-api-vpa

Spot / preemptible nodes

Spot instances cut compute cost 60–90% but can be reclaimed on short notice: two minutes on AWS, as little as 30 seconds on GCP and Azure. Use for:

Fault-tolerant batch Jobs and CI runners
Stateless web tiers with ≥3 replicas and PodDisruptionBudget
Dedicated node pools with taints (spot=true:NoSchedule); never mix with StatefulSets without topology awareness

Cluster autoscaler

Scales node groups when pods are unschedulable due to insufficient CPU/memory. Tune:

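--scale-down-utilization-threshold: default 0.5; a node whose requested share of allocatable falls below this value becomes a scale-down candidate, so lower it to make scale-down less aggressive and reduce thrash
--skip-nodes-with-local-storage: when true (the default), nodes running pods with local storage such as emptyDir are not removed
Min/max node group sizes: cap blast radius and monthly bill ceiling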
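
How the utilization threshold is evaluated can be sketched with integer arithmetic. This is a simplification with a hypothetical helper name: it looks at CPU requests only, whereas the real autoscaler also considers memory and whether the node's pods can be rescheduled elsewhere.

```shell
# scale_down_candidate REQUESTED_MILLICPU ALLOCATABLE_MILLICPU THRESHOLD_PERCENT
# Succeeds when requested CPU, as a share of allocatable, is below the threshold.
scale_down_candidate() {
  utilization=$(( $1 * 100 / $2 ))
  [ "$utilization" -lt "$3" ]
}

# A node with 3200m requested out of 8000m allocatable sits at 40% utilization.
scale_down_candidate 3200 8000 50 && echo "threshold 0.5: candidate for removal"
scale_down_candidate 3200 8000 30 || echo "threshold 0.3: node is kept"
```

The same node flips from removable to kept purely because of the threshold, which is why this flag has such a direct effect on node churn.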

Node pool strategy

Pool	Instance type	Workload fit
baseline	On-demand, balanced (m6i.2xlarge)	Production APIs, databases, monitoring
burst	Spot, same arch as baseline	Horizontally scaled stateless services
compute	CPU-optimized (c6i)	Transcoding, ML inference batch
memory	Memory-optimized (r6i)	Caching, JVM heaps, in-memory analytics
system	Small on-demand, tainted	Ingress, cluster-autoscaler, monitoring agents only

OpenShift cost management

OpenShift Cost Management (based on the upstream Koku project): cost per namespace, label, node pool
Chargeback reports via Prometheus metrics + cloud billing integration
MachineSets map to AWS/GCP machine types—right-size per pool in MachineSet CR
Cluster autoscaling via Cluster Autoscaler Operator on MachineAutoscaler CRs
Optimize OCP footprint: reduce over-provisioned requests; use ResourceQuotas to cap namespace totals and LimitRanges to bound per-container requests and limits

$ kubectl top pods -A --sort-by=memory | head -20
$ kubectl get pods -A -o json | jq '[.items[] | {ns:.metadata.namespace, pod:.metadata.name, cpu:(.spec.containers|map(.resources.requests.cpu//"0")|join(","))}]'
$ kubectl describe vpa -n team-payments
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory,TAINTS:.spec.taints

$ oc adm top pods -A --sort-by=memory | head -20
$ oc get machineset -n openshift-machine-api
$ oc get machineautoscaler -n openshift-machine-api
$ # Cost Management UI: OpenShift Console → Observability → Cost Management

⚙️ Config

Set namespace ResourceQuota and LimitRange defaults so new Deployments cannot request 8 CPU by copy-paste. Example: default request 100m CPU / 128Mi per container, max 2 CPU / 4Gi per pod.

💡 Pro Tip

Compare kubectl top usage against requests weekly. Pods using 50m CPU but requesting 2 cores are burning scheduler slots and node capacity—VPA Off mode surfaces these without risk.
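
The weekly comparison can be scripted once usage and requests are in one place. The sketch below assumes a hypothetical three-column input (pod name, used millicores, requested millicores) that you would build by joining kubectl top output with the pod specs; over_requested is an illustrative name, not an existing tool.

```shell
# over_requested MAX_PERCENT
# Read "NAME USED_MILLICPU REQUESTED_MILLICPU" lines on stdin and report
# pods whose usage is below MAX_PERCENT of their CPU request.
over_requested() {
  awk -v max="$1" '$3 > 0 && ($2 * 100 / $3) < max {
    printf "%s uses %d%% of its CPU request\n", $1, $2 * 100 / $3
  }'
}

# Hypothetical sample data:
over_requested 20 <<'EOF'
payments-api 50 2000
orders-api 400 500
EOF
# prints: payments-api uses 2% of its CPU request
```

A single sample is noisy; feed it averages or percentiles over a week before lowering any request.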

Common Production Issues & Debugging

Pod status strings are symptoms, not diagnoses. Use a consistent triage order: get pods → describe → logs --previous → get events. The table below maps each status to its likely cause and fix; the subsections after it go deeper on each one.

Pod status troubleshooting matrix

Status / signal	Likely cause	First commands	Fix direction
ImagePullBackOff / ErrImagePull	Wrong image tag, missing pull secret, registry auth expired, rate limit, private registry DNS	describe pod, get events	Fix image name; add imagePullSecrets; verify SA; test crictl pull on node
CrashLoopBackOff	App exits non-zero: config error, missing env/secret, failed probe, dependency down	logs --previous, describe pod	Fix app config; relax startup probe; check command/args; verify mounted secrets exist
Pending	Insufficient CPU/memory, no matching node selector/taints, PVC unbound, affinity rules, quota exceeded	describe pod (Events), get pvc	Right-size requests; add nodes; fix PVC/storage class; adjust affinity or taints/tolerations
OOMKilled (exit 137)	Container exceeded memory limit; JVM/Node heap too large for limit	describe pod, top pod	Raise memory limit + request; fix leak; use VPA; align JVM -Xmx to ~75% of limit
Evicted	Node DiskPressure, MemoryPressure, PIDPressure; pod exceeded ephemeral storage limit	describe node, get pods -A --field-selector status.phase=Failed	Free disk on node; prune images; raise ephemeral-storage limit; drain problematic node
Service not reachable	Wrong selector, pods not Ready, NetworkPolicy block, missing Endpoints, kube-proxy/CNI issue, wrong port	get endpoints, get netpol, port-forward test	Align Service selector with pod labels; fix readiness probe; open NetworkPolicy ingress
SCC violation (OpenShift)	Pod requests capabilities or UID forbidden by assigned Security Context Constraint	describe pod, get scc	Use restricted-v2 compatible securityContext; create custom SCC + RoleBinding if justified

ImagePullBackOff

The kubelet cannot pull the container image. The backoff timer increases exponentially between retries. Check Events on the Pod first—the message distinguishes not found from unauthorized.

Cause: Typo in image tag; image deleted from registry; imagePullSecrets missing or wrong; dockerconfig expired; registry throttling
Fix: Verify image exists with docker pull or crictl pull; attach secret to Pod or ServiceAccount; on OCP use oc secrets link

⚠️ Pitfall

:latest tags that worked yesterday may have been overwritten or removed from registry. Pin immutable digests or semver tags in production manifests.

CrashLoopBackOff

Container starts, crashes, restarts. Kubernetes backs off restart attempts. The answer is almost always in kubectl logs <pod> --previous—not the current (empty) log stream.

Cause: Application panic/exception; wrong entrypoint; missing ConfigMap key; liveness probe killing slow-start JVM; file permission on mounted volume
Fix: Fix root exception; add startupProbe with long failureThreshold; run kubectl debug with netshoot to test dependencies

🔬 Under the Hood

CrashLoopBackOff is a kubelet-level status, not a container exit code. The container may show Terminated with Reason: Error and exit code 1—the BackOff is the delay before the next restart attempt.
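
To see why a crash-looping pod appears to sit idle for minutes, it helps to print the delay schedule. The sketch below assumes the kubelet's documented behaviour of starting at 10 seconds, doubling after each restart, and capping at five minutes; backoff_delays is a hypothetical helper and the values are illustrative rather than read from a cluster.

```shell
# backoff_delays RESTARTS
# Print the wait, in seconds, before each of the first RESTARTS restarts:
# start at 10s, double each time, cap at 300s.
backoff_delays() {
  delay=10
  count=0
  delays=""
  while [ "$count" -lt "$1" ]; do
    delays="$delays$delay "
    delay=$(( delay * 2 ))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    count=$(( count + 1 ))
  done
  echo "${delays% }"
}

backoff_delays 7   # prints: 10 20 40 80 160 300 300
```

After a handful of crashes each retry is five minutes apart, so fixing the cause and deleting the pod is usually faster than waiting for the next attempt.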

Pending pods

Scheduler could not place the pod. The describe Events section names the exact filter that failed: 0/12 nodes available: 3 Insufficient memory, 2 node(s) didn't match Pod's node affinity.

Cause: Resource requests exceed free capacity; PVC Pending; nodeSelector too narrow; taints without tolerations; ResourceQuota exceeded
Fix: Lower requests or add nodes via cluster autoscaler; fix StorageClass/PVC; relax affinity; check kubectl describe quota

OOMKilled

The Linux OOM killer terminated the container because it exceeded its cgroup memory limit. A container with no limit is only killed this way when the node itself runs out of memory. Exit code 137 = 128 + 9 (SIGKILL).

Cause: Memory limit too low vs working set; memory leak; JVM heap larger than container limit; sidecar memory not accounted
Fix: Increase limits; profile heap; set resources.requests.memory close to steady-state for scheduling; use VPA recommendations
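
Exit codes above 128 encode the signal that ended the process, which is where 137 comes from. The helper below is a small sketch (describe_exit is a hypothetical name) that decodes a code taken from the Last State section of kubectl describe pod.

```shell
# describe_exit CODE
# Explain a container exit code: values above 128 mean "killed by signal CODE-128".
describe_exit() {
  if [ "$1" -gt 128 ]; then
    echo "terminated by signal $(( $1 - 128 ))"
  else
    echo "exited on its own with status $1"
  fi
}

describe_exit 137   # prints: terminated by signal 9
describe_exit 143   # prints: terminated by signal 15
describe_exit 1     # prints: exited on its own with status 1
```

Signal 9 alone does not prove an OOM kill; confirm that the container's Reason field reads OOMKilled before raising limits.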

📦 Real World

Java services commonly OOMKill after deploy when -Xmx equals the limit—off-heap metaspace, thread stacks, and native buffers need headroom. Rule of thumb: -Xmx = 70–75% of container memory limit.
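
The rule of thumb translates into a one-line calculation. This is a sketch with a hypothetical helper name; it takes the container memory limit in MiB and an optional percentage, defaulting to 75.

```shell
# jvm_xmx_mib LIMIT_MIB [PERCENT]
# Suggest a heap size in MiB as a percentage of the container memory limit.
jvm_xmx_mib() {
  echo $(( $1 * ${2:-75} / 100 ))
}

echo "-Xmx$(jvm_xmx_mib 2048)m"      # prints: -Xmx1536m
echo "-Xmx$(jvm_xmx_mib 2048 70)m"   # prints: -Xmx1433m
```

Services with many threads or large direct buffers may need the lower end of the range, so treat the result as a starting point to validate under load.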

Evicted pods

kubelet evicts pods when a node hits DiskPressure, MemoryPressure, or PIDPressure to protect the node. Evicted pods remain visible with status.reason: Evicted.

Cause: Node disk full (container logs, unused images); memory overcommit; too many processes; exceeded ephemeral-storage limit
Fix: Prune images (crictl rmi --prune); rotate logs; add disk; set PriorityClass so critical pods evict last; delete evicted pod objects so controllers recreate

Service not reachable

Traffic to ClusterIP or Route fails while pods appear Running. Work through the path: Service → Endpoints → Pod readiness → NetworkPolicy → CNI/kube-proxy.

Cause: Service selector mismatch (labels changed in Deployment template); readiness probe failing (pod Running but not Ready); NetworkPolicy denies ingress; targetPort ≠ containerPort; Istio mTLS STRICT without sidecar
Fix: kubectl get endpoints <service-name>: an empty ENDPOINTS column means a selector mismatch or no Ready pods; port-forward straight to a pod bypasses the Service and isolates the failing layer

💡 Pro Tip

Run kubectl run tmp --rm -it --image=nicolaka/netshoot -- curl -v http://svc.ns.svc:8080/health from the same namespace to test ClusterIP DNS and connectivity in one step.

SCC violations (OpenShift)

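OpenShift admits each pod under a Security Context Constraint chosen from those its ServiceAccount (or the requesting user) is allowed to use. The default restricted-v2 forbids privileged mode, hostPath volumes, and running as root (UID 0) or any UID outside the namespace's allocated range.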

Cause: runAsUser: 0; privileged: true; hostNetwork: true; capabilities.add: ["SYS_ADMIN"] not allowed by SCC
Fix: Remove forbidden fields; use runAsNonRoot: true and allocated UID range; if truly needed, create minimal custom SCC and bind with oc adm policy add-scc-to-user

🔴 OpenShift

Pod creation failure with unable to validate against any security context constraint means no SCC matched. Check oc get scc and oc describe scc restricted-v2 for allowed volumes, UIDs, and capabilities.

$ kubectl get pods -n team-payments -o wide
$ kubectl describe pod payments-api-7d4f8b-abc12 -n team-payments
$ kubectl logs payments-api-7d4f8b-abc12 -n team-payments --previous --tail=100
$ kubectl get events -n team-payments --sort-by='.lastTimestamp' | tail -15
$ kubectl get endpoints payments-api -n team-payments
$ kubectl get netpol -n team-payments
$ kubectl top pod -n team-payments

$ oc get pods -n team-payments -o wide
$ oc describe pod payments-api-7d4f8b-abc12 -n team-payments
$ oc logs payments-api-7d4f8b-abc12 -n team-payments --previous
$ oc get events -n team-payments --sort-by='.lastTimestamp'
$ oc get scc | grep -E 'restricted|anyuid|privileged'
$ oc adm policy who-can use scc restricted-v2
$ oc get route payments-api -n team-payments -o yaml

🎯 Interview Tip

Walk through ImagePullBackOff vs Pending vs CrashLoopBackOff without pausing: pull happens at kubelet (after schedule), Pending is scheduler, CrashLoop is post-start. Mention you always check Events and --previous logs first.

Capacity Planning

Under-provisioned control planes manifest as API latency and etcd timeouts; over-provisioned workers manifest as cloud invoices. Plan for headroom—especially during node drains, surge deploys, and etcd compaction.

Node sizing

Size worker nodes for your pod density and largest single-pod footprint. Too many small nodes increase kubelet/API overhead; too few large nodes increase blast radius per drain.

Max pods per node: default 110—requires sufficient ENIs/IPs (AWS) or CNI tuning
Reserve ~10% node CPU/memory for system daemons (kubelet, CNI, monitoring)
Keep largest pod request < 50% node allocatable to allow bin-packing of smaller pods

Cluster autoscaler headroom (10–20%)

Maintain spare schedulable capacity so a single node drain or spot interruption does not leave pods Pending. Target 10–20% free allocatable CPU/memory across the worker pool at steady state. Set maxNodes high enough for peak traffic but cap with budgets.
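
The headroom target is easy to check with pool-level totals. The sketch below uses hypothetical helper names and CPU millicores only, and it ignores bin-packing fragmentation, so a passing result is necessary rather than sufficient.

```shell
# headroom_percent REQUESTED ALLOCATABLE
# Free share of the pool's allocatable capacity, in percent.
headroom_percent() {
  echo $(( ($2 - $1) * 100 / $2 ))
}

# drain_fits REQUESTED ALLOCATABLE NODE_ALLOCATABLE
# Succeeds when all current requests still fit after one node's capacity is removed.
drain_fits() {
  [ "$1" -le $(( $2 - $3 )) ]
}

# Ten workers with 8000m allocatable each, 68000m requested in total.
headroom_percent 68000 80000                      # prints: 15
drain_fits 68000 80000 8000 && echo "one node drain fits"
drain_fits 76000 80000 8000 || echo "at 5% headroom a drain leaves pods Pending"
```

Repeat the same calculation for memory; whichever resource is tighter sets the real headroom.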

etcd requirements

Odd member count — 3 for most clusters; 5 for large/multi-AZ production
Disk — low-latency SSD (NVMe); 8 GB minimum, scale with object count; keep DB < 75% of quota
Memory — 8 GB baseline; 16–32 GB for >1000 nodes or heavy CRD churn
Network — <10ms latency between members; never span etcd across regions
Monitor: etcd_mvcc_db_total_size_in_bytes, etcd_server_has_leader

Control plane sizing table

Cluster scale	Nodes	API server	etcd	Controller manager
Small	< 50	2–4 CPU, 8 GB RAM	3 × (2 CPU, 8 GB), 20 GB SSD	2 CPU, 4 GB
Medium	50–250	4–8 CPU, 16 GB RAM	3 × (4 CPU, 16 GB), 50 GB SSD	4 CPU, 8 GB
Large	250–1000	8+ CPU, 32 GB RAM, HA load balancer	5 × (8 CPU, 32 GB), 100 GB NVMe	8 CPU, 16 GB
Very large	1000+	Multiple API server instances behind LB; tune --max-requests-inflight	5 × (16 CPU, 64 GB); dedicated nodes; defrag schedule	Shard controllers; consider KCM leader election tuning

API server 429 throttling

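When the number of concurrent in-flight requests exceeds --max-requests-inflight (default 400) for read-only requests or --max-mutating-requests-inflight (default 200) for mutating ones, the API server returns HTTP 429 Too Many Requests. Clients should honor the Retry-After header and retry with exponential backoff.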

Symptoms: Controllers flap; kubectl timeouts; Prometheus scrape failures; CI pipelines fail apply
Common causes: Runaway controller loops; excessive watch reconnects; unbounded kubectl get pods -A -w; missing API caching in operators
Mitigations: Raise inflight limits cautiously; add API server replicas (HA); use resourceVersion caching; paginate LIST calls; apply client-side rate limits; split workload across multiple clusters
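
For scripts and CI jobs that call the API, client-side backoff can be added with a small wrapper. This is a minimal sketch: retry_backoff is a hypothetical helper, it retries on any non-zero exit rather than on 429 specifically, and it omits jitter, which a production version should add so that many clients do not retry in lockstep.

```shell
# retry_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds, doubling the delay after every failure.
# Returns 1 once MAX_ATTEMPTS attempts have failed.
retry_backoff() {
  rb_max=$1
  rb_delay=$2
  shift 2
  rb_attempt=1
  while ! "$@"; do
    if [ "$rb_attempt" -ge "$rb_max" ]; then
      echo "giving up after $rb_attempt attempts" >&2
      return 1
    fi
    echo "attempt $rb_attempt failed; retrying in ${rb_delay}s" >&2
    sleep "$rb_delay"
    rb_delay=$(( rb_delay * 2 ))
    rb_attempt=$(( rb_attempt + 1 ))
  done
}

# Example use in a pipeline step:
#   retry_backoff 5 1 kubectl apply -f manifests/
```

Five attempts with a one second base wait at most 1 + 2 + 4 + 8 = 15 seconds in total, which is short enough for CI while still easing pressure on a throttled API server.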

$ kubectl top nodes
$ kubectl describe nodes | grep -A5 "Allocated resources"
$ kubectl get --raw /metrics | grep apiserver_request_total | grep 429
$ kubectl get --raw /metrics | grep etcd_mvcc_db_total_size_in_bytes
$ # request count by verb — watch for LIST storms
$ kubectl get --raw /metrics | grep 'apiserver_request_total{verb="LIST"' | head

$ oc adm top nodes
$ oc get nodes -o json | jq '[.items[] | {name:.metadata.name, cpu:.status.allocatable.cpu, mem:.status.allocatable.memory, pods:.status.allocatable.pods}]'
$ oc get clusteroperator etcd -o yaml
$ # OCP monitoring: openshift-etcd-operator alerts on DB size and fsync latency
$ oc exec -n openshift-etcd etcd-master-0 -c etcdctl -- etcdctl endpoint status -w table

flowchart TB
  subgraph pool["Worker pool allocatable"]
    USED["Scheduled pods\n~80-90%"]
    HEAD["Headroom\n10-20% free"]
    DRAIN["One node drain\nfits in headroom"]
  end
  USED --> HEAD
  HEAD --> DRAIN
  CA["Cluster Autoscaler"] -->|"pods Pending"| HEAD

⚖️ Trade-off

Large nodes vs many small nodes: Fewer larger nodes reduce per-node overhead but each drain/eviction displaces more pods. Many smaller nodes improve granularity but increase IP consumption and kubelet API chatter. Match node size to your P95 pod footprint plus headroom.