Advanced Scheduling & Node Management

The default scheduler picks any node with enough CPU and memory—but production clusters need finer control: dedicated GPU pools, spot vs on-demand tiers, zone-balanced replicas, and safe node maintenance. This chapter covers taints & tolerations, affinity rules, topology spread, node lifecycle operations, and resource QoS from requests/limits through namespace quotas.

devops architect K8s 1.29+ OCP 4.16+

Taints & Tolerations

Taints repel pods from nodes unless the pod declares a matching toleration. Think of taints as "keep out" signs on nodes and tolerations as keys that unlock specific nodes. Unlike affinity (which pulls pods toward nodes), taints push workloads away—ideal for dedicated hardware and control-plane isolation.

Taint syntax: key=value:effect

A taint is a triple: key, optional value, and effect. The value is optional but commonly set for clarity.

Effect | Behavior | Typical use
NoSchedule | Hard reject: scheduler will not place new pods without a toleration | GPU nodes, master/control-plane, dedicated batch pools
PreferNoSchedule | Soft reject: scheduler tries to avoid the node but may schedule if no alternative | Spot/preemptible nodes, degraded hardware, cost-optimized tiers
NoExecute | Evicts existing pods without toleration; blocks new scheduling | Node maintenance, cordon-like isolation, NotReady auto-taint

Common use cases

  • GPU nodes — Taint nvidia.com/gpu=present:NoSchedule; only ML workloads with matching toleration land there.
  • Spot / preemptible — Taint lifecycle=spot:PreferNoSchedule or NoSchedule; fault-tolerant jobs tolerate interruption.
  • Master / control-plane — Auto-taint node-role.kubernetes.io/control-plane:NoSchedule prevents user workloads on etcd/API nodes.
yaml — node taint + pod toleration
# Applied to node (or via MachineSet / node pool config):
# kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
    - key: "lifecycle"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  nodeSelector:
    accelerator: nvidia-a100
  containers:
    - name: trainer
      image: pytorch/pytorch:2.2-cuda12.1-cudnn8-runtime
      resources:
        limits:
          nvidia.com/gpu: "1"

tolerationSeconds

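For taints with effect NoExecute, tolerationSeconds controls how long a pod may remain after the taint appears before eviction. The node controller adds node.kubernetes.io/not-ready:NoExecute as soon as a node's Ready condition turns False (and node.kubernetes.io/unreachable when it turns Unknown); the 5-minute delay comes from the pods, not from the taint. The DefaultTolerationSeconds admission plugin gives every pod a toleration for both taints with tolerationSeconds: 300 unless the pod already declares one, so pods are evicted 300s after the taint lands. A toleration with no tolerationSeconds tolerates the taint indefinitely. Setting tolerationSeconds: 3600 gives a one-hour grace period, which rides out brief network blips but delays failover by the same amount.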

yaml — tolerationSeconds for NotReady grace
tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
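
When several tolerations match the NoExecute taints on a node, the shortest tolerationSeconds decides when the pod is evicted. The sketch below models that rule as it is commonly described for the upstream taint-eviction logic; the function name is hypothetical, and it assumes the caller has already checked that every NoExecute taint on the node is tolerated (an untolerated taint means immediate eviction).

```python
from typing import Optional, Sequence


def eviction_delay_seconds(matching_toleration_seconds: Sequence[Optional[int]]) -> Optional[int]:
    """Seconds a pod may stay after NoExecute taints appear.

    Each entry is the tolerationSeconds of one toleration that matches a
    NoExecute taint on the node (None means the field is unset).
    Returns 0 for immediate eviction and None for "never evicted".
    """
    if not matching_toleration_seconds:
        return 0  # no matching toleration at all: evict immediately
    bounded = [s for s in matching_toleration_seconds if s is not None]
    if not bounded:
        return None  # every matching toleration is unbounded: the pod stays
    # The shortest bounded toleration wins; zero or negative means "now".
    return max(0, min(bounded))
```

With the two 300-second tolerations shown above the result is 300; mixing a 3600-second and a 300-second toleration still yields 300.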
terminal — taint node
$ kubectl taint nodes worker-gpu-1 nvidia.com/gpu=present:NoSchedule
$ kubectl taint nodes worker-gpu-1 nvidia.com/gpu=present:NoSchedule-
→ trailing minus removes the taint
$ kubectl describe node worker-gpu-1 | grep -A3 Taints
$ oc adm taint nodes worker-gpu-1 lifecycle=spot:PreferNoSchedule
$ oc describe node worker-gpu-1 | grep Taints
🔬 Under the Hood

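Taints act in both scheduler phases. During filtering, the TaintToleration plugin rejects any node that carries a NoSchedule or NoExecute taint the pod does not tolerate; during scoring, untolerated PreferNoSchedule taints lower a node's score alongside resources, affinity, and topology spread. A pod with no tolerations can therefore land only on nodes that have no NoSchedule or NoExecute taints. Taints do not guarantee placement: pair them with nodeSelector or affinity to pull workloads onto dedicated pools.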

⚠️ Pitfall

Taint without toleration = Pending forever. Adding a GPU taint to all GPU nodes without updating Deployments leaves pods stuck in Pending with "didn't tolerate taint" events. Roll out tolerations in the same change window as node pool taints.

💡 Pro Tip

Use operator: Exists with no value to tolerate any value for a key—handy for well-known node condition taints like node.kubernetes.io/not-ready.
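
The matching rules for Equal and Exists can be written down in a few lines. This is a minimal sketch of the comparison, not the scheduler's actual code; the Taint and Toleration classes and both function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str            # "NoSchedule", "PreferNoSchedule", or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""          # empty key with operator "Exists" matches every taint key
    operator: str = "Equal"
    value: str = ""
    effect: str = ""       # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        return toleration.key == "" or toleration.key == taint.key
    # operator "Equal": key and value must both match
    return toleration.key == taint.key and toleration.value == taint.value


def blocking_taints(taints, tolerations):
    """Taints that keep the pod off the node (PreferNoSchedule only affects scoring)."""
    return [
        t for t in taints
        if t.effect in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]
```

A pod is schedulable on a node, as far as taints are concerned, when blocking_taints returns an empty list.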

🎯 Interview Tip

"Taints vs node affinity?" — Taints are node-side repulsion; tolerations are pod-side permission. Affinity is pod-side attraction. Dedicated GPU pools typically use taint + toleration; zone preference uses affinity.

Node Affinity & Anti-Affinity

Affinity rules express where pods want to run relative to node labels or other pods. nodeSelector is the simplest form; nodeAffinity adds required vs preferred semantics. Pod affinity/anti-affinity co-locate or separate pods using topologyKey (usually kubernetes.io/hostname or zone labels).

NodeSelector (legacy simple form)

Exact key/value match on node labels. All expressions must match (AND). No soft preferences—if no node matches, pod stays Pending.

yaml — nodeSelector
spec:
  nodeSelector:
    disktype: ssd
    topology.kubernetes.io/zone: us-east-1a

NodeAffinity — required vs preferred

Field | Type | Behavior
requiredDuringSchedulingIgnoredDuringExecution | Hard | Must match or pod is not scheduled (Pending)
preferredDuringSchedulingIgnoredDuringExecution | Soft | Scheduler scores nodes; weight 1–100 boosts preference
yaml — nodeAffinity + podAntiAffinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
        tier: backend
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values: ["m6i.xlarge", "m6i.2xlarge"]
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values: ["us-east-1a", "us-east-1b"]
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: api
              topologyKey: kubernetes.io/hostname
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: cache
                topologyKey: kubernetes.io/hostname

Pod Affinity / Anti-Affinity

Pod affinity schedules near pods matching a label selector within a topology domain. Pod anti-affinity spreads replicas across domains—classic HA pattern: one replica per node or per zone. topologyKey defines the boundary: kubernetes.io/hostname for node-level spread, topology.kubernetes.io/zone for AZ-level spread.

flowchart TB
  subgraph zone_a["zone us-east-1a"]
    N1[node-1]
    N2[node-2]
  end
  subgraph zone_b["zone us-east-1b"]
    N3[node-3]
  end
  P1[api pod replica-1] --> N1
  P2[api pod replica-2] --> N2
  P3[api pod replica-3] --> N3
  AA["podAntiAffinity topologyKey=kubernetes.io/hostname"]
⚖️ Trade-off

Required anti-affinity vs topology spread: Hard pod anti-affinity on kubernetes.io/hostname prevents two replicas on the same node but can block scaling when node count < replica count. Topology spread constraints (next section) offer softer skew limits and scale more gracefully.

⚙️ Config

Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt (for numeric label values). Combine multiple nodeSelectorTerms with OR semantics across terms, AND within a term.
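
The OR-across-terms, AND-within-a-term rule is easy to misread, so here is a small sketch of how a node's labels could be evaluated against nodeSelectorTerms. It covers matchExpressions only (matchFields is ignored), and the function names and the cpu-count label in the usage note are made up for illustration.

```python
def expr_matches(labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    present = key in labels
    if op == "In":
        return present and labels[key] in values
    if op == "NotIn":
        return not present or labels[key] not in values
    if op == "Exists":
        return present
    if op == "DoesNotExist":
        return not present
    if op in ("Gt", "Lt"):
        # Gt/Lt compare the label value as an integer against a single value.
        if not present:
            return False
        try:
            actual, wanted = int(labels[key]), int(values[0])
        except (ValueError, IndexError):
            return False
        return actual > wanted if op == "Gt" else actual < wanted
    raise ValueError(f"unknown operator: {op}")


def node_matches(labels: dict, node_selector_terms: list) -> bool:
    """OR across terms, AND across the matchExpressions inside each term."""
    for term in node_selector_terms:
        exprs = term.get("matchExpressions", [])
        # An empty term matches nothing in this sketch.
        if exprs and all(expr_matches(labels, e) for e in exprs):
            return True
    return False
```

For example, a node labeled cpu-count=16 with no disktype label satisfies two separate terms (disktype Exists) OR (cpu-count Gt 8), but fails a single term that requires both.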

📦 Real World

Stateful services (Kafka, Elasticsearch) use required pod anti-affinity on hostname so brokers never share a node. Batch jobs use preferred node affinity toward cheaper instance types. OpenShift infra nodes are selected via node-role.kubernetes.io/infra labels—not user-facing affinity in most cases.

⚠️ Pitfall

Anti-affinity + small clusters. requiredDuringScheduling pod anti-affinity with 5 replicas on 3 nodes leaves 2 pods Pending forever. Use preferred rules or topology spread with whenUnsatisfiable: ScheduleAnyway.

Topology Spread Constraints

topologySpreadConstraints (stable since 1.19) distribute pods evenly across topology domains—zones, regions, or nodes—using a configurable max skew. They often replace brittle required pod anti-affinity for multi-zone HA.

Key fields

Field | Purpose
maxSkew | Max difference in pod count between any two domains (e.g. skew ≤ 1 means balanced)
topologyKey | Node label defining domains, e.g. topology.kubernetes.io/zone or kubernetes.io/hostname
whenUnsatisfiable | DoNotSchedule (hard) or ScheduleAnyway (best effort)
labelSelector / matchLabelKeys | Which pods count toward skew: usually the same app label; pod-template-hash for rolling updates
yaml — topologySpreadConstraints (zone + node)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
          matchLabelKeys:
            - pod-template-hash
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web

Replacing anti-affinity for zones

Instead of required pod anti-affinity on zone (which fails when zones are asymmetric), use: maxSkew: 1, topologyKey: topology.kubernetes.io/zone, whenUnsatisfiable: DoNotSchedule. With 3 zones and 6 replicas, you get 2+2+2—not 4+2+0. For node-level spread, swap topologyKey to kubernetes.io/hostname.
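
The 2+2+2 outcome follows directly from the skew check. The sketch below is a simplified model, assuming every zone is eligible and the incoming pod matches its own labelSelector; the function names are illustrative, and the real scheduler scores nodes rather than always picking the least-loaded domain.

```python
def can_place(counts: dict, domain: str, max_skew: int) -> bool:
    """True if adding one pod to `domain` keeps skew within max_skew.

    Skew for the candidate domain is (pods in that domain after placement)
    minus (the minimum pod count across all eligible domains).
    """
    return counts[domain] + 1 - min(counts.values()) <= max_skew


def spread(replicas: int, domains: list, max_skew: int = 1) -> dict:
    """Place replicas one at a time, honoring a DoNotSchedule constraint."""
    counts = {d: 0 for d in domains}
    for _ in range(replicas):
        allowed = [d for d in domains if can_place(counts, d, max_skew)]
        # Simplification: choose the least-loaded domain among the allowed ones.
        target = min(allowed, key=lambda d: counts[d])
        counts[target] += 1
    return counts


print(spread(6, ["us-east-1a", "us-east-1b", "us-east-1c"]))
```

Once the zones hold 2, 2, and 2 pods, a seventh pod is allowed anywhere (skew 1), but an eighth cannot join the zone that already has 3.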

🔬 Under the Hood

The scheduler plugin PodTopologySpread counts matching pods per domain and rejects placements that would exceed maxSkew. matchLabelKeys: [pod-template-hash] ensures old and new ReplicaSet pods during rollouts are counted separately—preventing temporary imbalance that breaks hard constraints.

💡 Pro Tip

Cluster-level default constraints, set through the PodTopologySpread plugin's defaultConstraints in KubeSchedulerConfiguration, or mutating webhooks can enforce zone spread without every team copying YAML. Platform teams publish a baseline; apps that declare their own topologySpreadConstraints override it.

🎯 Interview Tip

"How do you spread pods across AZs?" — Prefer topologySpreadConstraints with maxSkew: 1 over required pod anti-affinity: same HA outcome, better behavior when zones or replica counts change.

📦 Real World

EKS and GKE multi-AZ clusters label nodes with standard topology keys. Platform policies often require DoNotSchedule zone spread for tier-1 services and ScheduleAnyway hostname spread so burst scaling still works on constrained node pools.

Node Management

Nodes leave the fleet for upgrades, scaling down, or failure. Platform engineers use cordon, drain, and uncordon to evict workloads safely. Understanding node conditions, kubelet eviction thresholds, and OpenShift Machine Config Pools is essential for zero-downtime maintenance.

Cordon, drain, uncordon

Action | Effect
cordon | Marks node unschedulable (spec.unschedulable: true); existing pods keep running
drain | Cordons + evicts pods (respects PDBs); use --ignore-daemonsets for DaemonSet pods
uncordon | Re-enables scheduling after maintenance completes
terminal — cordon / drain / uncordon
$ kubectl cordon worker-3
node/worker-3 cordoned
$ kubectl drain worker-3 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=120 \
  --timeout=10m
evicting pod default/web-7d4f8b9c-xk2lm
node/worker-3 drained
$ kubectl uncordon worker-3
$ kubectl get nodes -o wide
$ kubectl describe node worker-3 | grep -A10 Conditions
$ oc adm cordon worker-3
$ oc adm drain worker-3 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=120
→ --force evicts pods not managed by RC/RS/Job (use cautiously)
$ oc adm uncordon worker-3
$ oc get nodes
$ oc describe node worker-3

Node conditions

The kubelet reports conditions on each node. The scheduler and controllers react to unhealthy signals.

Condition | Meaning
Ready | Node healthy and accepting pods (False = NotReady)
MemoryPressure | Node low on memory; may trigger eviction
DiskPressure | Disk or inode pressure on node filesystem / emptyDir
PIDPressure | Too many processes on the node
NetworkUnavailable | Node network not configured (CNI still starting)

Kubelet eviction thresholds

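When an eviction signal crosses its threshold, the kubelet ranks pods and evicts in this order: first, pods whose usage exceeds their requests (BestEffort pods have no requests, so they always fall in this group, together with Burstable pods over their requests), sorted by pod priority and then by how far usage exceeds requests; last, Guaranteed pods and Burstable pods running within their requests. Thresholds are configured via kubelet flags or KubeletConfiguration using the signals memory.available, nodefs.available, imagefs.available, and pid.available.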

yaml — kubelet evictionHard excerpt
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "1Gi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
evictionMinimumReclaim:
  memory.available: "0Mi"
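
To make the eviction ranking concrete, here is a rough model of the memory case: a hard-threshold check followed by a sort on (exceeds request, priority, usage above request). It is a sketch of the documented ordering, not kubelet code; PodUsage and both function names are hypothetical, and the 500 Mi default mirrors the evictionHard value above.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    memory_request_mi: int   # 0 for a BestEffort pod
    memory_usage_mi: int
    priority: int = 0


def under_memory_pressure(available_mi: int, eviction_hard_mi: int = 500) -> bool:
    """Hard threshold: evict as soon as memory.available drops below the limit."""
    return available_mi < eviction_hard_mi


def eviction_order(pods):
    """Rank pods for memory eviction, first candidate first.

    Pods whose usage exceeds their request come first, then lower priority,
    then the largest usage above request.
    """
    return sorted(
        pods,
        key=lambda p: (
            p.memory_usage_mi <= p.memory_request_mi,    # False (exceeds) sorts first
            p.priority,
            -(p.memory_usage_mi - p.memory_request_mi),
        ),
    )
```

A BestEffort pod using 300 Mi ranks ahead of a Burstable pod that is 188 Mi over its request, and a Guaranteed pod within its request ranks last.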

NotReady taint (5 minutes)

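When a node's Ready condition becomes False, the node controller adds node.kubernetes.io/not-ready:NoExecute right away; there is no 5-minute wait before the taint. The 5 minutes come from the default toleration (tolerationSeconds: 300) that the DefaultTolerationSeconds admission plugin injects into pods: once it expires, the pods are evicted so workloads reschedule elsewhere. The same applies to node.kubernetes.io/unreachable, which is added when the node controller loses contact and Ready becomes Unknown.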

OpenShift: Machine Config Pool

On OCP, nodes are grouped into Machine Config Pools (MCP)—master, worker, and custom pools (e.g. gpu, infra). MachineConfig changes roll out pool-by-pool: nodes cordon, drain, reboot with new config, uncordon. Always drain one MCP segment at a time; watch oc get mcp for UPDATED / UPDATING status.

🔴 OpenShift

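Use oc adm drain on OCP worker nodes; it applies the same eviction logic as kubectl drain and is the command used in Red Hat's node maintenance procedures. During upgrades the Machine Config Operator cordons and drains nodes itself, so avoid draining manually while a pool is updating. Check oc get machineconfigpool before patching kubelet or OS settings via MachineConfig.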

⚠️ Pitfall

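Drain blocked by PDB. A PodDisruptionBudget with minAvailable: 1 on a single-replica Deployment allows zero disruptions, so the eviction is refused and kubectl drain keeps retrying until --timeout. Scale up temporarily or accept downtime during node maintenance.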
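
The arithmetic behind a blocked drain is simple: allowed disruptions are the healthy pods minus the number the budget requires. A minimal sketch, assuming integer (not percentage) values and a hypothetical function name:

```python
def disruptions_allowed(healthy: int, expected: int,
                        min_available=None, max_unavailable=None) -> int:
    """How many pods the eviction API may remove right now under a PDB.

    healthy  = pods currently Ready and covered by the PDB selector
    expected = replicas the owning controller wants
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    return max(0, healthy - desired_healthy)


print(disruptions_allowed(healthy=1, expected=1, min_available=1))  # single replica
print(disruptions_allowed(healthy=2, expected=2, min_available=1))  # after scaling up
```

Scaling the Deployment to two replicas raises the allowance from 0 to 1, which is why a temporary scale-up unblocks the drain.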

🔒 Security

Restrict who can cordon/drain nodes via RBAC: cordon needs patch on nodes, and drain additionally needs create on pods/eviction. Malicious drain of all workers is a denial-of-service vector, so audit oc adm drain and kubectl drain in production clusters.

📦 Real World

Managed Kubernetes (EKS, GKE, AKS, ROSA) automates node rotation—your job is ensuring PDBs, surge capacity, and zone spread so automated drains never violate SLOs. Run game-day drills: drain one node per AZ and verify HPA + spread constraints recover QPS within minutes.

Resource Management

Without requests and limits, one noisy neighbor can starve the node. Kubernetes classifies pods into QoS classes, enforces namespace ResourceQuotas and LimitRanges, and uses PriorityClass for preemption during contention.

Requests vs limits

Field | Scheduler | Runtime behavior
requests | Used for scheduling: the sum of requests must fit node allocatable | CPU: guaranteed share when contended; Memory: soft reservation for eviction ordering
limits | Not used for scheduling (but if only a limit is set, the request defaults to that limit) | CPU: CFS quota cap (throttling); Memory: hard cap → OOMKill when exceeded

CPU throttling & memory OOMKill

  • CPU — Container exceeds its CPU limit → throttled (CFS bandwidth). Symptoms: latency spikes without pod restart. Fix: raise limit or lower load; consider omitting CPU limit for latency-sensitive apps (keep request).
  • Memory — Container exceeds memory limit → Linux OOM killer terminates the container (exit 137). Unlike CPU, memory is not compressible—always set limits ≤ node capacity with headroom.
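
The CPU numbers translate into cgroup settings with simple arithmetic. The sketch below assumes the default 100 ms CFS period and the cgroup v1 cpu.shares scale; the function names are illustrative.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100 ms


def cpu_limit_to_cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (microseconds) the container may use per CFS period."""
    return limit_millicores * period_us // 1000


def cpu_request_to_shares(request_millicores: int) -> int:
    """cgroup v1 cpu.shares: 1024 shares per full core, with a floor of 2."""
    return max(2, request_millicores * 1024 // 1000)


# A 500m limit allows 50 ms of CPU time in every 100 ms window; a
# single-threaded container that wants more sits throttled for the rest.
print(cpu_limit_to_cfs_quota_us(500))
print(cpu_request_to_shares(500))
```

This is why a CPU-bound request handler under a tight limit shows latency spikes at multiples of the period rather than a restart.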

QoS classes

Class | Criteria | Eviction order
Guaranteed | Every container sets CPU and memory limits, and limits == requests | Last evicted
Burstable | At least one container has requests or limits set; not Guaranteed | Middle: evicted if usage > requests
BestEffort | No requests or limits on any container | First evicted
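
The classification in the table can be expressed as a short function. This is a sketch under simplifying assumptions: it looks only at cpu and memory, ignores init containers, and compares quantity strings literally (so "1" and "1000m" would be treated as different). The function name is illustrative.

```python
def qos_class(containers: list) -> str:
    """containers: list of {"requests": {...}, "limits": {...}} dicts."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = dict(c.get("requests", {}))
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for res in ("cpu", "memory"):
            # A request left unset defaults to the limit for that resource.
            if res in limits and res not in requests:
                requests[res] = limits[res]
            if res not in limits or requests.get(res) != limits[res]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "500m", "memory": "512Mi"},
                  "limits": {"cpu": "2000m", "memory": "1Gi"}}]))
```

Note that one container without resources is enough to drop a multi-container pod from Guaranteed to Burstable.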
yaml — production resources + PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-1-production
value: 1000000
globalDefault: false
description: "Critical payment path — preempts lower tiers"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      priorityClassName: tier-1-production
      containers:
        - name: api
          image: payments:2.4.0
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "1Gi"
          # QoS: Burstable (limits != requests)
---
# Guaranteed QoS example — limits == requests for all containers
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:
          cpu: "250m"
          memory: "256Mi"

Production recommendations

  • Always set memory requests ≈ expected working set; limits at 1.2–1.5× request for Java/Go heap growth.
  • Set CPU requests from p95 usage (VPA or metrics); avoid CPU limits on latency-critical services or set limit ≥ 2× request.
  • Run tier-1 workloads as Guaranteed or high-priority Burstable with accurate requests.
  • Never run production apps as BestEffort—first evicted under node pressure.
  • Use PriorityClass so platform components and payment paths preempt batch jobs.

ResourceQuota

Caps aggregate resource consumption per namespace—prevents one team from consuming the entire cluster. Quota is enforced at admission time (create/update pod).

yaml — ResourceQuota + LimitRange
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-payments-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
    persistentvolumeclaims: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: payments
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: "50m"
        memory: 64Mi
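
Quota admission amounts to checking that current usage plus the new pod stays within each hard limit. A minimal sketch with CPU in millicores and memory in Mi; the function name and the usage figures in the example are made up, while the hard values mirror the ResourceQuota above.

```python
def admit(used: dict, hard: dict, pod_delta: dict):
    """Return (allowed, exceeded): which hard limits the new pod would break."""
    exceeded = [
        name for name, amount in pod_delta.items()
        if name in hard and used.get(name, 0) + amount > hard[name]
    ]
    return (not exceeded, exceeded)


hard = {
    "requests.cpu": 20_000,          # 20 cores in millicores
    "requests.memory": 40 * 1024,    # 40Gi in Mi
    "limits.cpu": 40_000,
    "limits.memory": 80 * 1024,
    "pods": 50,
}
used = {"requests.cpu": 19_800, "requests.memory": 10_240, "pods": 12}

# A pod requesting 500m CPU would push requests.cpu past the 20-core cap.
print(admit(used, hard, {"requests.cpu": 500, "requests.memory": 512, "pods": 1}))
```

Because the check happens at admission, a rejected pod never appears as Pending; the create call itself fails, which for a Deployment shows up as a ReplicaSet event.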
terminal — quota & QoS inspection
$ kubectl describe quota -n payments
$ kubectl describe limitrange -n payments
$ kubectl get pods -n payments -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
$ kubectl top pods -n payments
$ oc describe quota -n payments
$ oc get resourcequota -n payments
$ oc get pods -n payments -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
$ oc adm top pods -n payments
⚙️ Config

OpenShift does not apply quotas to new projects by default. To set one, add a ResourceQuota to a custom project request template, or use ClusterResourceQuota (oc get clusterresourcequota) for multi-namespace caps.

⚖️ Trade-off

CPU limits on/off: Omitting CPU limits avoids CFS throttling surprises but allows burst to consume full node CPU—acceptable on dedicated nodes, risky on multi-tenant pools. Memory limits should never be omitted in production; OOM at node level is worse than container-level OOMKill.

🎯 Interview Tip

"Guaranteed vs Burstable?" — Guaranteed: limits equal requests on all containers; last evicted, stable CPU shares. Burstable: most common; can burst above request until limit; evicted when node pressure if usage exceeds request. BestEffort: no resources set; first evicted.

💡 Pro Tip

Run VPA in recommendation mode, then promote to requests. Pair with LimitRange defaultRequest so developers who omit resources still get sane baselines—quota enforcement then works predictably.

⚠️ Pitfall

LimitRange defaults do not produce Guaranteed QoS here. Pods created without resources get default as their limit and defaultRequest as their request; because the two differ in the LimitRange above, the pods show Burstable QoS. For Guaranteed, explicitly set equal requests and limits in the pod spec.
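
The effect of defaulting can be traced by hand. The sketch below models a Container-type LimitRange filling in missing values, assuming that a request falls back to an explicitly set limit before the LimitRange is consulted; the function names are illustrative and the values come from the LimitRange shown earlier.

```python
def apply_limitrange_defaults(resources: dict, default: dict, default_request: dict) -> dict:
    """Fill in missing container requests/limits from LimitRange defaults."""
    user_limits = resources.get("limits", {})
    requests = dict(resources.get("requests", {}))
    for res, value in user_limits.items():
        requests.setdefault(res, value)      # request falls back to an explicit limit
    limits = dict(user_limits)
    for res, value in default.items():
        limits.setdefault(res, value)        # LimitRange `default` becomes the limit
    for res, value in default_request.items():
        requests.setdefault(res, value)      # LimitRange `defaultRequest` becomes the request
    return {"requests": requests, "limits": limits}


def is_guaranteed(resources: dict) -> bool:
    """True when cpu and memory limits are set and equal to the requests."""
    return all(
        res in resources["limits"]
        and resources["requests"].get(res) == resources["limits"][res]
        for res in ("cpu", "memory")
    )


default = {"cpu": "500m", "memory": "512Mi"}
default_request = {"cpu": "100m", "memory": "128Mi"}

bare = apply_limitrange_defaults({}, default, default_request)
print(bare, is_guaranteed(bare))
```

A container with no resources ends up with 100m/128Mi requests and 500m/512Mi limits, which is Burstable; only equal values, whether explicit or via a LimitRange whose default and defaultRequest match, yield Guaranteed.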