Advanced Scheduling & Node Management
The default scheduler picks any node with enough CPU and memory—but production clusters need finer control: dedicated GPU pools, spot vs on-demand tiers, zone-balanced replicas, and safe node maintenance. This chapter covers taints & tolerations, affinity rules, topology spread, node lifecycle operations, and resource QoS from requests/limits through namespace quotas.
Taints & Tolerations
Taints repel pods from nodes unless the pod declares a matching toleration. Think of taints as "keep out" signs on nodes and tolerations as keys that unlock specific nodes. Unlike affinity (which pulls pods toward nodes), taints push workloads away—ideal for dedicated hardware and control-plane isolation.
Taint syntax: key=value:effect
A taint is a triple: a key, an optional value, and an effect. The value may be omitted, but setting it makes the taint's purpose easier to read.
| Effect | Behavior | Typical use |
|---|---|---|
| NoSchedule | Hard reject—scheduler will not place new pods without a toleration | GPU nodes, master/control-plane, dedicated batch pools |
| PreferNoSchedule | Soft reject—scheduler tries to avoid but may schedule if no alternative | Spot/preemptible nodes, degraded hardware, cost-optimized tiers |
| NoExecute | Evicts existing pods without toleration; blocks new scheduling | Node maintenance, cordon-like isolation, NotReady auto-taint |
Common use cases
- GPU nodes — Taint nvidia.com/gpu=present:NoSchedule; only ML workloads with matching toleration land there.
- Spot / preemptible — Taint lifecycle=spot:PreferNoSchedule or NoSchedule; fault-tolerant jobs tolerate interruption.
- Master / control-plane — Auto-taint node-role.kubernetes.io/control-plane:NoSchedule prevents user workloads on etcd/API nodes.
# Applied to node (or via MachineSet / node pool config):
# kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
    - key: "lifecycle"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  nodeSelector:
    accelerator: nvidia-a100
  containers:
    - name: trainer
      image: pytorch/pytorch:2.2-cuda12.1-cudnn8-runtime
      resources:
        limits:
          nvidia.com/gpu: "1"
tolerationSeconds
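For taints with effect NoExecute, tolerationSeconds controls how long a pod may remain after the taint appears before eviction. When a node becomes NotReady, the node controller adds node.kubernetes.io/not-ready:NoExecute (or node.kubernetes.io/unreachable:NoExecute when it loses contact with the node). Unless a pod declares its own, the DefaultTolerationSeconds admission plugin gives it tolerations for both taints with tolerationSeconds: 300, which is where the familiar 5-minute eviction delay comes from. Setting tolerationSeconds: 3600 instead gives a one-hour grace period that rides out brief network blips; a shorter value speeds up failover.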
tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
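The timing rule can be sketched in a few lines of Python. This is a simplified model, not the controller's code: the function name `eviction_delay` and its input shape are assumptions made for illustration. It assumes every NoExecute taint on the node is matched by some toleration, unless the list is empty, which stands for "no matching toleration".

```python
from typing import Optional, Sequence


def eviction_delay(matched: Sequence[Optional[int]]) -> Optional[int]:
    """Seconds a pod may stay on a node after NoExecute taints appear.

    `matched` holds the tolerationSeconds of each toleration that matches one
    of the node's NoExecute taints (None means the field was omitted).
    An empty sequence stands for "no matching toleration at all".
    Returns 0 for immediate eviction, None for "never evicted by these taints".
    """
    if not matched:
        return 0
    bounded = [s for s in matched if s is not None]
    if not bounded:
        return None  # tolerated indefinitely
    return max(0, min(bounded))  # the shortest grace period wins


print(eviction_delay([300]))        # 300
print(eviction_delay([3600, 300]))  # 300
print(eviction_delay([None]))       # None
print(eviction_delay([]))           # 0
```

With the two tolerations shown above, a pod on a node that goes NotReady stays bound for 300 seconds and is then evicted.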
$ kubectl taint nodes worker-gpu-1 nvidia.com/gpu=present:NoSchedule
$ kubectl taint nodes worker-gpu-1 nvidia.com/gpu=present:NoSchedule-
→ trailing minus removes the taint
$ kubectl describe node worker-gpu-1 | grep -A3 Taints
$ oc adm taint nodes worker-gpu-1 lifecycle=spot:PreferNoSchedule
$ oc describe node worker-gpu-1 | grep Taints
The scheduler works in two phases: filtering, which removes nodes that cannot run the pod (untolerated NoSchedule or NoExecute taints, insufficient resources, unmet required affinity), and scoring, which ranks the nodes that remain (preferred affinity, topology spread, PreferNoSchedule taints). A pod with no tolerations can land only on nodes that carry no NoSchedule or NoExecute taints. Taints do not guarantee placement: pair them with nodeSelector or affinity to pull workloads onto dedicated pools.
Taint without toleration = Pending forever. Adding a GPU taint to all GPU nodes without updating Deployments leaves pods stuck in Pending with "didn't tolerate taint" events. Roll out tolerations in the same change window as node pool taints.
Use operator: Exists with no value to tolerate any value for a key—handy for well-known node condition taints like node.kubernetes.io/not-ready.
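The matching rules between a toleration and a taint are compact enough to model directly. The following is a hedged sketch of those rules in Python; the class and function names (`Taint`, `Toleration`, `tolerates`, `blocking_taints`) are illustrative and not part of any Kubernetes client library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str      # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key is only valid with operator Exists
    operator: str = "Equal"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    if tol.effect and tol.effect != taint.effect:
        return False
    if not tol.key:
        return tol.operator == "Exists"   # wildcard: tolerates any key
    if tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True                       # value is ignored
    return tol.operator == "Equal" and tol.value == taint.value


def blocking_taints(taints: List[Taint], tolerations: List[Toleration]) -> List[Taint]:
    """Taints that filter the node out; PreferNoSchedule only affects scoring."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return [t for t in hard if not any(tolerates(tol, t) for tol in tolerations)]


gpu = Taint("nvidia.com/gpu", "NoSchedule", "present")
print(blocking_taints([gpu], []))                                             # the GPU taint blocks
print(blocking_taints([gpu], [Toleration("nvidia.com/gpu", "Exists")]))       # []
```

A toleration with operator Exists and no value clears the GPU taint regardless of the taint's value, while a pod with no tolerations is filtered out.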
"Taints vs node affinity?" — Taints are node-side repulsion; tolerations are pod-side permission. Affinity is pod-side attraction. Dedicated GPU pools typically use taint + toleration; zone preference uses affinity.
Node Affinity & Anti-Affinity
Affinity rules express where pods want to run relative to node labels or other pods. nodeSelector is the simplest form; nodeAffinity adds required vs preferred semantics. Pod affinity/anti-affinity co-locate or separate pods using topologyKey (usually kubernetes.io/hostname or zone labels).
NodeSelector (simplest form)
Exact key/value match on node labels. All listed labels must match (AND). There are no soft preferences: if no node matches, the pod stays Pending.
spec:
  nodeSelector:
    disktype: ssd
    topology.kubernetes.io/zone: us-east-1a
NodeAffinity — required vs preferred
| Field | Type | Behavior |
|---|---|---|
| requiredDuringSchedulingIgnoredDuringExecution | Hard | Must match or pod is not scheduled (Pending) |
| preferredDuringSchedulingIgnoredDuringExecution | Soft | Scheduler scores nodes; weight 1–100 boosts preference |
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
        tier: backend
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values: ["m6i.xlarge", "m6i.2xlarge"]
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values: ["us-east-1a", "us-east-1b"]
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: api
              topologyKey: kubernetes.io/hostname
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: cache
                topologyKey: kubernetes.io/hostname
Pod Affinity / Anti-Affinity
Pod affinity schedules near pods matching a label selector within a topology domain. Pod anti-affinity spreads replicas across domains—classic HA pattern: one replica per node or per zone. topologyKey defines the boundary: kubernetes.io/hostname for node-level spread, topology.kubernetes.io/zone for AZ-level spread.
flowchart TB
    subgraph zone_a["zone us-east-1a"]
        N1[node-1]
        N2[node-2]
    end
    subgraph zone_b["zone us-east-1b"]
        N3[node-3]
    end
    P1[api pod replica-1] --> N1
    P2[api pod replica-2] --> N2
    P3[api pod replica-3] --> N3
    AA["podAntiAffinity topologyKey=kubernetes.io/hostname"]
Required anti-affinity vs topology spread: Hard pod anti-affinity on kubernetes.io/hostname prevents two replicas on the same node but can block scaling when node count < replica count. Topology spread constraints (next section) offer softer skew limits and scale more gracefully.
Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt (for numeric label values). Combine multiple nodeSelectorTerms with OR semantics across terms, AND within a term.
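To make the AND/OR semantics and the weight-based scoring concrete, here is a small Python model of node affinity evaluation. It is a sketch under simplifying assumptions: nodes are plain label dictionaries, the helper names are invented for this example, and the real scheduler normalizes scores and combines them with other plugins.

```python
from typing import Dict, List


def expr_matches(labels: Dict[str, str], expr: dict) -> bool:
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values   # also true when the label is absent
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        try:
            have, want = int(labels[key]), int(values[0])
        except (KeyError, IndexError, ValueError):
            return False
        return have > want if op == "Gt" else have < want
    raise ValueError(f"unknown operator {op!r}")


def term_matches(labels: Dict[str, str], exprs: List[dict]) -> bool:
    # AND within a term; an empty term matches nothing
    return bool(exprs) and all(expr_matches(labels, e) for e in exprs)


def passes_required(labels: Dict[str, str], terms: List[dict]) -> bool:
    # OR across nodeSelectorTerms
    return any(term_matches(labels, t["matchExpressions"]) for t in terms)


def preferred_score(labels: Dict[str, str], prefs: List[dict]) -> int:
    # Sum the weights of every preferred term the node satisfies
    return sum(p["weight"] for p in prefs
               if term_matches(labels, p["preference"]["matchExpressions"]))


required = [{"matchExpressions": [
    {"key": "node.kubernetes.io/instance-type", "operator": "In",
     "values": ["m6i.xlarge", "m6i.2xlarge"]}]}]
preferred = [{"weight": 80, "preference": {"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In",
     "values": ["us-east-1a", "us-east-1b"]}]}}]

node = {"node.kubernetes.io/instance-type": "m6i.xlarge",
        "topology.kubernetes.io/zone": "us-east-1a"}
print(passes_required(node, required), preferred_score(node, preferred))  # True 80
```

The rules mirror the Deployment shown earlier in this section: an m6i.xlarge node in us-east-1a passes the hard filter and earns the full preference weight, while an m6i node in another zone still passes but scores 0.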
Stateful services (Kafka, Elasticsearch) use required pod anti-affinity on hostname so brokers never share a node. Batch jobs use preferred node affinity toward cheaper instance types. OpenShift infra nodes are selected via node-role.kubernetes.io/infra labels—not user-facing affinity in most cases.
Anti-affinity + small clusters. Required pod anti-affinity on kubernetes.io/hostname with 5 replicas on 3 nodes leaves 2 pods Pending until more nodes join. Use preferred rules or topology spread with whenUnsatisfiable: ScheduleAnyway.
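The arithmetic behind this pitfall is easy to check with a toy placement loop. This is an illustrative model only (the function name and pod names are made up, and it ignores resources and every other scheduling rule): each node accepts at most one replica.

```python
from typing import Dict, List, Tuple


def place_with_hostname_anti_affinity(replicas: int, nodes: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Hard anti-affinity on hostname: one matching pod per node, the rest stay Pending."""
    free = list(nodes)
    placed: Dict[str, str] = {}
    pending: List[str] = []
    for i in range(replicas):
        pod = f"api-{i}"
        if free:
            placed[pod] = free.pop(0)
        else:
            pending.append(pod)   # no node without an api pod is left
    return placed, pending


placed, pending = place_with_hostname_anti_affinity(5, ["node-1", "node-2", "node-3"])
print(placed)   # three pods, one per node
print(pending)  # ['api-3', 'api-4']
```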
Topology Spread Constraints
topologySpreadConstraints (stable since 1.19) distribute pods evenly across topology domains—zones, regions, or nodes—using a configurable max skew. They often replace brittle required pod anti-affinity for multi-zone HA.
Key fields
| Field | Purpose |
|---|---|
| maxSkew | Max difference in pod count between any two domains (e.g. skew ≤ 1 means balanced) |
| topologyKey | Node label defining domains—topology.kubernetes.io/zone or kubernetes.io/hostname |
| whenUnsatisfiable | DoNotSchedule (hard) or ScheduleAnyway (best effort) |
| labelSelector / matchLabelKeys | Which pods count toward skew—usually same app label; pod-template-hash for rolling updates |
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
          matchLabelKeys:
            - pod-template-hash
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web
Replacing anti-affinity for zones
Instead of required pod anti-affinity on zone (which allows at most one replica per zone and leaves the rest Pending), use: maxSkew: 1, topologyKey: topology.kubernetes.io/zone, whenUnsatisfiable: DoNotSchedule. With 3 zones and 6 replicas, you get 2+2+2, not 4+2+0. For node-level spread, swap topologyKey to kubernetes.io/hostname.
The scheduler plugin PodTopologySpread counts matching pods per domain and rejects placements that would exceed maxSkew. matchLabelKeys: [pod-template-hash] ensures old and new ReplicaSet pods during rollouts are counted separately—preventing temporary imbalance that breaks hard constraints.
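The skew check itself is a one-line formula: the pod count the candidate domain would have after placement, minus the lowest count in any domain, must not exceed maxSkew. A minimal Python sketch, assuming every zone has capacity and using a naive "first feasible zone" choice in place of the scheduler's scoring (both function names are invented here):

```python
from typing import Dict, List


def placement_allowed(counts: Dict[str, int], domain: str, max_skew: int) -> bool:
    """DoNotSchedule check: skew after placing one more matching pod in `domain`."""
    skew = counts[domain] + 1 - min(counts.values())
    return skew <= max_skew


def spread(replicas: int, zones: List[str], max_skew: int = 1) -> Dict[str, int]:
    counts = {z: 0 for z in zones}
    for _ in range(replicas):
        feasible = [z for z in zones if placement_allowed(counts, z, max_skew)]
        if not feasible:
            break                      # pod would stay Pending
        counts[feasible[0]] += 1       # naive choice; the real scheduler scores nodes
    return counts


print(spread(6, ["us-east-1a", "us-east-1b", "us-east-1c"]))
# {'us-east-1a': 2, 'us-east-1b': 2, 'us-east-1c': 2}
```

Even with the naive choice, the hard constraint forces the 2+2+2 outcome, because a third pod in any zone would push the skew to 2.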
Cluster-level default constraints, set through the PodTopologySpread plugin's defaultConstraints in the scheduler configuration or injected by a mutating webhook, can enforce zone spread without every team copying YAML: platform teams publish a baseline, and apps override it by declaring their own topologySpreadConstraints.
"How do you spread pods across AZs?" — Prefer topologySpreadConstraints with maxSkew: 1 over required pod anti-affinity: same HA outcome, better behavior when zones or replica counts change.
EKS and GKE multi-AZ clusters label nodes with standard topology keys. Platform policies often require DoNotSchedule zone spread for tier-1 services and ScheduleAnyway hostname spread so burst scaling still works on constrained node pools.
Node Management
Nodes leave the fleet for upgrades, scaling down, or failure. Platform engineers use cordon, drain, and uncordon to evict workloads safely. Understanding node conditions, kubelet eviction thresholds, and OpenShift Machine Config Pools is essential for zero-downtime maintenance.
Cordon, drain, uncordon
| Action | Effect |
|---|---|
| cordon | Marks node unschedulable (spec.unschedulable: true); existing pods keep running |
| drain | Cordons + evicts pods (respects PDBs); use --ignore-daemonsets for DaemonSet pods |
| uncordon | Re-enables scheduling after maintenance completes |
$ kubectl cordon worker-3
node/worker-3 cordoned
$ kubectl drain worker-3 \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --grace-period=120 \
    --timeout=10m
evicting pod default/web-7d4f8b9c-xk2lm
node/worker-3 drained
$ kubectl uncordon worker-3
$ kubectl get nodes -o wide
$ kubectl describe node worker-3 | grep -A10 Conditions
$ oc adm cordon worker-3
$ oc adm drain worker-3 \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --force \
    --grace-period=120
→ --force evicts pods not managed by RC/RS/Job (use cautiously)
$ oc adm uncordon worker-3
$ oc get nodes
$ oc describe node worker-3
Node conditions
The kubelet reports conditions on each node. The scheduler and controllers react to unhealthy signals.
| Condition | Meaning |
|---|---|
| Ready | Node healthy and accepting pods (False = NotReady) |
| MemoryPressure | Node low on memory—may trigger eviction |
| DiskPressure | Disk or inode pressure on node filesystem / emptyDir |
| PIDPressure | Too many processes on the node |
| NetworkUnavailable | Node network not configured (CNI still starting) |
Kubelet eviction thresholds
When node resources drop below eviction thresholds, the kubelet ranks pods by QoS and usage, then evicts BestEffort first, then Burstable exceeding requests, then Guaranteed last. Thresholds are configured via kubelet flags or KubeletConfiguration: memory.available, nodefs.available, imagefs.available, pid.available.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "1Gi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
evictionMinimumReclaim:
  memory.available: "0Mi"
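The eviction ranking can be approximated with a three-part sort key: pods whose usage exceeds their request go first, then lower pod priority, then the largest overage. The sketch below models memory pressure only; the `PodMem` type and the sample pods are hypothetical, and the kubelet's real implementation handles more signals than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodMem:
    name: str
    priority: int
    request_mib: int   # 0 for BestEffort pods
    usage_mib: int


def eviction_order(pods: List[PodMem]) -> List[PodMem]:
    return sorted(pods, key=lambda p: (
        not (p.usage_mib > p.request_mib),   # over-request pods sort first
        p.priority,                          # then lower priority first
        -(p.usage_mib - p.request_mib),      # then largest overage first
    ))


pods = [
    PodMem("payments", priority=1000, request_mib=512, usage_mib=500),  # Guaranteed-style
    PodMem("worker", priority=0, request_mib=512, usage_mib=300),       # Burstable, under request
    PodMem("api-burst", priority=0, request_mib=256, usage_mib=400),    # Burstable, over request
    PodMem("batch-scraper", priority=0, request_mib=0, usage_mib=200),  # BestEffort
]
print([p.name for p in eviction_order(pods)])
# ['batch-scraper', 'api-burst', 'worker', 'payments']
```

BestEffort pods always count as over their (zero) request, which is why they land at the front of the queue, while a Guaranteed pod cannot exceed its request without first hitting its own limit.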
NotReady taint (5 minutes)
When a node transitions to NotReady, the node controller taints it with node.kubernetes.io/not-ready:NoExecute. Pods are not evicted at once: by default each pod carries a toleration for this taint with tolerationSeconds: 300, so eviction starts 5 minutes after the taint appears and workloads then reschedule elsewhere. The same applies to node.kubernetes.io/unreachable when the node controller loses contact with the node.
OpenShift: Machine Config Pool
On OCP, nodes are grouped into Machine Config Pools (MCP)—master, worker, and custom pools (e.g. gpu, infra). MachineConfig changes roll out pool-by-pool: nodes cordon, drain, reboot with new config, uncordon. Always drain one MCP segment at a time; watch oc get mcp for UPDATED / UPDATING status.
On OCP, oc adm drain runs the same drain logic as kubectl drain. During upgrades and MachineConfig rollouts the Machine Config Operator cordons and drains each node itself, so reserve manual drains for ad hoc maintenance. Check oc get machineconfigpool before patching kubelet or OS settings via MachineConfig.
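Drain blocked by PDB. A PodDisruptionBudget with minAvailable: 1 over a single-replica Deployment allows zero disruptions, so the eviction is refused and the drain keeps retrying until it times out. Scale up temporarily or accept downtime during node maintenance.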
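The budget arithmetic is simple enough to check before a maintenance window. A minimal sketch, assuming an integer minAvailable (percentages are not handled) and a hypothetical function name:

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a PDB with an integer minAvailable permits right now."""
    return max(0, healthy_pods - min_available)


print(disruptions_allowed(healthy_pods=1, min_available=1))  # 0
print(disruptions_allowed(healthy_pods=3, min_available=2))  # 1
```

A result of 0 means every eviction request for those pods is refused, which is what stalls the drain; scaling the Deployment to 2 replicas raises the budget to 1.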
Restrict who can cordon/drain nodes via RBAC (patch on nodes for cordon, create on pods/eviction for drain). Malicious drain of all workers is a denial-of-service vector; audit oc adm drain and kubectl drain in production clusters.
Managed Kubernetes (EKS, GKE, AKS, ROSA) automates node rotation—your job is ensuring PDBs, surge capacity, and zone spread so automated drains never violate SLOs. Run game-day drills: drain one node per AZ and verify HPA + spread constraints recover QPS within minutes.
Resource Management
Without requests and limits, one noisy neighbor can starve the node. Kubernetes classifies pods into QoS classes, enforces namespace ResourceQuotas and LimitRanges, and uses PriorityClass for preemption during contention.
Requests vs limits
| Field | Scheduler | Runtime behavior |
|---|---|---|
| requests | Used for scheduling—sum of requests must fit node allocatable | CPU: guaranteed share when contended; Memory: soft reservation for eviction ordering |
| limits | Not used for scheduling (but a container that sets a limit and no request gets request = limit) | CPU: CFS quota cap (throttling); Memory: hard cap → OOMKill when exceeded |
CPU throttling & memory OOMKill
- CPU — Container exceeds its CPU limit → throttled (CFS bandwidth). Symptoms: latency spikes without pod restart. Fix: raise limit or lower load; consider omitting CPU limit for latency-sensitive apps (keep request).
- Memory — Container exceeds memory limit → Linux OOM killer terminates the container (exit 137). Unlike CPU, memory is not compressible—always set limits ≤ node capacity with headroom.
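The CPU side can be made concrete with the CFS numbers. A CPU limit is translated into a quota of CPU time per scheduling period; the sketch below assumes the common default period of 100 ms and uses invented helper names, so treat the figures as illustrative.

```python
CFS_PERIOD_US = 100_000  # assumed default CFS period: 100 ms


def cfs_quota_us(limit_millicores: int) -> int:
    """CPU time (microseconds) the container may use in each period."""
    return limit_millicores * CFS_PERIOD_US // 1000


def deferred_us(demand_us: int, limit_millicores: int) -> int:
    """CPU time wanted in one period that is pushed to later periods (throttling)."""
    return max(0, demand_us - cfs_quota_us(limit_millicores))


print(cfs_quota_us(500))         # 50000: half a core = 50 ms per 100 ms
print(cfs_quota_us(2000))        # 200000: two cores' worth per period
print(deferred_us(80_000, 500))  # 30000: 30 ms of work waits for the next period
```

That deferred work is what shows up as latency spikes: the container is never restarted, it is simply paused until the next period begins.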
QoS classes
| Class | Criteria | Eviction order |
|---|---|---|
| Guaranteed | Every container: limits == requests (for CPU/memory); limits must be set | Last evicted |
| Burstable | At least one container has requests or limits set; not Guaranteed | Middle—evicted if usage > requests |
| BestEffort | No requests or limits on any container | First evicted |
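The classification rules in the table can be expressed as a short function. This is a sketch rather than the API server's code: containers are plain dictionaries, quantities are compared as strings (so "250m" and "0.25" would wrongly count as different), and the function name is an assumption.

```python
from typing import Dict, List


def qos_class(containers: List[Dict[str, Dict[str, str]]]) -> str:
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits") or {}
        # A missing request defaults to the limit for that resource
        requests = {**limits, **(c.get("requests") or {})}
        for resource in ("cpu", "memory"):
            if resource not in limits or requests[resource] != limits[resource]:
                return "Burstable"
    return "Guaranteed"


burstable = [{"requests": {"cpu": "500m", "memory": "512Mi"},
              "limits": {"cpu": "2000m", "memory": "1Gi"}}]
guaranteed = [{"requests": {"cpu": "250m", "memory": "256Mi"},
               "limits": {"cpu": "250m", "memory": "256Mi"}}]
print(qos_class(burstable))   # Burstable
print(qos_class(guaranteed))  # Guaranteed
print(qos_class([{}]))        # BestEffort
```

Note that a single container without CPU and memory limits is enough to pull the whole pod down to Burstable.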
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-1-production
value: 1000000
globalDefault: false
description: "Critical payment path: preempts lower tiers"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the template labels;
  # the label value below is illustrative
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      priorityClassName: tier-1-production
      containers:
        - name: api
          image: payments:2.4.0
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "1Gi"
            # QoS: Burstable (limits != requests)
---
# Guaranteed QoS example: limits == requests for all containers
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:
          cpu: "250m"
          memory: "256Mi"
Production recommendations
- Always set memory requests ≈ expected working set; limits at 1.2–1.5× request for Java/Go heap growth.
- Set CPU requests from p95 usage (VPA or metrics); avoid CPU limits on latency-critical services or set limit ≥ 2× request.
- Run tier-1 workloads as Guaranteed or high-priority Burstable with accurate requests.
- Never run production apps as BestEffort—first evicted under node pressure.
- Use PriorityClass so platform components and payment paths preempt batch jobs.
ResourceQuota
Caps aggregate resource consumption per namespace—prevents one team from consuming the entire cluster. Quota is enforced at admission time (create/update pod).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-payments-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
    persistentvolumeclaims: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: payments
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: "50m"
        memory: 64Mi
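The admission check behind a ResourceQuota boils down to "current usage plus the new pod must not exceed the hard cap" for each tracked resource. A hedged sketch using the quota above, with CPU in millicores and memory in Mi; the usage figures and the `exceeded` helper are made up for illustration.

```python
from typing import Dict, List


def exceeded(hard: Dict[str, int], used: Dict[str, int], delta: Dict[str, int]) -> List[str]:
    """Quota entries the new pod would push past their hard cap."""
    return sorted(k for k, d in delta.items()
                  if k in hard and used.get(k, 0) + d > hard[k])


hard = {"requests.cpu": 20_000, "requests.memory": 40 * 1024, "pods": 50}
used = {"requests.cpu": 19_600, "requests.memory": 30_000, "pods": 49}   # hypothetical usage
new_pod = {"requests.cpu": 500, "requests.memory": 512, "pods": 1}

print(exceeded(hard, used, new_pod))  # ['requests.cpu'] -> the pod is rejected at admission
```

Here the namespace still has room for one more pod and plenty of memory, but the 500m CPU request would take it past 20 cores, so the create call fails.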
$ kubectl describe quota -n payments
$ kubectl describe limitrange -n payments
$ kubectl get pods -n payments -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
$ kubectl top pods -n payments
$ oc describe quota -n payments
$ oc get resourcequota -n payments
$ oc get pods -n payments -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
$ oc adm top pods -n payments
OpenShift can apply default ResourceQuotas and LimitRanges to every new project through a customized project request template. For caps that span several namespaces, use ClusterResourceQuota (oc get clusterresourcequota).
CPU limits on/off: Omitting CPU limits avoids CFS throttling surprises but allows burst to consume full node CPU—acceptable on dedicated nodes, risky on multi-tenant pools. Memory limits should never be omitted in production; OOM at node level is worse than container-level OOMKill.
"Guaranteed vs Burstable?" — Guaranteed: limits equal requests on all containers; last evicted, stable CPU shares. Burstable: most common; can burst above request until limit; evicted when node pressure if usage exceeds request. BestEffort: no resources set; first evicted.
Run VPA in recommendation mode, then promote to requests. Pair with LimitRange defaultRequest so developers who omit resources still get sane baselines—quota enforcement then works predictably.
LimitRange default ≠ request in manifest. Pods created without resources get LimitRange defaults but may still show Burstable QoS. For Guaranteed, explicitly set equal requests and limits in the pod spec—defaults alone won't achieve it.
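Why defaults alone end up Burstable becomes clear once the defaulting is written out. The sketch below models the documented behaviour for a Container-type LimitRange: a missing limit takes `default`, and a missing request takes `defaultRequest` unless the container set its own limit, in which case the request follows that limit. The function name is an assumption, and min/max validation is left out.

```python
from typing import Dict


def apply_limitrange(resources: Dict[str, Dict[str, str]],
                     default: Dict[str, str],
                     default_request: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    limits = dict(resources.get("limits") or {})
    requests = dict(resources.get("requests") or {})
    for name in ("cpu", "memory"):
        user_set_limit = name in limits
        if not user_set_limit:
            limits[name] = default[name]
        if name not in requests:
            # A user-supplied limit doubles as the request; otherwise use defaultRequest
            requests[name] = limits[name] if user_set_limit else default_request[name]
    return {"requests": requests, "limits": limits}


default = {"cpu": "500m", "memory": "512Mi"}
default_request = {"cpu": "100m", "memory": "128Mi"}

filled = apply_limitrange({}, default, default_request)
print(filled)
print(filled["requests"] == filled["limits"])  # False -> Burstable, not Guaranteed
```

With the LimitRange from this section, a container that declares nothing gets requests of 100m/128Mi and limits of 500m/512Mi, so requests and limits differ and the pod is Burstable.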