Architecture & Control Plane
Every kubectl apply is an HTTPS request. Every pod placement is a scheduler decision. Every self-healing restart is a controller reconciliation loop. This chapter maps the problem each component solves, the primitive it exposes, and what actually happens under the hood—from etcd Raft quorum to CRI-O on OpenShift nodes.
Cluster Architecture Overview
Kubernetes splits responsibilities between a control plane (global decisions, API, state store) and worker nodes (local execution). The control plane never runs your app containers directly—it tells kubelets what to run and continuously reconciles when reality drifts from desired state.
Control plane components
The control plane is the cluster brain. Each component is a separate process (often static pods on dedicated nodes in kubeadm clusters, or operator-managed pods on OpenShift).
| Component | Problem it solves | Key responsibility |
|---|---|---|
| kube-apiserver | No single front door for all cluster operations | REST HTTPS gateway; auth, admission, validation; only writer to etcd |
| etcd | Cluster state must survive process crashes | Distributed key-value store; Raft consensus; source of truth |
| kube-scheduler | Pods need a node; placement is non-trivial | Filter feasible nodes, score, bind pod.spec.nodeName |
| kube-controller-manager | Desired state ≠ actual state without loops | Runs controllers (Deployment, Node, PV, EndpointSlice, …) |
| cloud-controller-manager | Cloud-specific integration pollutes core K8s | Node lifecycle, load balancers, routes for AWS/GCP/Azure (optional); volumes are handled by CSI drivers |
Worker node components
Worker nodes (and control plane nodes that also run workloads—discouraged in production) host the agents that materialize pods on real hardware.
| Component | Runs on | Role |
|---|---|---|
| kubelet | Every node | Node agent; watches API; drives pod lifecycle via CRI |
| kube-proxy | Every node (usually DaemonSet) | Implements Service ClusterIP/NodePort via iptables, IPVS, or eBPF |
| Container runtime | Every node | containerd or CRI-O → runc (OCI) |
| CNI plugin | Every node | Pod IP assignment, routing (Calico, Cilium, OVN-Kubernetes on OCP) |
End-to-end data flow
You declare a Deployment in YAML. The API server persists it to etcd. The Deployment controller creates ReplicaSets; the ReplicaSet controller creates Pods. Unscheduled pods appear in the scheduler queue; once bound, the kubelet on that node pulls images and starts containers. Status flows back: kubelet → API server → etcd → your kubectl get pods -w watch stream.
flowchart TB
subgraph cp["Control plane nodes"]
API["kube-apiserver"]
ETCD["etcd cluster\nRaft quorum"]
SCH["kube-scheduler"]
CM["kube-controller-manager"]
CCM["cloud-controller-manager"]
end
subgraph w1["Worker node 1"]
KL1["kubelet"]
KP1["kube-proxy"]
CRI1["containerd / CRI-O"]
P1["Pods"]
end
subgraph w2["Worker node 2"]
KL2["kubelet"]
KP2["kube-proxy"]
CRI2["containerd / CRI-O"]
P2["Pods"]
end
CLI["kubectl / oc"] -->|"HTTPS REST"| API
API <-->|"read/write"| ETCD
API --> SCH
API --> CM
API --> CCM
SCH -->|"bind pod"| API
CM -->|"reconcile"| API
KL1 -->|"watch/report"| API
KL2 -->|"watch/report"| API
KL1 --> CRI1 --> P1
KL2 --> CRI2 --> P2
KP1 -.->|"Service VIP"| P1
KP2 -.->|"Service VIP"| P2
CNI["CNI plugin\npod-to-pod"] --- P1
CNI --- P2
$ kubectl get componentstatuses 2>/dev/null || kubectl get --raw='/readyz?verbose'
$ kubectl get nodes -o wide
$ kubectl get pods -n kube-system -o wide
$ kubectl get pods -n openshift-kube-apiserver 2>/dev/null

$ oc get clusteroperators
→ Degraded operators surface control plane issues before workloads fail
$ oc get nodes -o custom-columns=NAME:.metadata.name,ROLES:.metadata.labels.node-role\\.kubernetes\\.io/worker,RUNTIME:.status.nodeInfo.containerRuntimeVersion
$ oc adm top nodes
Control plane components are stateless relative to etcd—they rebuild in-memory caches from watches on startup. etcd is the only durable state. Lose etcd without backups and you lose the cluster identity, even if nodes still run containers.
Running workloads on control plane nodes saves cost in dev but risks noisy neighbor starvation of API/etcd during traffic spikes. Production: taint control plane nodes with node-role.kubernetes.io/control-plane:NoSchedule and keep them workload-free.
API Server (kube-apiserver)
The API server is the single entry point to cluster state. Every kubectl, oc, controller, scheduler, and kubelet interaction is an HTTPS REST call. There is no backdoor around it.
REST API gateway
Resources are addressed by API group, version, namespace (if namespaced), and name: /apis/apps/v1/namespaces/default/deployments/web. kubectl get pods becomes GET /api/v1/namespaces/<ns>/pods with auth headers from kubeconfig.
Request flow
Every mutating request passes through a strict pipeline before etcd sees a byte:
- Authentication: client cert, bearer token, OIDC (OpenShift OAuth)
- Authorization: RBAC, webhook authorizers, Node authorizer
- Mutating admission: built-in mutating plugins (ServiceAccount, LimitRanger defaults) and mutating webhooks
- Validation: OpenAPI schema, immutability rules
- Validating admission: validating webhooks and built-in validators (ResourceQuota, LimitRanger, PodSecurity)
- Persist: encode and write to etcd under /registry/...
- Respond: return object + resourceVersion; emit watch events
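The pipeline can be pictured as a chain of functions that each either pass the request on or reject it. The sketch below is a toy model, not the API server's code: the stage functions, the `Request` type, and the rules inside them (only `admin` may create, `privileged` objects are refused) are all hypothetical. It assumes the upstream ordering in which schema validation runs between mutating and validating admission.

```python
from dataclasses import dataclass, field


class Rejected(Exception):
    """Raised when a pipeline stage refuses the request."""


@dataclass
class Request:
    user: str
    verb: str
    obj: dict
    trace: list = field(default_factory=list)


def authenticate(req):
    if not req.user:
        raise Rejected("401 Unauthorized")


def authorize(req):
    # Hypothetical rule standing in for RBAC: only "admin" may create.
    if req.verb == "create" and req.user != "admin":
        raise Rejected("403 Forbidden")


def mutate(req):
    # Mutating admission may patch the object (here: add a label).
    req.obj.setdefault("labels", {})["injected"] = "true"


def validate_schema(req):
    if "name" not in req.obj:
        raise Rejected("422 Unprocessable Entity")


def validate_policy(req):
    # Validating admission can only accept or reject.
    if req.obj.get("privileged"):
        raise Rejected("403 Forbidden")


STAGES = [authenticate, authorize, mutate, validate_schema, validate_policy]


def handle(req, store):
    """Run every stage in order; persist only if none of them raised."""
    for stage in STAGES:
        stage(req)
        req.trace.append(stage.__name__)
    store[req.obj["name"]] = req.obj
    return req.obj
```

Because persistence is the last step, a rejection at any stage leaves the store untouched, which mirrors why a denied request never reaches etcd.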
flowchart LR
C["Client\nkubectl / controller"] --> A["Authentication"]
A --> R["RBAC Authorization"]
R --> M["Mutating Admission\nwebhooks + ServiceAccount"]
M --> VAL["Schema Validation"]
VAL --> V["Validating Admission\nwebhooks + PSA + Quota"]
V --> E["etcd write"]
E --> W["Watch broadcast"]
W --> C
Admission controllers
Mutating admission
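Runs first, before schema validation and persistence. Mutating plugins can patch objects: inject sidecars, set defaults, add labels. Built-ins include ServiceAccount (mounts the projected token volume) and LimitRanger (applies default requests and limits). MutatingWebhookConfiguration extends this for custom policy (Kyverno, OPA Gatekeeper).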
Validating admission
Runs after mutation and can only accept or reject. Built-in plugins reject with HTTP 403 Forbidden; HTTP 422 is what schema validation returns, not admission. Built-ins include ResourceQuota (namespace caps), LimitRanger (min/max enforcement), and PodSecurity (privileged/baseline/restricted).
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
Watch mechanism
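Controllers and kubelets do not poll. They list once, then watch from the returned resourceVersion, and the API server streams ADDED/MODIFIED/DELETED events (plus periodic BOOKMARK events that advance the resourceVersion). Efficient cluster operation depends on this: when watches break or the watch cache falls behind, every client relists at once, and that thundering herd can overload a restarting API server.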
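To make the resume-from-resourceVersion idea concrete, here is a minimal in-memory sketch. `TinyStore` and `sync` are hypothetical names, and the store keeps its whole event history, which a real API server does not (old versions are compacted and the client must relist).

```python
class TinyStore:
    """In-memory stand-in for the API server's event history."""

    def __init__(self):
        self.version = 0
        self.events = []  # (resourceVersion, event type, object name)

    def write(self, event_type, name):
        self.version += 1
        self.events.append((self.version, event_type, name))
        return self.version

    def watch(self, since):
        """Yield only the events newer than the caller's resourceVersion."""
        for rv, event_type, name in self.events:
            if rv > since:
                yield rv, event_type, name


def sync(store, cache, last_seen):
    """Apply new events to a local cache and return the new resourceVersion."""
    for rv, event_type, name in store.watch(last_seen):
        if event_type == "DELETED":
            cache.discard(name)
        else:  # ADDED or MODIFIED
            cache.add(name)
        last_seen = rv
    return last_seen
```

A client that remembers the returned value can reconnect and receive only what it missed, instead of re-reading every object.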
API groups
| Group | Examples |
|---|---|
| core ("") | Pod, Service, ConfigMap, Secret, Namespace, Node |
| apps | Deployment, ReplicaSet, StatefulSet, DaemonSet |
| batch | Job, CronJob |
| networking.k8s.io | NetworkPolicy, Ingress |
| rbac.authorization.k8s.io | Role, ClusterRole, RoleBinding |
| apiextensions.k8s.io | CustomResourceDefinition → your CRDs |
Server-side apply
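Client-side kubectl apply computes a patch locally from the kubectl.kubernetes.io/last-applied-configuration annotation, which is fragile when several actors edit the same object. Server-side apply (--server-side) moves the merge into the API server and tracks field ownership in metadata.managedFields. Controllers and humans can coexist without clobbering each other's fields, and a conflicting write is rejected unless it is forced with --force-conflicts.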
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:2.4.1
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 1Gi
$ kubectl api-resources --verbs=list,create -o wide
$ kubectl explain deployment.spec.strategy
$ kubectl apply -f deploy.yaml --server-side --field-manager=platform-team
$ kubectl get --raw /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations

$ oc api-resources | grep -E 'route|build|imagestream'
→ route.openshift.io, build.openshift.io, image.openshift.io
$ oc explain route.spec
$ oc apply -f deploy.yaml --server-side --field-manager=ci-pipeline
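Field ownership under server-side apply can be illustrated with a deliberately simplified model. The function name, the flat `owners` dictionary, and the manager name `hpa` are hypothetical; real managedFields are nested per manager, allow shared ownership when values agree, and also remove fields a manager stops applying. None of that is modeled here.

```python
class Conflict(Exception):
    """Another field manager owns a field this apply tries to change."""


def server_side_apply(obj, owners, manager, fields, force=False):
    """Apply `fields` for `manager`, recording who owns each top-level field."""
    conflicts = [
        key for key, value in fields.items()
        if key in owners and owners[key] != manager and obj.get(key) != value
    ]
    if conflicts and not force:
        raise Conflict(f"{manager} conflicts on {sorted(conflicts)}")
    for key, value in fields.items():
        obj[key] = value
        owners[key] = manager  # simplification: last applier becomes sole owner
    return obj
```

The useful property is that a second actor cannot silently overwrite a field it does not own: it has to either drop the field from its manifest or force the change and take ownership.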
OpenShift API extensions
OpenShift ships additional API groups alongside upstream Kubernetes: route.openshift.io/v1 (HAProxy Routes), build.openshift.io/v1 (BuildConfig, S2I), image.openshift.io/v1 (ImageStream), security.openshift.io/v1 (SecurityContextConstraints). They are served through the same API endpoint. oc adds convenience subcommands for these types, but vanilla kubectl can read and write most of them as ordinary resources.
"What happens when you run kubectl apply?" Walk the chain: auth → RBAC → mutating admission → validating admission → etcd → controllers react via watch. Mention admission webhooks are synchronous—slow webhooks delay every matching create/update.
A webhook with failurePolicy: Fail that matches Pods blocks all pod creation cluster-wide whenever its backend is down or timing out. Always set timeoutSeconds, scope webhooks with a namespaceSelector, monitor webhook latency, and use kubectl get --raw /livez during incidents.
etcd
etcd is the cluster's source of truth—a distributed, strongly consistent key-value store. Every Deployment, Secret, and lease lives here. The API server is the only component that should talk to etcd directly.
Distributed KV and Raft consensus
etcd replicates writes across members using the Raft algorithm. A write is committed when a quorum (majority) of members acknowledge it. With 3 members the cluster tolerates 1 failure; with 5, it tolerates 2. Avoid even member counts: a fourth member raises the quorum from 2 to 3 but still tolerates only 1 failure, so it adds replication cost without extra fault tolerance.
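The quorum arithmetic is small enough to check directly. The helper names below are illustrative; the formula itself is the standard majority rule.

```python
def quorum(members):
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def failures_tolerated(members):
    """How many members can fail while a quorum is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"members={n} quorum={quorum(n)} tolerated={failures_tolerated(n)}")
```

Comparing 3 with 4 and 5 with 6 in the printed table shows why odd sizes are preferred: each even size needs a larger quorum than the odd size below it yet tolerates the same number of failures.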
Key structure
Kubernetes objects are stored under /registry/ with paths reflecting resource type:
- /registry/pods/default/nginx
- /registry/deployments/default/web
- /registry/secrets/kube-system/bootstrap-token-abc
Values are protobuf-encoded API objects plus metadata. Event history and leases (for leader election, node heartbeats) also consume keyspace.
Only the API server talks to etcd
Direct etcd access bypasses RBAC and admission—a critical security boundary. Backup tools snapshot via etcdctl with proper certs, not by reading files off disk on running members.
Sizing and performance
- Node count: 3 for most clusters; 5 for higher control plane availability
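- Storage: Dedicated NVMe/SSD; WAL fsync p99 latency under 10 ms; avoid network-attached storage for etcd data dirs
- Quota: Default 2 GB backend quota; when it is exceeded, etcd raises a NOSPACE alarm and rejects writes until the keyspace is compacted and defragmented and the alarm is disarmed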
- Defragmentation: Periodic etcdctl defrag after heavy delete churn
$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health
$ etcdctl endpoint status -w table
$ etcdctl snapshot save /backup/etcd-$(date +%F).db
$ etcdctl get /registry/ --prefix --keys-only | head

$ oc get etcd -o yaml
→ Cluster etcd operator CR shows members, encryption, defrag schedule
$ oc debug node/<control-plane-node> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup
$ oc get clusterversion -o jsonpath='{.items[0].status.conditions[?(@.type=="Progressing")].message}'
OpenShift: encryption at rest
OCP 4.x supports etcd encryption at rest, enabled by setting spec.encryption.type (for example aescbc) on the APIServer resource named cluster. Secrets, ConfigMaps, Routes, and OAuth tokens are encrypted in etcd; other resources remain plaintext. The platform generates and rotates the encryption keys itself, and the initial encryption pass rewrites existing objects, so enable it at install time or in a planned window.
{
  "apiVersion": "config.openshift.io/v1",
  "kind": "APIServer",
  "metadata": { "name": "cluster" },
  "spec": {
    "encryption": { "type": "aescbc" }
  }
}
Monitor etcd_mvcc_db_total_size_in_bytes and etcd_server_has_leader. Alert at 80% of quota. Automate daily snapshots; test restore quarterly—an untested backup is wishful thinking.
The most common production etcd incident is disk latency spike on a cloud volume—not CPU. Moving etcd to local SSD or dedicated instances often fixes mysterious API timeouts that look like "network issues."
Scheduler (kube-scheduler)
New pods have spec.nodeName empty. The scheduler picks a node through filter → score → bind. It does not start containers—that is the kubelet's job after binding.
Scheduling pipeline
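- Queue: unscheduled pods enter the activeQ; pods that failed a scheduling attempt wait in the backoffQ before retrying
- Filtering: eliminate nodes that cannot fit the pod (hard constraints)
- Scoring: rank remaining nodes (soft preferences)
- Binding: a POST to the pod's binding subresource sets pod.spec.nodeName (the scheduler assumes the placement optimistically and binds asynchronously)
- Preemption: if filtering leaves no feasible node, lower-priority pods may be evicted to make room (optional)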
Filter plugins (hard constraints)
| Filter | Rejects node when… |
|---|---|
| NodeSelector / NodeAffinity | Labels don't match required rules |
| TaintToleration | Pod lacks toleration for node taint |
| NodeResourcesFit | Insufficient CPU/memory/ephemeral-storage |
| PodAffinity / PodAntiAffinity | Co-location rules cannot be satisfied |
| VolumeBinding | PVC topology or immediate binding constraints fail |
| NodePorts | Requested hostPort already in use on node |
Score plugins (soft preferences)
- LeastAllocated — spread load; prefer nodes with more free resources
- MostAllocated — bin-pack; reduce fragmentation (common with cluster autoscaler)
- ImageLocality — prefer nodes that already have the image pulled
- NodeAffinity — weight preferred affinity terms
- TopologySpread — balance across zones/racks
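Filter-then-score can be sketched in a few lines. This is an illustration under simplifying assumptions, not the scheduler's plugin framework: nodes and pods are plain dictionaries with hypothetical keys, only resource fit and taints are filtered, and the only score is a LeastAllocated-style average of the free CPU and memory fractions.

```python
def fits(node, pod):
    """Hard constraints: enough free resources and every taint tolerated."""
    return (
        node["free_cpu"] >= pod["cpu"]
        and node["free_mem"] >= pod["mem"]
        and set(node["taints"]) <= set(pod["tolerations"])
    )


def least_allocated(node, pod):
    """Soft preference: higher score for nodes left emptier after placement."""
    cpu_left = (node["free_cpu"] - pod["cpu"]) / node["cpu"]
    mem_left = (node["free_mem"] - pod["mem"]) / node["mem"]
    return 100 * (cpu_left + mem_left) / 2


def schedule(pod, nodes):
    """Return the name of the best feasible node, or None if nothing fits."""
    feasible = [n for n in nodes if fits(n, pod)]
    if not feasible:
        return None  # the pod stays Pending
    return max(feasible, key=lambda n: least_allocated(n, pod))["name"]
```

Swapping `max` for `min` would approximate MostAllocated bin-packing, which is the trade-off between spreading and consolidation that the two scoring strategies express.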
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload
                    operator: In
                    values: [compute]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: payment-api
                topologyKey: kubernetes.io/hostname
      tolerations:
        - key: dedicated
          operator: Equal
          value: payments
          effect: NoSchedule
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-api
      containers:
        - name: api
          image: payment-api:3.2.0
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
Custom schedulers and profiles
Set pod.spec.schedulerName to route pods to a custom scheduler deployment. K8s 1.19+ scheduling profiles let one scheduler binary run multiple plugin configurations. The descheduler (separate project) evicts pods to rebalance violated constraints after initial placement.
$ kubectl describe pod payment-api-7d4f8b-xyz | tail -20
→ Events: 0/12 nodes available: 3 Insufficient memory, 2 node(s) had taint …
$ kubectl get events --field-selector involvedObject.name=payment-api-7d4f8b-xyz
$ kubectl logs -n kube-system -l component=kube-scheduler --tail=50

$ oc describe pod payment-api-7d4f8b-xyz | grep -A5 Events
$ oc adm inspect ns/production --dest-dir=/tmp/inspect
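kubectl describe pod Events are your first stop for Pending pods. "Insufficient cpu" means the pod's requests do not fit in a node's remaining allocatable capacity (allocatable minus the requests of pods already placed there); limits play no part in scheduling. Fix requests or add nodes; tweaking limits alone won't schedule the pod.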
MostAllocated maximizes node utilization but increases blast radius when a node fails. LeastAllocated + topology spread improves resilience at the cost of more nodes and cross-AZ traffic.
Controller Manager
Kubernetes is a collection of control loops. Each controller watches a resource type and reconciles actual state toward desired state. The kube-controller-manager bundles dozens of these loops in one binary.
Reconciliation loop pattern
Pseudocode every controller follows:
- Watch API for changes (or periodic resync)
- Enqueue object key into work queue
- Read current state from API
- Compare to desired spec; compute diff
- Emit create/update/delete calls to API server
- Requeue on error with exponential backoff
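The steps above can be sketched as a work queue feeding a reconcile function. Everything here is a stand-in: `desired` and `actual` are dictionaries rather than API objects, pod names come from a local counter, and error handling with backoff is left out to keep the level-triggered comparison visible.

```python
from collections import deque
import itertools

pod_ids = itertools.count(1)


def reconcile(key, desired, actual):
    """Compare desired replicas with the pods that exist and close the gap."""
    want = desired.get(key, 0)
    have = actual.setdefault(key, set())
    while len(have) < want:  # too few: create
        have.add(f"{key}-{next(pod_ids)}")
    while len(have) > want:  # too many: delete
        have.remove(max(have))


def drain(queue, desired, actual):
    """Process queued object keys until the work queue is empty."""
    while queue:
        reconcile(queue.popleft(), desired, actual)
```

Note that `reconcile` never looks at what kind of event queued the key. It only compares current state with desired state, so a lost pod, a scale-up, and a scale-down all go through the same code path.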
flowchart LR
D["Deployment\nreplicas: 3"] --> RS["Deployment controller\ncreates/updates RS"]
RS --> RSC["ReplicaSet\nreplicas: 3"]
RSC --> P["ReplicaSet controller\ncreates Pods"]
P --> POD["3 Running Pods"]
POD -->|"node failure"| P
P -->|"recreate"| POD
Key controllers
| Controller | Watches | Action |
|---|---|---|
| Deployment | Deployment | Manages ReplicaSets; rolling updates via RS scaling |
| ReplicaSet | ReplicaSet | Maintains pod count matching selector + replicas |
| StatefulSet | StatefulSet | Ordered pods with stable network ID + PVC templates |
| DaemonSet | DaemonSet | One pod per matching node (logging, CNI, kube-proxy) |
| Job / CronJob | Job, CronJob | Run-to-completion workloads; schedule CronJobs |
| Node | Node | Taint nodes on NotReady; evict pods after timeout |
| ServiceAccount | Namespace, ServiceAccount | Ensure a default ServiceAccount in every namespace; token Secrets were auto-created only before 1.24, tokens now come from TokenRequest |
| EndpointSlice | Service, Pod | Populate backend endpoints for Service VIP |
| Namespace | Namespace | Finalize deletion—remove all namespaced objects |
| PersistentVolume | PV, PVC | Bind claims; reclaim released volumes per policy (Retain/Delete) |
Leader election
Only one kube-controller-manager instance is active per cluster. Standby replicas compete for a coordination.k8s.io/Lease object; the scheduler and cloud-controller-manager use the same pattern. When the leader stops renewing, a standby takes over once the lease expires (15 s lease duration by default), and its controllers rebuild state by listing and watching through the API server.
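Lease-based election reduces to one rule: a candidate may take the lease if it is free, already its own, or expired. The sketch below models only that rule with hypothetical names (`Lease`, `try_acquire`, `cm-a`, `cm-b`) and an explicit `now` argument; real clients also rely on the API server's optimistic concurrency so that two candidates cannot both win the same update.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str = ""
    renew_time: float = 0.0
    duration: float = 15.0  # seconds the holder may go without renewing


def try_acquire(lease, candidate, now):
    """Acquire or renew the lease; return True if `candidate` leads afterwards."""
    expired = now - lease.renew_time > lease.duration
    if lease.holder in ("", candidate) or expired:
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False
```

A standby that calls `try_acquire` on a timer keeps failing while the leader renews and succeeds shortly after the leader goes silent, which is the failover behaviour described above.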
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
$ kubectl rollout status deployment/web
$ kubectl get rs -l app=web --show-labels
→ Two ReplicaSets during rollout: old + new
$ kubectl get lease -n kube-system | grep controller
$ kubectl scale deployment web --replicas=5

$ oc rollout status deployment/web
$ oc get deploymentconfig 2>/dev/null || oc get deploy web -o yaml
→ DeploymentConfig is deprecated as of OCP 4.14; prefer Deployment. Existing DeploymentConfigs are still reconciled by the OCP controller
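The effect of maxSurge and maxUnavailable is easiest to see by stepping through the old and new ReplicaSet sizes. This simulation is a sketch under a strong assumption: new pods become ready instantly, so readiness delays and failed rollouts are not modeled. The function name is hypothetical.

```python
def rolling_update(replicas, max_surge, max_unavailable):
    """Return the (old, new) ReplicaSet sizes after each scaling step."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up the new ReplicaSet, capped at replicas + maxSurge pods in total.
        new += max(0, min(replicas - new, replicas + max_surge - (old + new)))
        if steps[-1] != (old, new):
            steps.append((old, new))
        # Scale down the old one, keeping replicas - maxUnavailable pods running.
        old -= max(0, min(old, (old + new) - (replicas - max_unavailable)))
        if steps[-1] != (old, new):
            steps.append((old, new))
    return steps


if __name__ == "__main__":
    for old, new in rolling_update(3, 1, 0):
        print(f"old={old} new={new} total={old + new}")
```

With replicas 3, maxSurge 1, and maxUnavailable 0, the total never drops below 3 and never exceeds 4: capacity is preserved at the cost of one extra pod during the rollout.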
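Controllers are level-triggered, not edge-triggered. Each reconcile compares the full current state with the desired state, so a missed event is repaired by the next event for that object or by the informer's periodic resync (the interval is configurable; kube-controller-manager's --min-resync-period defaults to 12h). This gives self-healing without requiring a perfectly reliable watch stream.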
"Deployment vs ReplicaSet?" — ReplicaSet ensures N pods exist. Deployment is a higher-level controller that owns ReplicaSets and implements rolling updates by creating a new RS and scaling down the old one—users rarely create ReplicaSets directly.
kubelet
The kubelet is the node agent. It watches the API for pods bound to its node, instructs the container runtime via CRI, runs probes, reports status, and enforces pod lifecycle on actual hardware.
Pod lifecycle on the node
- kubelet accepts pod spec (API-assigned or static manifest)
- Pull images via CRI if not cached
- Create pod sandbox (network namespace via CNI)
- Start init containers sequentially, then app containers
- Run liveness/readiness/startup probes
- Restart failed containers per restartPolicy
- Report PodStatus phases: Pending → Running → Succeeded/Failed
CRI path: containerd / CRI-O → runc
The kubelet speaks CRI (gRPC)—not Docker directly since K8s 1.24 dropped dockershim. containerd (default on many distros) or CRI-O (OpenShift default) pulls images, creates sandboxes, and invokes runc to start OCI bundles.
flowchart LR
API["API Server"] -->|"pod spec"| KL["kubelet"]
KL -->|"CRI gRPC"| RT["containerd / CRI-O"]
RT --> RUN["runc"]
RUN --> C["containers"]
RT --> CNI["CNI ADD"]
CNI --> C
KL -->|"status"| API
Health probes
| Probe | Purpose | Failure action |
|---|---|---|
| Startup | Slow-starting apps (JVM warmup) | Kill container; liveness disabled until success |
| Liveness | Is the process deadlocked? | Restart container |
| Readiness | Can this instance receive traffic? | Remove from Service endpoints (no restart) |
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: api:2.1.0
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 15
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          memory: 1Gi
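The probe settings in the manifest translate into time budgets that are worth computing before a rollout. The helpers below are an illustrative sketch with hypothetical names; they ignore initialDelaySeconds and per-probe timeouts and model only the consecutive-failure rule.

```python
def startup_budget_seconds(failure_threshold, period_seconds):
    """Longest a container may take to start before the startup probe kills it."""
    return failure_threshold * period_seconds


def first_trip(results, failure_threshold):
    """Index of the probe attempt that reaches the threshold, or None.

    `results` is a sequence of booleans, True meaning the probe succeeded.
    Only consecutive failures count; any success resets the streak.
    """
    streak = 0
    for attempt, ok in enumerate(results):
        streak = 0 if ok else streak + 1
        if streak >= failure_threshold:
            return attempt
    return None
```

For the startup probe shown, `startup_budget_seconds(30, 10)` gives 300 seconds. A flapping endpoint that alternates between failure and success never trips a threshold of 3, which is why an overloaded but responsive app is not restarted by a well-tuned liveness probe.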
Node status, GC, and static pods
The kubelet posts NodeStatus (capacity, allocatable, conditions like Ready, DiskPressure). It garbage-collects dead containers and unused images. Static pods, defined by manifests in /etc/kubernetes/manifests/, are managed locally by the kubelet, which creates a read-only mirror pod in the API server for each one (this is how control plane components run on kubeadm clusters).
$ kubectl describe node worker-1 | grep -A10 Conditions
$ kubectl get pods -o wide --field-selector spec.nodeName=worker-1
$ crictl pods && crictl ps
$ journalctl -u kubelet -f

$ oc describe node worker-1
→ containerRuntimeVersion shows cri-o://1.28.x on RHCOS
$ oc debug node/worker-1 -- chroot /host crictl ps
$ oc adm node-logs worker-1 -u kubelet
OpenShift: CRI-O default
RHCOS nodes run CRI-O exclusively; there is no containerd option. Image pulls integrate with the internal registry and with ImageContentSourcePolicy (superseded by ImageDigestMirrorSet in newer releases) for disconnected installs. Debugging uses oc debug node/... because direct SSH to nodes is discouraged.
A liveness probe hitting the same endpoint as readiness restarts pods under load when the app is merely slow—not dead. Use startup probes for JVM apps; keep liveness checks lightweight and distinct from readiness.
ImagePullBackOff at the kubelet layer means registry auth, missing image, or rate limit—not a scheduler issue. Check crictl pull on the node and imagePullSecrets before chasing Deployment controllers.
kube-proxy
Services get a stable virtual IP (ClusterIP). kube-proxy programs node networking so traffic to that VIP reaches healthy pod backends. It implements the Service abstraction, not pod-to-pod routing.
What kube-proxy does
kube-proxy watches Service and EndpointSlice objects. For each Service it installs rules mapping ClusterIP:port to pod IP:targetPort, load-balanced across ready endpoints. It also installs the rules for NodePort and externalIPs traffic, so packets arriving on a node port or an external IP reach the same backends.
Modes: iptables vs IPVS vs eBPF
| Mode | Mechanism | Trade-offs |
|---|---|---|
| iptables (default) | Chain of NAT rules per Service | Simple; rules are evaluated sequentially, so large Service counts scale poorly; random backend selection |
| IPVS | Kernel IP Virtual Server | Better scalability; L4 load balancing algorithms (rr, lc, dh) |
| eBPF / Cilium | Cilium replaces kube-proxy with eBPF maps | Lower latency; unified policy + LB; requires Cilium dataplane |
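The "random backend selection" of iptables mode comes from a chain of rules that each match with a fixed probability, with the last rule catching whatever remains. The sketch below reproduces that arithmetic with exact fractions; the function names are illustrative, and the 1/(n - i) probabilities are the commonly described scheme rather than something read from a live rule set.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Match probability of each rule in a chain over n backends."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(n):
    """Fraction of new connections each backend actually receives."""
    shares, remaining = [], Fraction(1)
    for p in rule_probabilities(n):
        shares.append(remaining * p)  # reached this rule and matched it
        remaining *= 1 - p            # fell through to the next rule
    return shares


if __name__ == "__main__":
    print(rule_probabilities(3))
    print(effective_share(3))
```

Although the per-rule probabilities differ, the effective shares come out equal, so every backend is chosen with the same likelihood. The sequential evaluation is also why chain length matters for performance in this mode.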
flowchart LR
C["Client pod"] -->|"ClusterIP:80"| KP["kube-proxy rules\niptables/IPVS"]
KP --> P1["Pod 10.0.1.5:8080"]
KP --> P2["Pod 10.0.2.7:8080"]
C2["Pod A"] -->|"direct pod IP"| CNI["CNI routing"]
CNI --> P3["Pod B"]
note["kube-proxy does NOT handle Pod A → Pod B"] --- CNI
What kube-proxy does NOT do
Pod-to-pod communication is handled by the CNI plugin (routing, overlay, eBPF). kube-proxy only intercepts traffic destined for Service VIPs. DNS resolution (my-svc.namespace.svc.cluster.local) is handled by CoreDNS, which is also a separate component.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: web-abc
  labels:
    kubernetes.io/service-name: web
addressType: IPv4
ports:
  - port: 8080
    protocol: TCP
endpoints:
  - addresses: ["10.128.0.12"]
    conditions:
      ready: true
    targetRef:
      kind: Pod
      name: web-7f8d9-xk2lp
$ kubectl get svc web -o wide
$ kubectl get endpointslices -l kubernetes.io/service-name=web
$ kubectl get cm -n kube-system kube-proxy -o yaml | grep mode
$ iptables-save | grep 'default/web'

$ oc get svc web
$ oc get network/cluster -o yaml
→ OVN-Kubernetes on OCP; kube-proxy may be disabled when eBPF/OVN handles Services
$ oc exec -it debug-pod -- curl -s http://web.default.svc:80/healthz
Recent OCP 4.x releases default to the OVN-Kubernetes CNI, which implements Service load balancing itself. Depending on cluster version and network configuration, kube-proxy may not be the active dataplane, so check the Network cluster operator before debugging iptables rules.
EndpointSlice controller populates backends; kube-proxy reacts. If pods are Running but Service has no endpoints, check selector labels and readiness probes—not kube-proxy itself.
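The selector-and-readiness check can be reproduced in a few lines, which is handy when reasoning about an empty endpoint list. Pods here are plain dictionaries with hypothetical keys; the real controller also handles ports, terminating pods, and services without selectors.

```python
def endpoints_for(service_selector, pods):
    """IPs of pods whose labels include the selector and that are ready."""
    return sorted(
        pod["ip"]
        for pod in pods
        if service_selector.items() <= pod["labels"].items() and pod["ready"]
    )


if __name__ == "__main__":
    pods = [
        {"ip": "10.128.0.12", "labels": {"app": "web"}, "ready": True},
        {"ip": "10.128.0.13", "labels": {"app": "web"}, "ready": False},
        {"ip": "10.128.0.14", "labels": {"app": "wbe"}, "ready": True},
    ]
    print(endpoints_for({"app": "web"}, pods))
```

The second pod is excluded because it is not ready and the third because of a label typo. Both are Running from the kubelet's point of view, which is exactly the situation where a Service shows no endpoints even though kube-proxy is healthy.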
OpenShift Control Plane Additions
OpenShift wraps upstream Kubernetes with operators that manage cluster lifecycle, node OS, platform services, and enterprise integrations. Understanding these is essential for OCP operations and CKA-adjacent SRE work.
Machine Config Operator (MCO)
RHCOS nodes are immutable—no yum install or SSH patching. MCO renders MachineConfig objects into Ignition configs, drains nodes, and reboots to apply kernel args, kubelet settings, registry certs, and chrony configuration.
Cluster Version Operator (CVO)
CVO drives cluster upgrades—OCP x.y.z → next z-stream or minor version. It coordinates image updates across control plane operators, waits for health, and surfaces progress via ClusterVersion status. Blocked upgrades often mean ClusterOperators are Degraded.
Operator Lifecycle Manager (OLM)
OLM installs and upgrades optional (add-on) operators from OperatorHub: CSV (ClusterServiceVersion) lifecycle, CRD ownership, and dependency resolution. Core platform operators are managed by the CVO instead. Platform teams install cert-manager, Service Mesh, or custom operators through OLM.
Image Registry Operator
Provides a default internal registry (image-registry.openshift-image-registry.svc:5000) or integrates with external S3/GCS. Manages TLS, storage PVC, and routing for oc import-image workflows.
Authentication Operator
Configures OAuth server, identity providers (LDAP, HTPasswd, OIDC), and console login. Integrates with Kubernetes RBAC via OpenShift groups mapped to ClusterRoleBindings.
RHCOS (Red Hat CoreOS)
Minimal, immutable OS purpose-built for containers. Nodes join via Ignition on first boot; updates ship as OSTree images applied by MCO during upgrades—same mechanism as Fedora CoreOS, enterprise-hardened for OCP.
flowchart TB
CVO["Cluster Version Operator"] --> CO["ClusterOperators\n50+ platform operators"]
MCO["Machine Config Operator"] --> RHCOS["RHCOS nodes\nIgnition + OSTree"]
OLM["Operator Lifecycle Manager"] --> OH["OperatorHub CSVs"]
AUTH["Authentication Operator"] --> OAuth["OAuth / IdP"]
REG["Image Registry Operator"] --> IR["Internal registry"]
CO --> API["kube-apiserver"]
RHCOS --> KL["kubelet + CRI-O"]
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
  maxUnavailable: 1
$ # vanilla K8s has no ClusterOperators; use component logs
$ kubectl get pods -n kube-system

$ oc get clusteroperators
$ oc get clusterversion version
$ oc get mcp
→ MASTER/WORKER pools show UPDATED/DEGRADED/UPDATING
$ oc get co authentication image-registry kube-apiserver -o yaml
$ oc adm upgrade --to=4.14.12
Start every OCP incident with oc get co. If kube-apiserver or etcd operator is Degraded, workload symptoms are downstream. Fix platform operators before debugging application Deployments.
OCP's opinionated stack (CRI-O, OVN, SCC, internal registry) reduces integration toil but increases migration friction from vanilla K8s manifests. Design for portable core APIs; isolate OCP-specific Routes and SCC grants.