Architecture & Control Plane

Cluster Architecture Overview

Kubernetes splits responsibilities between a control plane (global decisions, API, state store) and worker nodes (local execution). The control plane never runs your app containers directly—it tells kubelets what to run and continuously reconciles when reality drifts from desired state.

Control plane components

The control plane is the cluster brain. Each component is a separate process (often static pods on dedicated nodes in kubeadm clusters, or operator-managed pods on OpenShift).

Component	Problem it solves	Key responsibility
kube-apiserver	No single front door for all cluster operations	REST HTTPS gateway; auth, admission, validation; only writer to etcd
etcd	Cluster state must survive process crashes	Distributed key-value store; Raft consensus; source of truth
kube-scheduler	Pods need a node; placement is non-trivial	Filter feasible nodes, score, bind pod.spec.nodeName
kube-controller-manager	Desired state ≠ actual state without loops	Runs controllers (Deployment, Node, PV, EndpointSlice, …)
cloud-controller-manager	Cloud-specific integration pollutes core K8s	Node lifecycle, LB, routes, volumes for AWS/GCP/Azure (optional)

Worker node components

Worker nodes (and control plane nodes that also run workloads—discouraged in production) host the agents that materialize pods on real hardware.

Component	Runs on	Role
kubelet	Every node	Node agent; watches API; drives pod lifecycle via CRI
kube-proxy	Every node (usually DaemonSet)	Implements Service ClusterIP/NodePort via iptables or IPVS (an eBPF dataplane such as Cilium can replace it)
Container runtime	Every node	containerd or CRI-O → runc (OCI)
CNI plugin	Every node	Pod IP assignment, routing (Calico, Cilium, OVN-Kubernetes on OCP)

End-to-end data flow

You declare a Deployment in YAML. The API server persists it to etcd. The Deployment controller creates ReplicaSets; the ReplicaSet controller creates Pods. Unscheduled pods appear in the scheduler queue; once bound, the kubelet on that node pulls images and starts containers. Status flows back: kubelet → API server → etcd → your kubectl get pods -w watch stream.

flowchart TB
  subgraph cp["Control plane nodes"]
    API["kube-apiserver"]
    ETCD["etcd cluster\nRaft quorum"]
    SCH["kube-scheduler"]
    CM["kube-controller-manager"]
    CCM["cloud-controller-manager"]
  end
  subgraph w1["Worker node 1"]
    KL1["kubelet"]
    KP1["kube-proxy"]
    CRI1["containerd / CRI-O"]
    P1["Pods"]
  end
  subgraph w2["Worker node 2"]
    KL2["kubelet"]
    KP2["kube-proxy"]
    CRI2["containerd / CRI-O"]
    P2["Pods"]
  end
  CLI["kubectl / oc"] -->|"HTTPS REST"| API
  API <-->|"read/write"| ETCD
  API --> SCH
  API --> CM
  API --> CCM
  SCH -->|"bind pod"| API
  CM -->|"reconcile"| API
  KL1 -->|"watch/report"| API
  KL2 -->|"watch/report"| API
  KL1 --> CRI1 --> P1
  KL2 --> CRI2 --> P2
  KP1 -.->|"Service VIP"| P1
  KP2 -.->|"Service VIP"| P2
  CNI["CNI plugin\npod-to-pod"] --- P1
  CNI --- P2

$ kubectl get componentstatuses 2>/dev/null || kubectl get --raw='/readyz?verbose'
$ kubectl get nodes -o wide
$ kubectl get pods -n kube-system -o wide
$ kubectl get pods -n openshift-kube-apiserver 2>/dev/null
$ oc get clusteroperators
→ Degraded operators surface control plane issues before workloads fail
$ oc get nodes -o custom-columns=NAME:.metadata.name,ROLES:.metadata.labels.node-role\\.kubernetes\\.io/worker,RUNTIME:.status.nodeInfo.containerRuntimeVersion
$ oc adm top nodes

🔬 Under the Hood

Control plane components are stateless relative to etcd—they rebuild in-memory caches from watches on startup. etcd is the only durable state. Lose etcd without backups and you lose the cluster identity, even if nodes still run containers.

⚖️ Trade-off

Running workloads on control plane nodes saves cost in dev but risks noisy neighbor starvation of API/etcd during traffic spikes. Production: taint control plane nodes with node-role.kubernetes.io/control-plane:NoSchedule and keep them workload-free.

API Server (kube-apiserver)

The API server is the single entry point to cluster state. Every kubectl, oc, controller, scheduler, and kubelet interaction is an HTTPS REST call. There is no backdoor around it.

REST API gateway

Resources are addressed by API group, version, namespace (if namespaced), and name: /apis/apps/v1/namespaces/default/deployments/web. kubectl get pods becomes GET /api/v1/namespaces/<ns>/pods with auth headers from kubeconfig.
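
The addressing scheme above can be sketched as a small path builder. This is an illustration only, not client code: the helper name api_path is hypothetical, and it covers just the common cases (core vs. named groups, namespaced vs. cluster-scoped resources).

```python
from typing import Optional

def api_path(group: str, version: str, resource: str,
             namespace: Optional[str] = None,
             name: Optional[str] = None) -> str:
    """Build a Kubernetes REST path (hypothetical helper, not a real client API)."""
    # The core group ("") is served under /api; named groups under /apis/<group>.
    parts = ["/api", version] if group == "" else ["/apis", group, version]
    if namespace is not None:
        parts += ["namespaces", namespace]
    parts.append(resource)
    if name is not None:
        parts.append(name)
    return "/".join(parts)

print(api_path("apps", "v1", "deployments", "default", "web"))
print(api_path("", "v1", "pods", "default"))
print(api_path("", "v1", "nodes"))
```

The core group is the one special case worth remembering: it has no group segment and uses /api rather than /apis.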

Request flow

Every mutating request passes through a strict pipeline before etcd sees a byte:

Authentication: client certificate, bearer token, or OIDC (on OpenShift, tokens issued by the built-in OAuth server)
Authorization: RBAC, webhook authorizers, Node authorizer
Mutating admission: mutating webhooks and built-in mutating plugins (ServiceAccount, LimitRanger defaults)
Validation: OpenAPI schema, immutability rules
Validating admission: validating webhooks and built-in plugins (ResourceQuota, PodSecurity)
Persist: encode and write to etcd under /registry/...
Respond: return object + resourceVersion; emit watch events

flowchart LR
  C["Client\nkubectl / controller"] --> A["Authentication"]
  A --> R["RBAC Authorization"]
  R --> M["Mutating Admission\nwebhooks + SA + LimitRanger"]
  M --> VAL["Schema Validation"]
  VAL --> V["Validating Admission\nwebhooks + PSA + Quota"]
  V --> E["etcd write"]
  E --> W["Watch broadcast"]
  W --> C

Admission controllers

Mutating admission

Runs before persistence. Can patch objects: inject sidecars, set defaults, add labels. MutatingWebhookConfiguration extends this for custom policy (Kyverno, OPA Gatekeeper).

Validating admission

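Runs after mutation, so it sees the final object, and can only accept or reject. Built-in plugins typically reject with HTTP 403 Forbidden; 422 is what schema validation failures return, not admission denials. Built-ins include ResourceQuota (namespace caps), LimitRanger (enforces min/max; its defaulting happens in the mutating phase), and PodSecurity (privileged/baseline/restricted). The ServiceAccount plugin both mutates (sets the default service account and injects the token volume) and validates.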
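
The mutate-then-validate ordering can be sketched in a few lines. This is a toy model, not the API server's implementation: the pod is a plain dict, and the function names (mutate_add_defaults, validate_not_privileged, admit) and the PermissionError used for rejection are all assumptions made for illustration.

```python
import copy

def mutate_add_defaults(pod: dict) -> dict:
    """Mutating step: return a patched copy (a default label and memory limit)."""
    patched = copy.deepcopy(pod)
    patched.setdefault("labels", {}).setdefault("team", "unassigned")
    patched.setdefault("limits", {}).setdefault("memory", "512Mi")
    return patched

def validate_not_privileged(pod: dict) -> None:
    """Validating step: may only accept or reject, never modify."""
    if pod.get("privileged", False):
        raise PermissionError("privileged pods are not allowed")

def admit(pod: dict, mutators, validators) -> dict:
    # All mutators run first, so validators always see the final object.
    for mutate in mutators:
        pod = mutate(pod)
    for validate in validators:
        validate(pod)  # raises to reject the request
    return pod

admitted = admit({"name": "web"}, [mutate_add_defaults], [validate_not_privileged])
print(admitted)
```

Because validators run last, a policy check never has to guess what a later mutation might add.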

apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Watch mechanism

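Controllers and kubelets do not poll; they list once, then watch from the returned resourceVersion. The API server streams ADDED, MODIFIED, and DELETED events, plus periodic BOOKMARK events that only advance the resourceVersion. Efficient cluster operation depends on this: when clients lose their watches and must re-list all at once, for example after an API server restart, the resulting thundering herd can overload the API server.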
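
A minimal sketch of how a watcher might fold such an event stream into a local cache. The event dicts are simplified stand-ins for the real wire format (which uses the types ADDED, MODIFIED, DELETED, and BOOKMARK), and apply_events is a hypothetical helper, not a client-go or client library function.

```python
def apply_events(cache: dict, events: list) -> str:
    """Apply watch events to a local cache; return the last resourceVersion seen."""
    last_rv = ""
    for event in events:
        obj = event["object"]
        if event["type"] in ("ADDED", "MODIFIED"):
            cache[(obj["namespace"], obj["name"])] = obj
        elif event["type"] == "DELETED":
            cache.pop((obj["namespace"], obj["name"]), None)
        # BOOKMARK events change nothing in the cache; they only move the cursor.
        last_rv = obj["resourceVersion"]
    return last_rv

cache = {}
events = [
    {"type": "ADDED", "object": {"namespace": "default", "name": "a", "resourceVersion": "101"}},
    {"type": "MODIFIED", "object": {"namespace": "default", "name": "a", "resourceVersion": "105"}},
    {"type": "ADDED", "object": {"namespace": "default", "name": "b", "resourceVersion": "106"}},
    {"type": "DELETED", "object": {"namespace": "default", "name": "a", "resourceVersion": "110"}},
    {"type": "BOOKMARK", "object": {"resourceVersion": "120"}},
]
last_rv = apply_events(cache, events)
print(sorted(cache), last_rv)
```

After a disconnect the watcher would resume from last_rv instead of re-listing everything, which is what keeps reconnects cheap.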

API groups

Group	Examples
core ("")	Pod, Service, ConfigMap, Secret, Namespace, Node
apps	Deployment, ReplicaSet, StatefulSet, DaemonSet
batch	Job, CronJob
networking.k8s.io	NetworkPolicy, Ingress
rbac.authorization.k8s.io	Role, ClusterRole, RoleBinding
apiextensions.k8s.io	CustomResourceDefinition → your CRDs

Server-side apply

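Client-side kubectl apply computes a three-way merge locally from the kubectl.kubernetes.io/last-applied-configuration annotation, which is fragile when several actors edit the same object. Server-side apply (--server-side) moves the merge into the API server and tracks field ownership in metadata.managedFields. Controllers and humans can coexist without clobbering each other's fields; a conflicting write is rejected unless the client passes --force-conflicts.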
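
Field ownership can be illustrated with a toy model. This is a deliberately simplified sketch: fields are flat string paths rather than the structured managedFields format, removal of fields a manager stops applying is not modeled, and the names Conflict and server_side_apply are hypothetical.

```python
class Conflict(Exception):
    """Raised when an apply would change a field owned by another manager."""

def server_side_apply(live: dict, owners: dict, manager: str,
                      applied: dict, force: bool = False) -> None:
    # A conflict exists when another manager owns the field and the value differs.
    conflicts = [f for f, v in applied.items()
                 if owners.get(f, set()) - {manager} and live.get(f) != v]
    if conflicts and not force:
        raise Conflict(f"{manager} conflicts on: {', '.join(sorted(conflicts))}")
    for field, value in applied.items():
        if field in conflicts:
            owners[field] = set()  # forced: previous owners lose the field
        live[field] = value
        owners.setdefault(field, set()).add(manager)

live, owners = {}, {}
server_side_apply(live, owners, "platform-team", {"spec.replicas": 3, "image": "api:2.4.1"})
try:
    server_side_apply(live, owners, "autoscaler", {"spec.replicas": 5})
except Conflict as err:
    print("rejected:", err)
server_side_apply(live, owners, "autoscaler", {"spec.replicas": 5}, force=True)
print(live, owners)
```

Applying the same value as another manager is not a conflict in this model; both simply end up as co-owners of the field.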

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:2.4.1
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 1Gi

$ kubectl api-resources --verbs=list,create -o wide
$ kubectl explain deployment.spec.strategy
$ kubectl apply -f deploy.yaml --server-side --field-manager=platform-team
$ kubectl get --raw /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations
$ oc api-resources | grep -E 'route|build|imagestream'
→ route.openshift.io, build.openshift.io, image.openshift.io
$ oc explain route.spec
$ oc apply -f deploy.yaml --server-side --field-manager=ci-pipeline

OpenShift API extensions

OpenShift ships additional API groups alongside upstream Kubernetes: route.openshift.io/v1 (HAProxy Routes), build.openshift.io/v1 (BuildConfig, S2I), image.openshift.io/v1 (ImageStream), security.openshift.io/v1 (SecurityContextConstraints). The same API server serves them—oc is aware; vanilla kubectl works for most resources.

🎯 Interview Tip

"What happens when you run kubectl apply?" Walk the chain: auth → RBAC → mutating admission → validating admission → etcd → controllers react via watch. Mention admission webhooks are synchronous—slow webhooks delay every matching create/update.

⚠️ Pitfall

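A webhook with failurePolicy: Fail whose backend is down or slow rejects every request it matches; if it matches pods in all namespaces, pod creation stops cluster-wide. Scope webhooks with namespaceSelector and objectSelector, set timeoutSeconds, monitor webhook latency, and use kubectl get --raw /livez during incidents to confirm the API server itself is healthy.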

etcd

etcd is the cluster's source of truth—a distributed, strongly consistent key-value store. Every Deployment, Secret, and lease lives here. The API server is the only component that should talk to etcd directly.

Distributed KV and Raft consensus

etcd replicates writes across members using the Raft algorithm. A write is committed once a quorum (a majority, floor(n/2) + 1) of members has persisted it. With 3 members the cluster tolerates 1 failure; with 5, it tolerates 2. Avoid even member counts: a 4-member cluster needs a quorum of 3 and still tolerates only 1 failure, so the extra member adds replication cost without adding fault tolerance.
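
The quorum arithmetic is easy to check directly. A short sketch (the function names are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1

def fault_tolerance(members: int) -> int:
    """How many members can fail while a quorum is still reachable."""
    return members - quorum(members)

for n in range(1, 8):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

Going from 3 to 4 members raises the quorum from 2 to 3 but leaves the tolerated failures at 1, which is why odd sizes are the norm.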

Key structure

Kubernetes objects are stored under /registry/ with paths reflecting resource type:

/registry/pods/default/nginx
/registry/deployments/default/web
/registry/secrets/kube-system/bootstrap-token-abc

Values are protobuf-encoded API objects for built-in types; custom resources are stored as JSON. Events and Leases (used for leader election and node heartbeats) also consume keyspace.

Only the API server talks to etcd

Direct etcd access bypasses RBAC and admission—a critical security boundary. Backup tools snapshot via etcdctl with proper certs, not by reading files off disk on running members.

Sizing and performance

Node count: 3 for most clusters; 5 for higher control plane availability
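Storage: Dedicated NVMe/SSD; 99th-percentile WAL fsync latency under 10 ms; avoid network-attached storage for etcd data dirs
Quota: Default 2 GiB backend quota; once it is exceeded, etcd raises a NOSPACE alarm and rejects writes until the keyspace is compacted and defragmented and the alarm is disarmed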
Defragmentation: Periodic etcdctl defrag after heavy delete churn

$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
$ etcdctl endpoint status -w table
$ etcdctl snapshot save /backup/etcd-$(date +%F).db
$ etcdctl get /registry/ --prefix --keys-only | head
$ oc get etcd -o yaml
→ Cluster etcd operator CR shows members, encryption, defrag schedule
$ oc debug node/<control-plane-node> -- chroot /host /usr/local/bin/cluster-backup.sh /var/tmp/etcd-backup
$ oc get clusterversion -o jsonpath='{.items[0].status.conditions[?(@.type=="Progressing")].message}'

OpenShift: encryption at rest

OpenShift 4 supports etcd encryption at rest. You do not write the encryption configuration by hand: set spec.encryption.type (aescbc, or aesgcm on newer releases) on the APIServer resource named cluster, and the API server operators generate the keys, re-encrypt existing data, and rotate keys periodically. Secrets, ConfigMaps, Routes, and OAuth access and authorize tokens are encrypted; other resources remain plaintext. The rendered configuration is equivalent to the upstream EncryptionConfiguration format, shown here for secrets and configmaps only:

{
  "apiVersion": "apiserver.config.k8s.io/v1",
  "kind": "EncryptionConfiguration",
  "resources": [
    {
      "resources": ["secrets", "configmaps"],
      "providers": [
        { "aescbc": { "keys": [{ "name": "key1", "secret": "..." }] } },
        { "identity": {} }
      ]
    }
  ]
}

⚙️ Config

Monitor etcd_mvcc_db_total_size_in_bytes and etcd_server_has_leader. Alert at 80% of quota. Automate daily snapshots; test restore quarterly—an untested backup is wishful thinking.
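
The 80% alert rule reduces to simple arithmetic against the backend quota. A sketch, assuming the default 2 GiB quota and a database size read from etcd_mvcc_db_total_size_in_bytes; the function names and the sample size are illustrative:

```python
DEFAULT_QUOTA_BYTES = 2 * 1024**3  # etcd's default backend quota (2 GiB)

def quota_usage(db_size_bytes: int, quota_bytes: int = DEFAULT_QUOTA_BYTES) -> float:
    """Fraction of the backend quota currently used."""
    return db_size_bytes / quota_bytes

def should_alert(db_size_bytes: int, quota_bytes: int = DEFAULT_QUOTA_BYTES,
                 threshold: float = 0.80) -> bool:
    """True when usage has reached the alert threshold."""
    return quota_usage(db_size_bytes, quota_bytes) >= threshold

sample = int(1.7 * 1024**3)  # hypothetical reading: about 1.7 GiB
print(f"usage={quota_usage(sample):.0%} alert={should_alert(sample)}")
```

In practice this comparison would live in an alerting rule rather than a script; the point is only that the threshold is relative to the configured quota, not to disk size.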

📦 Real World

The most common production etcd incident is disk latency spike on a cloud volume—not CPU. Moving etcd to local SSD or dedicated instances often fixes mysterious API timeouts that look like "network issues."

Scheduler (kube-scheduler)

New pods have spec.nodeName empty. The scheduler picks a node through filter → score → bind. It does not start containers—that is the kubelet's job after binding.

Scheduling pipeline

Queue: pending pods wait in activeQ; pods that failed a scheduling attempt move to backoffQ or the unschedulable pool and are retried later
Filtering: eliminate nodes that cannot fit the pod (hard constraints)
Scoring: rank remaining nodes (soft preferences)
Binding: a POST to the pod's binding subresource sets pod.spec.nodeName
Preemption: if no node fits, lower-priority pods may be evicted to make room (optional)

Filter plugins (hard constraints)

Filter	Rejects node when…
NodeSelector / NodeAffinity	Labels don't match required rules
TaintToleration	Pod lacks toleration for node taint
NodeResourcesFit	Insufficient CPU/memory/ephemeral-storage
InterPodAffinity (pod affinity / anti-affinity)	Co-location rules cannot be satisfied
VolumeBinding	PVC topology or immediate binding constraints fail
NodePorts	Requested hostPort already in use on node

Score plugins (soft preferences)

LeastAllocated — spread load; prefer nodes with more free resources
MostAllocated — bin-pack; reduce fragmentation (common with cluster autoscaler)
ImageLocality — prefer nodes that already have the image pulled
NodeAffinity — weight preferred affinity terms
TopologySpread — balance across zones/racks
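
The filter → score → pick sequence can be sketched with two toy plugins: a resource-fit and taint filter, and a least-allocated score. This is a simplified model, not kube-scheduler code: only CPU is considered, taints are plain strings, and Node, feasible, least_allocated_score, and schedule are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    allocatable_cpu: int   # millicores
    requested_cpu: int     # millicores already requested by pods on the node
    taints: set = field(default_factory=set)

def feasible(node: Node, pod_cpu: int, tolerations: frozenset) -> bool:
    """Filter: the pod must fit and must tolerate every taint on the node."""
    fits = node.allocatable_cpu - node.requested_cpu >= pod_cpu
    return fits and node.taints <= tolerations

def least_allocated_score(node: Node, pod_cpu: int) -> int:
    """Score 0..100: share of CPU still free after placing the pod."""
    free = node.allocatable_cpu - node.requested_cpu - pod_cpu
    return 100 * free // node.allocatable_cpu

def schedule(nodes, pod_cpu: int, tolerations: frozenset = frozenset()):
    candidates = [n for n in nodes if feasible(n, pod_cpu, tolerations)]
    if not candidates:
        return None  # pod stays Pending
    return max(candidates, key=lambda n: least_allocated_score(n, pod_cpu)).name

nodes = [
    Node("a", 4000, 3500),
    Node("b", 4000, 1000),
    Node("c", 8000, 1000, {"dedicated=payments:NoSchedule"}),
]
print(schedule(nodes, 1000))
print(schedule(nodes, 1000, frozenset({"dedicated=payments:NoSchedule"})))
```

Swapping the score function for one that prefers fuller nodes would turn the same pipeline into a bin-packing (MostAllocated-style) scheduler.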

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload
                    operator: In
                    values: [compute]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: payment-api
                topologyKey: kubernetes.io/hostname
      tolerations:
        - key: dedicated
          operator: Equal
          value: payments
          effect: NoSchedule
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-api
      containers:
        - name: api
          image: payment-api:3.2.0
          resources:
            requests:
              cpu: "1"
              memory: 2Gi

Custom schedulers and profiles

Set pod.spec.schedulerName to route pods to a custom scheduler deployment. K8s 1.19+ scheduling profiles let one scheduler binary run multiple plugin configurations. The descheduler (separate project) evicts pods to rebalance violated constraints after initial placement.

$ kubectl describe pod payment-api-7d4f8b-xyz | tail -20
→ Events: 0/12 nodes available: 3 Insufficient memory, 2 node(s) had taint …
$ kubectl get events --field-selector involvedObject.name=payment-api-7d4f8b-xyz
$ kubectl logs -n kube-system -l component=kube-scheduler --tail=50
$ oc describe pod payment-api-7d4f8b-xyz | grep -A5 Events
$ oc adm inspect ns/production --dest-dir=/tmp/inspect

💡 Pro Tip

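kubectl describe pod Events are your first stop for Pending pods. "Insufficient cpu" means the pod's CPU requests exceed what is left of a node's allocatable capacity after the requests of pods already placed there; limits play no part in scheduling. Fix requests or add nodes; tweaking limits alone won't get the pod scheduled.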

⚖️ Trade-off

MostAllocated maximizes node utilization but increases blast radius when a node fails. LeastAllocated + topology spread improves resilience at the cost of more nodes and cross-AZ traffic.

Controller Manager

Kubernetes is a collection of control loops. Each controller watches a resource type and reconciles actual state toward desired state. The kube-controller-manager bundles dozens of these loops in one binary.

Reconciliation loop pattern

Pseudocode every controller follows:

Watch API for changes (or periodic resync)
Enqueue object key into work queue
Read current state from API
Compare to desired spec; compute diff
Emit create/update/delete calls to API server
Requeue on error with exponential backoff
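
The compare-and-act core of these steps can be sketched for a ReplicaSet-like controller. This is a toy model under stated assumptions: pods are dicts with a name and phase, the returned tuples stand in for API calls, and the backoff base and cap are illustrative values, not the defaults of any particular work queue.

```python
def reconcile(desired_replicas: int, pods: list) -> list:
    """Compare desired vs. actual state and return the API calls needed."""
    active = [p for p in pods if p["phase"] != "Failed"]
    diff = desired_replicas - len(active)
    if diff > 0:
        return [("create", None)] * diff
    if diff < 0:
        # Too many pods: delete the surplus.
        return [("delete", p["name"]) for p in active[:-diff]]
    return []  # already converged; nothing to do

def backoff_seconds(retries: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential requeue delay, capped (values chosen for illustration)."""
    return min(cap, base * 2 ** retries)

pods = [
    {"name": "web-1", "phase": "Running"},
    {"name": "web-2", "phase": "Running"},
    {"name": "web-3", "phase": "Failed"},
]
print(reconcile(3, pods))
print([backoff_seconds(r) for r in range(10)])
```

Note that reconcile only looks at current state, never at which event triggered it; running it twice on a converged object is a no-op.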

flowchart LR
  D["Deployment\nreplicas: 3"] --> DC["Deployment controller\ncreates/updates RS"]
  DC --> RSC["ReplicaSet\nreplicas: 3"]
  RSC --> RC["ReplicaSet controller\ncreates Pods"]
  RC --> POD["3 Running Pods"]
  POD -->|"node failure"| RC
  RC -->|"recreate"| POD

Key controllers

Controller	Watches	Action
Deployment	Deployment	Manages ReplicaSets; rolling updates via RS scaling
ReplicaSet	ReplicaSet	Maintains pod count matching selector + replicas
StatefulSet	StatefulSet	Ordered pods with stable network ID + PVC templates
DaemonSet	DaemonSet	One pod per matching node (logging, CNI, kube-proxy)
Job / CronJob	Job, CronJob	Run-to-completion workloads; schedule CronJobs
Node	Node	Taint nodes on NotReady; evict pods after timeout
ServiceAccount	ServiceAccount	Creates the default ServiceAccount in each namespace; token Secrets are no longer auto-created since 1.24 (pods use TokenRequest)
EndpointSlice	Service, Pod	Populate backend endpoints for Service VIP
Namespace	Namespace	Finalize deletion—remove all namespaced objects
PersistentVolume	PV, PVC	Bind claims; apply the reclaim policy (Retain or Delete; Recycle is deprecated)

Leader election

Only one kube-controller-manager instance is active per cluster. Standby replicas compete for a coordination.k8s.io Lease object; the scheduler and cloud-controller-manager use the same pattern. When the leader stops renewing, a standby takes over once the lease expires (15 s lease duration by default), and its controllers rebuild state by listing and watching the API server, not etcd.
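
Lease-based election boils down to "take the lease if it is free, yours, or expired". A minimal sketch, assuming a 15-second lease duration and an in-memory Lease object; the real implementation also relies on optimistic concurrency when updating the Lease through the API, which is not modeled here. All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Lease:
    holder: str = ""
    renew_time: float = 0.0
    duration: float = 15.0  # seconds; assumed for illustration

def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if the candidate holds the lease after this attempt."""
    expired = now - lease.renew_time > lease.duration
    if lease.holder in ("", candidate) or expired:
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False

lease = Lease()
print(try_acquire_or_renew(lease, "cm-a", now=0))   # cm-a becomes leader
print(try_acquire_or_renew(lease, "cm-b", now=5))   # lease still held
print(try_acquire_or_renew(lease, "cm-a", now=10))  # leader renews
print(try_acquire_or_renew(lease, "cm-b", now=26))  # cm-a stopped renewing
print(lease.holder)
```

The failover delay is bounded by the lease duration: a standby cannot take over until the last renewal is older than that window.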

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
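
The effect of maxSurge and maxUnavailable in the manifest above can be worked out numerically. A sketch with hypothetical helper names; percentages are resolved against the replica count, with maxSurge rounded up and maxUnavailable rounded down.

```python
import math

def resolve(value, replicas: int, round_up: bool) -> int:
    """Resolve an int or a percentage string such as "25%" against replicas."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)

def rollout_bounds(replicas: int, max_surge, max_unavailable) -> dict:
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }

print(rollout_bounds(3, 1, 0))            # the manifest above
print(rollout_bounds(10, "25%", "25%"))   # percentage form
```

With maxSurge: 1 and maxUnavailable: 0, the rollout never drops below 3 available pods and never exceeds 4 in total, so it proceeds one pod at a time and needs spare capacity for that extra pod.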

$ kubectl rollout status deployment/web
$ kubectl get rs -l app=web --show-labels
→ Two ReplicaSets during rollout: old + new
$ kubectl get lease -n kube-system | grep controller
$ kubectl scale deployment web --replicas=5
$ oc rollout status deployment/web
$ oc get deploymentconfig 2>/dev/null || oc get deploy web -o yaml
→ Prefer Deployment; DeploymentConfig is deprecated (since OCP 4.14) but still reconciled by the OpenShift controller manager

🔬 Under the Hood

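Controllers are level-triggered, not edge-triggered: they act on the current observed state rather than on individual events. If an event is missed, the informer's re-list after a dropped watch and its periodic resync re-queue the affected objects, so the system heals without a perfectly reliable watch stream. The resync interval is set per informer and is often hours rather than minutes (kube-controller-manager's --min-resync-period defaults to 12h).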

🎯 Interview Tip

"Deployment vs ReplicaSet?" — ReplicaSet ensures N pods exist. Deployment is a higher-level controller that owns ReplicaSets and implements rolling updates by creating a new RS and scaling down the old one—users rarely create ReplicaSets directly.

kubelet

The kubelet is the node agent. It watches the API for pods bound to its node, instructs the container runtime via CRI, runs probes, reports status, and enforces pod lifecycle on actual hardware.

Pod lifecycle on the node

kubelet accepts pod spec (API-assigned or static manifest)
Create pod sandbox (network namespace via CNI)
Pull each container's image via CRI if not cached (subject to imagePullPolicy)
Start init containers sequentially, then app containers
Run liveness/readiness/startup probes
Restart failed containers per restartPolicy
Report PodStatus phases: Pending → Running → Succeeded/Failed

CRI path: containerd / CRI-O → runc

The kubelet speaks CRI (gRPC)—not Docker directly since K8s 1.24 dropped dockershim. containerd (default on many distros) or CRI-O (OpenShift default) pulls images, creates sandboxes, and invokes runc to start OCI bundles.

flowchart LR
  API["API Server"] -->|"pod spec"| KL["kubelet"]
  KL -->|"CRI gRPC"| RT["containerd / CRI-O"]
  RT --> RUN["runc"]
  RUN --> C["containers"]
  KL --> CNI["CNI ADD"]
  CNI --> C
  KL -->|"status"| API

Health probes

Probe	Purpose	Failure action
Startup	Slow-starting apps (JVM warmup)	Kill container; liveness disabled until success
Liveness	Is the process deadlocked?	Restart container
Readiness	Can this instance receive traffic?	Remove from Service endpoints (no restart)
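
The interplay of the three probes can be written as a small decision function. This is a sketch of the behavior described in the table, not kubelet code; probe_action and its "pending"/"passed"/"failed" startup states are hypothetical, with "failed" meaning the startup probe exhausted its failureThreshold.

```python
def probe_action(startup: str, liveness_ok: bool, readiness_ok: bool) -> dict:
    """startup is "pending", "passed", or "failed" (failureThreshold exceeded)."""
    if startup == "failed":
        return {"restart": True, "receives_traffic": False}
    if startup == "pending":
        # Liveness and readiness are held off until the startup probe succeeds.
        return {"restart": False, "receives_traffic": False}
    if not liveness_ok:
        return {"restart": True, "receives_traffic": False}
    # Readiness failures only remove the pod from endpoints; no restart.
    return {"restart": False, "receives_traffic": readiness_ok}

print(probe_action("pending", liveness_ok=False, readiness_ok=False))
print(probe_action("passed", liveness_ok=True, readiness_ok=False))
print(probe_action("passed", liveness_ok=False, readiness_ok=True))
```

The second case is the one that matters under load: a slow but alive instance loses traffic without being restarted.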

apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: api:2.1.0
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 15
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          memory: 1Gi
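
The timing implied by the manifest above is worth working out: the startup probe gives the app failureThreshold × periodSeconds to come up before the container is killed. A sketch with a hypothetical helper; the liveness figure assumes the default failureThreshold of 3, since the manifest does not set one.

```python
def probe_budget_seconds(failure_threshold: int, period_seconds: int,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate time a probe may keep failing before its failure action fires."""
    return initial_delay_seconds + failure_threshold * period_seconds

startup_budget = probe_budget_seconds(failure_threshold=30, period_seconds=10)
liveness_budget = probe_budget_seconds(failure_threshold=3, period_seconds=15)
print(f"startup budget: {startup_budget}s, liveness detection: about {liveness_budget}s")
```

So this pod has roughly five minutes to start, after which a hung process is detected and restarted in well under a minute.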

Node status, GC, and static pods

The kubelet posts NodeStatus (capacity, allocatable, conditions like Ready, DiskPressure). It garbage-collects dead containers and unused images. Static pods—manifests in /etc/kubernetes/manifests/—are managed locally; the API server mirrors them as mirror pods (how control plane components run on kubeadm clusters).

$ kubectl describe node worker-1 | grep -A10 Conditions
$ kubectl get pods -o wide --field-selector spec.nodeName=worker-1
$ crictl pods && crictl ps
$ journalctl -u kubelet -f
$ oc describe node worker-1
→ containerRuntimeVersion shows cri-o://1.28.x on RHCOS
$ oc debug node/worker-1 -- chroot /host crictl ps
$ oc adm node-logs worker-1 --kubelet

OpenShift: CRI-O default

RHCOS nodes run CRI-O exclusively; there is no containerd option. Image pulls integrate with the internal registry and, for disconnected installs, with mirror configuration (ImageContentSourcePolicy, or ImageDigestMirrorSet on newer releases). Debugging uses oc debug node/... because direct SSH to nodes is discouraged.

⚠️ Pitfall

A liveness probe hitting the same endpoint as readiness restarts pods under load when the app is merely slow—not dead. Use startup probes for JVM apps; keep liveness checks lightweight and distinct from readiness.

📦 Real World

ImagePullBackOff at the kubelet layer means registry auth, missing image, or rate limit—not a scheduler issue. Check crictl pull on the node and imagePullSecrets before chasing Deployment controllers.

kube-proxy

Services get a stable virtual IP (ClusterIP). kube-proxy programs node networking so traffic to that VIP reaches healthy pod backends. It implements Service abstraction—not pod-to-pod routing.

What kube-proxy does

Watches Service and EndpointSlice objects. For each Service, installs rules mapping ClusterIP:port → pod IPs (load-balanced). Also handles NodePort and externalIPs by binding host ports or accepting external traffic per mode.
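
The mapping kube-proxy maintains can be modeled as a lookup table from VIP to ready backends. This is a conceptual sketch only: real kube-proxy programs kernel rules rather than a Python dict, the dict shapes are simplified stand-ins for Service and EndpointSlice objects, and build_rules and pick_backend are hypothetical names.

```python
import random

def build_rules(service: dict, endpoint_slices: list) -> dict:
    """Map "ClusterIP:port" to the list of ready "podIP:targetPort" backends."""
    rules = {}
    for svc_port in service["ports"]:
        backends = [
            f"{addr}:{svc_port['targetPort']}"
            for s in endpoint_slices
            for ep in s["endpoints"] if ep["ready"]
            for addr in ep["addresses"]
        ]
        rules[f"{service['clusterIP']}:{svc_port['port']}"] = backends
    return rules

def pick_backend(rules: dict, vip: str, rng=random):
    """Random selection, as in iptables mode; None if there is no backend."""
    backends = rules.get(vip, [])
    return rng.choice(backends) if backends else None

service = {"clusterIP": "172.30.0.10", "ports": [{"port": 80, "targetPort": 8080}]}
slices = [{"endpoints": [
    {"addresses": ["10.0.1.5"], "ready": True},
    {"addresses": ["10.0.2.7"], "ready": True},
    {"addresses": ["10.0.3.9"], "ready": False},  # failing readiness: excluded
]}]
rules = build_rules(service, slices)
print(rules)
print(pick_backend(rules, "172.30.0.10:80"))
```

The not-ready endpoint never enters the table, which is how a failing readiness probe takes a pod out of rotation without restarting it.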

Modes: iptables vs IPVS vs eBPF

Mode	Mechanism	Trade-offs
iptables (default on Linux)	Chain of NAT rules per Service	Simple; rule evaluation is O(n) in the number of Services, so it scales poorly; random backend selection
IPVS	Kernel IP Virtual Server	Better scalability; L4 load balancing algorithms (rr, lc, dh)
eBPF / Cilium	Bypass kube-proxy; Cilium replaces with eBPF maps	Lower latency; unified policy + LB; requires Cilium dataplane

flowchart LR
  C["Client pod"] -->|"ClusterIP:80"| KP["kube-proxy rules\niptables/IPVS"]
  KP --> P1["Pod 10.0.1.5:8080"]
  KP --> P2["Pod 10.0.2.7:8080"]
  C2["Pod A"] -->|"direct pod IP"| CNI["CNI routing"]
  CNI --> P3["Pod B"]
  note["kube-proxy does NOT handle Pod A → Pod B"] --- CNI

What kube-proxy does NOT do

Pod-to-pod communication is handled by the CNI plugin (routing, overlay, eBPF). kube-proxy only intercepts traffic destined for Service VIPs. DNS resolution (my-svc.namespace.svc.cluster.local) is handled by CoreDNS, which is also a separate component.

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: web-abc
  labels:
    kubernetes.io/service-name: web
addressType: IPv4
ports:
  - port: 8080
    protocol: TCP
endpoints:
  - addresses: ["10.128.0.12"]
    conditions:
      ready: true
    targetRef:
      kind: Pod
      name: web-7f8d9-xk2lp

$ kubectl get svc web -o wide
$ kubectl get endpointslices -l kubernetes.io/service-name=web
$ kubectl get cm -n kube-system kube-proxy -o yaml | grep mode
$ iptables-save | grep 'default/web'
$ oc get svc web
$ oc get network/cluster -o yaml
→ OVN-Kubernetes on OCP; kube-proxy may be disabled when eBPF/OVN handles Services
$ oc exec -it debug-pod -- curl -s http://web.default.svc:80/healthz

🔴 OpenShift

OCP 4.12 and later default to the OVN-Kubernetes CNI with distributed service load balancing. Depending on cluster version and network configuration, kube-proxy may not be the active dataplane; check the Network cluster operator before debugging iptables rules.

🔬 Under the Hood

EndpointSlice controller populates backends; kube-proxy reacts. If pods are Running but Service has no endpoints, check selector labels and readiness probes—not kube-proxy itself.
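
That troubleshooting order (selector first, then readiness) can be expressed as a small check. A sketch over simplified pod dicts; diagnose is a hypothetical helper, and only equality-based selectors are modeled.

```python
def diagnose(selector: dict, pods: list) -> str:
    """Explain why a Service does or does not have ready endpoints."""
    matching = [p for p in pods
                if all(p["labels"].get(k) == v for k, v in selector.items())]
    if not matching:
        return "no pods match the Service selector"
    ready = [p for p in matching if p["ready"]]
    if not ready:
        return "pods match but none are Ready (check readiness probes)"
    return f"{len(ready)} ready endpoint(s)"

pods = [
    {"name": "web-1", "labels": {"app": "web"}, "ready": False},
    {"name": "api-1", "labels": {"app": "api"}, "ready": True},
]
print(diagnose({"app": "web"}, pods))
print(diagnose({"app": "frontend"}, pods))
print(diagnose({"app": "api"}, pods))
```

Only when both checks pass is it worth looking at the dataplane itself.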

OpenShift Control Plane Additions

OpenShift wraps upstream Kubernetes with operators that manage cluster lifecycle, node OS, platform services, and enterprise integrations. Understanding these is essential for OCP operations and CKA-adjacent SRE work.

Machine Config Operator (MCO)

RHCOS nodes are immutable—no yum install or SSH patching. MCO renders MachineConfig objects into Ignition configs, drains nodes, and reboots to apply kernel args, kubelet settings, registry certs, and chrony configuration.

Cluster Version Operator (CVO)

CVO drives cluster upgrades—OCP x.y.z → next z-stream or minor version. It coordinates image updates across control plane operators, waits for health, and surfaces progress via ClusterVersion status. Blocked upgrades often mean ClusterOperators are Degraded.

Operator Lifecycle Manager (OLM)

OLM installs and upgrades optional (add-on) operators from OperatorHub, handling the CSV (ClusterServiceVersion) lifecycle, CRD ownership, and dependency resolution; the core cluster operators are managed by the CVO instead. Platform teams install cert-manager, Service Mesh, or custom operators through OLM.

Image Registry Operator

Provides a default internal registry (image-registry.openshift-image-registry.svc:5000) or integrates with external S3/GCS. Manages TLS, storage PVC, and routing for oc import-image workflows.

Authentication Operator

Configures OAuth server, identity providers (LDAP, HTPasswd, OIDC), and console login. Integrates with Kubernetes RBAC via OpenShift groups mapped to ClusterRoleBindings.

RHCOS (Red Hat CoreOS)

Minimal, immutable OS purpose-built for containers. Nodes join via Ignition on first boot; updates ship as OSTree images applied by MCO during upgrades—same mechanism as Fedora CoreOS, enterprise-hardened for OCP.

flowchart TB
  CVO["Cluster Version Operator"] --> CO["ClusterOperators\n50+ platform operators"]
  MCO["Machine Config Operator"] --> RHCOS["RHCOS nodes\nIgnition + OSTree"]
  OLM["Operator Lifecycle Manager"] --> OH["OperatorHub CSVs"]
  AUTH["Authentication Operator"] --> OAuth["OAuth / IdP"]
  REG["Image Registry Operator"] --> IR["Internal registry"]
  CO --> API["kube-apiserver"]
  RHCOS --> KL["kubelet + CRI-O"]

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
  maxUnavailable: 1

$ # vanilla K8s has no ClusterOperators — use component logs
$ kubectl get pods -n kube-system
$ oc get clusteroperators
$ oc get clusterversion version
$ oc get mcp
→ MASTER/WORKER pools show UPDATED/DEGRADED/UPDATING
$ oc get co authentication image-registry kube-apiserver -o yaml
$ oc adm upgrade --to=4.14.12

💡 Pro Tip

Start every OCP incident with oc get co. If kube-apiserver or etcd operator is Degraded, workload symptoms are downstream. Fix platform operators before debugging application Deployments.
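
For scripted triage, the same check can be run over the text output of oc get co. A sketch, assuming the usual column layout (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE) and that every row has a value in each of those columns; the sample table is made up for illustration, and unhealthy_operators is a hypothetical helper.

```python
def unhealthy_operators(table: str) -> list:
    """Return names of operators that are Degraded or not Available."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header = lines[0].split()
    available = header.index("AVAILABLE")
    degraded = header.index("DEGRADED")
    bad = []
    for line in lines[1:]:
        cols = line.split()
        if cols[degraded] == "True" or cols[available] == "False":
            bad.append(cols[0])
    return bad

sample = """
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.14.12   True        False         False      3d
etcd             4.14.12   True        False         True       12m
kube-apiserver   4.14.12   False       True          False      5m
"""
print(unhealthy_operators(sample))
```

For anything beyond a quick look, parsing oc get co -o json is more robust than splitting columns, since a missing VERSION value would shift the fields.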

⚖️ Trade-off

OCP's opinionated stack (CRI-O, OVN, SCC, internal registry) reduces integration toil but increases migration friction from vanilla K8s manifests. Design for portable core APIs; isolate OCP-specific Routes and SCC grants.