Architecture & Control Plane

Every kubectl apply is an HTTPS request. Every pod placement is a scheduler decision. Every self-healing restart is a controller reconciliation loop. This chapter maps the problem each component solves, the primitive it exposes, and what actually happens under the hood—from etcd Raft quorum to CRI-O on OpenShift nodes.

developer devops architect K8s 1.28+ OCP 4.12+ CKA

Cluster Architecture Overview

Kubernetes splits responsibilities between a control plane (global decisions, API, state store) and worker nodes (local execution). The control plane never runs your app containers directly—it tells kubelets what to run and continuously reconciles when reality drifts from desired state.

Control plane components

The control plane is the cluster brain. Each component is a separate process (often static pods on dedicated nodes in kubeadm clusters, or operator-managed pods on OpenShift).

Component | Problem it solves | Key responsibility
kube-apiserver | No single front door for all cluster operations | REST HTTPS gateway; auth, admission, validation; only writer to etcd
etcd | Cluster state must survive process crashes | Distributed key-value store; Raft consensus; source of truth
kube-scheduler | Pods need a node; placement is non-trivial | Filter feasible nodes, score, bind pod.spec.nodeName
kube-controller-manager | Desired state ≠ actual state without loops | Runs controllers (Deployment, Node, PV, EndpointSlice, …)
cloud-controller-manager | Cloud-specific integration pollutes core K8s | Node lifecycle, LB, routes, volumes for AWS/GCP/Azure (optional)

Worker node components

Worker nodes (and control plane nodes that also run workloads—discouraged in production) host the agents that materialize pods on real hardware.

Component | Runs on | Role
kubelet | Every node | Node agent; watches API; drives pod lifecycle via CRI
kube-proxy | Every node (usually DaemonSet) | Implements Service ClusterIP/NodePort via iptables, IPVS, or eBPF
Container runtime | Every node | containerd or CRI-O → runc (OCI)
CNI plugin | Every node | Pod IP assignment, routing (Calico, Cilium, OVN-Kubernetes on OCP)

End-to-end data flow

You declare a Deployment in YAML. The API server persists it to etcd. The Deployment controller creates ReplicaSets; the ReplicaSet controller creates Pods. Unscheduled pods appear in the scheduler queue; once bound, the kubelet on that node pulls images and starts containers. Status flows back: kubelet → API server → etcd → your kubectl get pods -w watch stream.

flowchart TB
  subgraph cp["Control plane nodes"]
    API["kube-apiserver"]
    ETCD["etcd cluster\nRaft quorum"]
    SCH["kube-scheduler"]
    CM["kube-controller-manager"]
    CCM["cloud-controller-manager"]
  end
  subgraph w1["Worker node 1"]
    KL1["kubelet"]
    KP1["kube-proxy"]
    CRI1["containerd / CRI-O"]
    P1["Pods"]
  end
  subgraph w2["Worker node 2"]
    KL2["kubelet"]
    KP2["kube-proxy"]
    CRI2["containerd / CRI-O"]
    P2["Pods"]
  end
  CLI["kubectl / oc"] -->|"HTTPS REST"| API
  API <-->|"read/write"| ETCD
  API --> SCH
  API --> CM
  API --> CCM
  SCH -->|"bind pod"| API
  CM -->|"reconcile"| API
  KL1 -->|"watch/report"| API
  KL2 -->|"watch/report"| API
  KL1 --> CRI1 --> P1
  KL2 --> CRI2 --> P2
  KP1 -.->|"Service VIP"| P1
  KP2 -.->|"Service VIP"| P2
  CNI["CNI plugin\npod-to-pod"] --- P1
  CNI --- P2
terminal — cluster component health
$ kubectl get componentstatuses 2>/dev/null || kubectl get --raw='/readyz?verbose'
$ kubectl get nodes -o wide
$ kubectl get pods -n kube-system -o wide
$ kubectl get pods -n openshift-kube-apiserver 2>/dev/null
$ oc get clusteroperators
→ Degraded operators surface control plane issues before workloads fail
$ oc get nodes -o custom-columns=NAME:.metadata.name,ROLES:.metadata.labels.node-role\\.kubernetes\\.io/worker,RUNTIME:.status.nodeInfo.containerRuntimeVersion
$ oc adm top nodes
🔬 Under the Hood

Control plane components are stateless relative to etcd—they rebuild in-memory caches from watches on startup. etcd is the only durable state. Lose etcd without backups and you lose the cluster identity, even if nodes still run containers.

⚖️ Trade-off

Running workloads on control plane nodes saves cost in dev but risks noisy neighbor starvation of API/etcd during traffic spikes. Production: taint control plane nodes with node-role.kubernetes.io/control-plane:NoSchedule and keep them workload-free.

API Server (kube-apiserver)

The API server is the single entry point to cluster state. Every kubectl, oc, controller, scheduler, and kubelet interaction is an HTTPS REST call. There is no backdoor around it.

REST API gateway

Resources are addressed by API group, version, namespace (if namespaced), and name: /apis/apps/v1/namespaces/default/deployments/web. kubectl get pods becomes GET /api/v1/namespaces/<ns>/pods with auth headers from kubeconfig.
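
The addressing scheme above can be sketched as a small path builder. This is an illustration only: resource_path is a hypothetical helper, not part of kubectl or any client library, and it ignores subresources and query parameters.

```python
from typing import Optional


def resource_path(group: str, version: str, resource: str,
                  namespace: Optional[str] = None,
                  name: Optional[str] = None) -> str:
    """Build the REST path for a Kubernetes resource (hypothetical helper).

    The core group ("") lives under /api; every named group lives under /apis.
    """
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    parts = [prefix]
    if namespace is not None:
        # Namespaced resources carry a namespaces/<ns> segment.
        parts.append(f"namespaces/{namespace}")
    parts.append(resource)
    if name is not None:
        parts.append(name)
    return "/".join(parts)


# The Deployment path quoted in the text:
deployment_path = resource_path("apps", "v1", "deployments", "default", "web")
# Listing pods in a namespace, as kubectl get pods does:
pods_path = resource_path("", "v1", "pods", "default")
```

Cluster-scoped resources such as Node simply omit the namespace argument.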

Request flow

Every mutating request passes through a strict pipeline before etcd sees a byte:

  1. Authentication: client cert, bearer token, OIDC (OpenShift OAuth)
  2. Authorization: RBAC, webhook authorizers, Node authorizer
  3. Mutating admission: mutating webhooks and built-in defaulting (LimitRanger, ServiceAccount)
  4. Validation: OpenAPI schema, immutability rules
  5. Validating admission: validating webhooks and built-in policy (ResourceQuota, PodSecurity)
  6. Persist: encode and write to etcd under /registry/...
  7. Respond: return object + resourceVersion; emit watch events
flowchart LR
  C["Client\nkubectl / controller"] --> A["Authentication"]
  A --> R["RBAC Authorization"]
  R --> M["Mutating Admission\nwebhooks + defaults + SA"]
  M --> VAL["Schema Validation"]
  VAL --> V["Validating Admission\nwebhooks + PSA + Quota"]
  V --> E["etcd write"]
  E --> W["Watch broadcast"]
  W --> C
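
The request pipeline can be sketched as a chain of stages where any stage may reject the request before anything is stored. This is a toy model under stated assumptions: the request dictionary, the allowed set standing in for RBAC, the quota number, and all function names are hypothetical. The stage order follows the upstream API server, where schema validation sits between the mutating and validating admission phases.

```python
class Rejected(Exception):
    """Raised by a stage to stop the pipeline with an HTTP-style status code."""

    def __init__(self, stage: str, code: int, reason: str):
        super().__init__(f"{stage}: {code} {reason}")
        self.stage, self.code = stage, code


def authenticate(req):
    if not req.get("user"):
        raise Rejected("authentication", 401, "no valid credentials")


def authorize(req):
    # Hypothetical stand-in for RBAC: a set of (user, verb, resource) tuples.
    if (req["user"], req["verb"], req["resource"]) not in req["allowed"]:
        raise Rejected("authorization", 403, "RBAC denied")


def mutate(req):
    # Mutating admission may patch the object, for example by applying a default.
    req["object"].setdefault("replicas", 1)


def validate_schema(req):
    if not isinstance(req["object"].get("replicas"), int):
        raise Rejected("schema validation", 422, "replicas must be an integer")


def validate_policy(req):
    # Validating admission sees the final, already mutated object.
    if req["object"]["replicas"] > req["quota"]:
        raise Rejected("validating admission", 403, "exceeds quota")


PIPELINE = [authenticate, authorize, mutate, validate_schema, validate_policy]


def handle(req, store):
    """Run every stage in order; persist only if none of them rejects."""
    for stage in PIPELINE:
        stage(req)
    store[req["name"]] = dict(req["object"])
    return store[req["name"]]
```

Because the stages run synchronously and in sequence, a slow or failing stage delays or blocks the write, which is the behaviour the webhook warnings in this section describe.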

Admission controllers

Mutating admission

Runs before persistence. Can patch objects: inject sidecars, set defaults, add labels. MutatingWebhookConfiguration extends this for custom policy (Kyverno, OPA Gatekeeper).

Validating admission

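Runs after mutation and schema validation. Rejects requests that violate policy: built-in plugins typically answer 403 Forbidden, while webhooks choose their own status code (422 Unprocessable Entity is what schema validation returns). Built-ins include ResourceQuota (namespace caps) and PodSecurity (privileged/baseline/restricted). LimitRanger (default limits) and ServiceAccount (token volume injection) work in both phases: they set defaults during mutation and check the result here.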

yaml — Pod Security Admission labels
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Watch mechanism

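Controllers and kubelets do not poll; they list once, then watch from the returned resourceVersion. The API server streams ADDED/MODIFIED/DELETED events, plus periodic BOOKMARK events that only advance the resourceVersion. Efficient cluster operation depends on this: without a healthy watch cache, clients reconnecting after a restart fall back to full re-lists and stampede the API server.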
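
A minimal sketch of how a watch consumer keeps a local cache in step with the event stream. The tuple layout and apply_events are hypothetical, and resourceVersion is simplified to an integer here; real clients must treat it as an opaque string and only hand it back to the server.

```python
from typing import Dict, Iterable, Tuple

# (event type, object key, object, resourceVersion) - an assumed shape for illustration
Event = Tuple[str, str, dict, int]


def apply_events(cache: Dict[str, dict], events: Iterable[Event], last_rv: int) -> int:
    """Apply watch events to a local cache and return the newest resourceVersion."""
    for etype, key, obj, rv in events:
        if rv <= last_rv:
            continue  # already processed, e.g. replayed after a reconnect
        if etype in ("ADDED", "MODIFIED"):
            cache[key] = obj
        elif etype == "DELETED":
            cache.pop(key, None)
        # BOOKMARK changes no object; it only moves the resume point forward.
        last_rv = rv
    return last_rv
```

Resuming from the returned value after a dropped connection is what lets a client avoid a full re-list.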

API groups

Group | Examples
core ("") | Pod, Service, ConfigMap, Secret, Namespace, Node
apps | Deployment, ReplicaSet, StatefulSet, DaemonSet
batch | Job, CronJob
networking.k8s.io | NetworkPolicy, Ingress
rbac.authorization.k8s.io | Role, ClusterRole, RoleBinding
apiextensions.k8s.io | CustomResourceDefinition → your CRDs

Server-side apply

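Client-side kubectl apply computes a patch locally from the last-applied-configuration annotation, which is fragile when several actors write the same object. Server-side apply (--server-side) moves the merge into the API server and tracks field ownership in metadata.managedFields, so controllers and humans can coexist without clobbering each other's fields.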

yaml — server-side apply field manager
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:2.4.1
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 1Gi
terminal — API discovery
$ kubectl api-resources --verbs=list,create -o wide
$ kubectl explain deployment.spec.strategy
$ kubectl apply -f deploy.yaml --server-side --field-manager=platform-team
$ kubectl get --raw /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations
$ oc api-resources | grep -E 'route|build|imagestream'
→ route.openshift.io, build.openshift.io, image.openshift.io
$ oc explain route.spec
$ oc apply -f deploy.yaml --server-side --field-manager=ci-pipeline
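
The field-ownership idea behind server-side apply can be sketched with a flat map of field paths to managers. This is a simplification under stated assumptions: real managedFields are nested structures, the API server also removes fields a manager stops applying, and server_side_apply plus the "hpa" manager name are hypothetical.

```python
class Conflict(Exception):
    """Another field manager owns a field with a different value."""


def server_side_apply(obj, owners, manager, fields, force=False):
    """Apply `fields` on behalf of `manager`.

    obj:    dict of field path -> value (the stored object, flattened)
    owners: dict of field path -> set of manager names
    """
    conflicts = [
        path for path, value in fields.items()
        if path in obj and obj[path] != value
        and owners.get(path, set()) - {manager}
    ]
    if conflicts and not force:
        raise Conflict(f"{manager} conflicts on {sorted(conflicts)}")
    for path, value in fields.items():
        if path in conflicts:
            owners[path] = {manager}  # forced apply takes over ownership
        else:
            owners.setdefault(path, set()).add(manager)  # new field or same value
        obj[path] = value


obj, owners = {}, {}
server_side_apply(obj, owners, "platform-team",
                  {"spec.replicas": 3, "spec.image": "api:2.4.1"})
server_side_apply(obj, owners, "hpa", {"spec.replicas": 3})  # same value: co-owned
```

A later apply from ci-pipeline that sets spec.replicas to 5 would raise Conflict unless it passes force=True, which mirrors the role of the --force-conflicts flag in kubectl.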

OpenShift API extensions

OpenShift ships additional API groups alongside upstream Kubernetes: route.openshift.io/v1 (HAProxy Routes), build.openshift.io/v1 (BuildConfig, S2I), image.openshift.io/v1 (ImageStream), security.openshift.io/v1 (SecurityContextConstraints). Clients reach them through the same API endpoint; the kube-apiserver aggregation layer forwards most of these groups to the openshift-apiserver. oc adds convenience commands for them, and vanilla kubectl works for most resources.

🎯 Interview Tip

"What happens when you run kubectl apply?" Walk the chain: auth → RBAC → mutating admission → validating admission → etcd → controllers react via watch. Mention admission webhooks are synchronous—slow webhooks delay every matching create/update.

⚠️ Pitfall

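A mutating webhook with failurePolicy: Fail that matches pods blocks all pod creation cluster-wide whenever its backend is down or unreachable. Scope webhooks with namespaceSelector and objectSelector, always set timeoutSeconds, monitor webhook latency, and use kubectl get --raw /livez during incidents.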

etcd

etcd is the cluster's source of truth—a distributed, strongly consistent key-value store. Every Deployment, Secret, and lease lives here. The API server is the only component that should talk to etcd directly.

Distributed KV and Raft consensus

etcd replicates writes across members using the Raft algorithm. A write is committed once a quorum (a majority, floor(n/2) + 1) of members has persisted it. With 3 members the cluster tolerates 1 failure; with 5, it tolerates 2. Avoid even member counts: a fourth member raises the quorum from 2 to 3 without tolerating any additional failure, so it adds replication cost and one more machine that can fail.
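
The quorum arithmetic is small enough to check directly. The function names below are hypothetical; the formulas are the standard majority rule.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a quorum is still reachable."""
    return members - quorum(members)


# members -> (quorum, failures tolerated)
table = {n: (quorum(n), fault_tolerance(n)) for n in range(1, 6)}
# {1: (1, 0), 2: (2, 0), 3: (2, 1), 4: (3, 1), 5: (3, 2)}
```

Comparing the rows for 3 and 4 members shows why even sizes are avoided: the quorum grows, the tolerated failures do not.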

Key structure

Kubernetes objects are stored under /registry/ with paths reflecting resource type:

  • /registry/pods/default/nginx
  • /registry/deployments/default/web
  • /registry/secrets/kube-system/bootstrap-token-abc

Values are protobuf-encoded API objects plus metadata. Event history and leases (for leader election, node heartbeats) also consume keyspace.

Only the API server talks to etcd

Direct etcd access bypasses RBAC and admission—a critical security boundary. Backup tools snapshot via etcdctl with proper certs, not by reading files off disk on running members.

Sizing and performance

  • Node count: 3 for most clusters; 5 for higher control plane availability
  • Storage: Dedicated NVMe/SSD; latency <10ms; avoid network-attached storage for etcd data dirs
  • Quota: Default 2GB backend quota; once it is exceeded, etcd raises a NOSPACE alarm and rejects writes until the keyspace is compacted and defragmented and the alarm is cleared
  • Defragmentation: Periodic etcdctl defrag after heavy delete churn
terminal — etcd operations
$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
$ etcdctl endpoint status -w table
$ etcdctl snapshot save /backup/etcd-$(date +%F).db
$ etcdctl get /registry/ --prefix --keys-only | head
$ oc get etcd -o yaml
→ Cluster etcd operator CR shows members, encryption, defrag schedule
$ oc debug node/<control-plane-node> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup
$ oc get clusterversion -o jsonpath='{.items[0].status.conditions[?(@.type=="Progressing")].message}'

OpenShift: encryption at rest

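OCP 4.10+ supports etcd encryption at rest through a Kubernetes encryption provider, enabled by setting spec.encryption.type on the apiserver cluster resource (config.openshift.io/v1, kind APIServer, name cluster). Secrets, ConfigMaps, Routes, and OAuth access and authorize tokens are encrypted in etcd; other resources remain plaintext. Enable it during install or as a later migration, and plan key rotation up front, including any KMS integration.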

json — encryption configuration (conceptual, upstream EncryptionConfiguration format)
{
  "apiVersion": "apiserver.config.k8s.io/v1",
  "kind": "EncryptionConfiguration",
  "resources": [
    {
      "resources": ["secrets", "configmaps"],
      "providers": [
        { "aescbc": { "keys": [{ "name": "key1", "secret": "..." }] } },
        { "identity": {} }
      ]
    }
  ]
}
⚙️ Config

Monitor etcd_mvcc_db_total_size_in_bytes and etcd_server_has_leader. Alert at 80% of quota. Automate daily snapshots; test restore quarterly—an untested backup is wishful thinking.
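
A minimal sketch of the 80% rule, assuming the database size is read from etcd_mvcc_db_total_size_in_bytes and the quota is the 2 GiB default. The function names and the threshold parameter are hypothetical; in practice this check would live in an alerting rule rather than application code.

```python
DEFAULT_QUOTA_BYTES = 2 * 1024**3  # assumed default backend quota (2 GiB)


def quota_usage(db_size_bytes: int, quota_bytes: int = DEFAULT_QUOTA_BYTES) -> float:
    """Fraction of the backend quota currently used."""
    return db_size_bytes / quota_bytes


def should_alert(db_size_bytes: int,
                 quota_bytes: int = DEFAULT_QUOTA_BYTES,
                 threshold: float = 0.80) -> bool:
    """True once usage reaches the alert threshold."""
    return quota_usage(db_size_bytes, quota_bytes) >= threshold
```

Alerting well below 100% leaves time to compact and defragment before writes start failing.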

📦 Real World

The most common production etcd incident is disk latency spike on a cloud volume—not CPU. Moving etcd to local SSD or dedicated instances often fixes mysterious API timeouts that look like "network issues."

Scheduler (kube-scheduler)

New pods have spec.nodeName empty. The scheduler picks a node through filter → score → bind. It does not start containers—that is the kubelet's job after binding.

Scheduling pipeline

  1. Queue: pending pods wait in activeQ; pods that failed a scheduling attempt retry from backoffQ
  2. Filtering: eliminate nodes that cannot fit the pod (hard constraints)
  3. Scoring: rank remaining nodes (soft preferences)
  4. Binding: a POST to the pods/binding subresource sets pod.spec.nodeName (optimistic concurrency)
  5. Preemption: if filtering leaves no feasible node, lower-priority pods may be evicted to make room (optional)

Filter plugins (hard constraints)

Filter | Rejects node when…
NodeSelector / NodeAffinity | Labels don't match required rules
TaintToleration | Pod lacks toleration for node taint
NodeResourcesFit | Insufficient CPU/memory/ephemeral-storage
InterPodAffinity (pod affinity / anti-affinity) | Co-location rules cannot be satisfied
VolumeBinding | PVC topology or immediate binding constraints fail
NodePorts | Requested hostPort already in use on node

Score plugins (soft preferences)

  • LeastAllocated — spread load; prefer nodes with more free resources
  • MostAllocated — bin-pack; reduce fragmentation (common with cluster autoscaler)
  • ImageLocality — prefer nodes that already have the image pulled
  • NodeAffinity — weight preferred affinity terms
  • TopologySpread — balance across zones/racks
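
The filter → score → bind sequence can be sketched in a few lines. This is a toy model, not the kube-scheduler implementation: Node, Pod, feasible, least_allocated_score, and schedule are hypothetical names, taints are reduced to plain strings, and only a resource-fit filter, a taint filter, and a LeastAllocated-style score are modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    allocatable_cpu: int        # millicores
    allocatable_mem: int        # MiB
    requested_cpu: int = 0      # sum of requests of pods already bound here
    requested_mem: int = 0
    taints: List[str] = field(default_factory=list)


@dataclass
class Pod:
    name: str
    cpu: int                    # requested millicores
    mem: int                    # requested MiB
    tolerations: List[str] = field(default_factory=list)
    node_name: Optional[str] = None


def feasible(pod: Pod, node: Node) -> bool:
    """Filter phase: hard constraints on requests and taints."""
    fits = (node.requested_cpu + pod.cpu <= node.allocatable_cpu
            and node.requested_mem + pod.mem <= node.allocatable_mem)
    tolerated = all(t in pod.tolerations for t in node.taints)
    return fits and tolerated


def least_allocated_score(pod: Pod, node: Node) -> float:
    """Score phase: mean free fraction of CPU and memory after placement."""
    cpu_free = 1 - (node.requested_cpu + pod.cpu) / node.allocatable_cpu
    mem_free = 1 - (node.requested_mem + pod.mem) / node.allocatable_mem
    return (cpu_free + mem_free) / 2


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None             # no feasible node: the pod stays Pending
    best = max(candidates, key=lambda n: least_allocated_score(pod, n))
    best.requested_cpu += pod.cpu   # bind: reserve the requests on the node
    best.requested_mem += pod.mem
    pod.node_name = best.name
    return best.name
```

Note that only requests enter the calculation. Swapping the score function for one that prefers fuller nodes would turn the same loop into a MostAllocated-style bin packer.
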
yaml — affinity, tolerations, topology spread
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload
                    operator: In
                    values: [compute]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: payment-api
                topologyKey: kubernetes.io/hostname
      tolerations:
        - key: dedicated
          operator: Equal
          value: payments
          effect: NoSchedule
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-api
      containers:
        - name: api
          image: payment-api:3.2.0
          resources:
            requests:
              cpu: "1"
              memory: 2Gi

Custom schedulers and profiles

Set pod.spec.schedulerName to route pods to a custom scheduler deployment. K8s 1.19+ scheduling profiles let one scheduler binary run multiple plugin configurations. The descheduler (separate project) evicts pods to rebalance violated constraints after initial placement.

terminal — why is my pod pending?
$ kubectl describe pod payment-api-7d4f8b-xyz | tail -20
→ Events: 0/12 nodes available: 3 Insufficient memory, 2 node(s) had taint …
$ kubectl get events --field-selector involvedObject.name=payment-api-7d4f8b-xyz
$ kubectl logs -n kube-system -l component=kube-scheduler --tail=50
$ oc describe pod payment-api-7d4f8b-xyz | grep -A5 Events
$ oc adm inspect ns/production --dest-dir=/tmp/inspect
💡 Pro Tip

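kubectl describe pod Events are your first stop for Pending pods. "Insufficient cpu" means the pod's CPU requests exceed what is still unreserved on the node (allocatable minus the requests of pods already placed there); limits play no part in scheduling. Fix requests or add nodes; tweaking limits alone won't schedule.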

⚖️ Trade-off

MostAllocated maximizes node utilization but increases blast radius when a node fails. LeastAllocated + topology spread improves resilience at the cost of more nodes and cross-AZ traffic.

Controller Manager

Kubernetes is a collection of control loops. Each controller watches a resource type and reconciles actual state toward desired state. The kube-controller-manager bundles dozens of these loops in one binary.

Reconciliation loop pattern

Pseudocode every controller follows:

  1. Watch API for changes (or periodic resync)
  2. Enqueue object key into work queue
  3. Read current state from API
  4. Compare to desired spec; compute diff
  5. Emit create/update/delete calls to API server
  6. Requeue on error with exponential backoff
flowchart LR
  D["Deployment\nreplicas: 3"] --> RS["Deployment controller\ncreates/updates RS"]
  RS --> RSC["ReplicaSet\nreplicas: 3"]
  RSC --> P["ReplicaSet controller\ncreates Pods"]
  P --> POD["3 Running Pods"]
  POD -->|"node failure"| P
  P -->|"recreate"| POD
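
The reconciliation pattern can be sketched as a function that compares desired and observed state and emits only the missing actions. This is a simplified illustration: reconcile, the pod name scheme, and the module-level counter are hypothetical, and the work queue and backoff steps are left out.

```python
import itertools

_ids = itertools.count(1)  # hypothetical source of unique pod name suffixes


def reconcile(desired: int, pods: list, name_prefix: str = "web") -> list:
    """Converge `pods` (names of healthy pods) toward `desired` replicas.

    Returns the actions taken. The decision depends only on the current
    state, not on which event triggered the call (level-triggered).
    """
    actions = []
    while len(pods) < desired:
        name = f"{name_prefix}-{next(_ids)}"
        pods.append(name)
        actions.append(("create", name))
    while len(pods) > desired:
        actions.append(("delete", pods.pop()))
    return actions
```

Calling reconcile again on an already converged state returns no actions, which is why a missed or duplicated event is harmless.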

Key controllers

Controller | Watches | Action
Deployment | Deployment | Manages ReplicaSets; rolling updates via RS scaling
ReplicaSet | ReplicaSet | Maintains pod count matching selector + replicas
StatefulSet | StatefulSet | Ordered pods with stable network ID + PVC templates
DaemonSet | DaemonSet | One pod per matching node (logging, CNI, kube-proxy)
Job / CronJob | Job, CronJob | Run-to-completion workloads; schedule CronJobs
Node | Node | Taint nodes on NotReady; evict pods after timeout
ServiceAccount | ServiceAccount, Namespace | Ensures a default ServiceAccount in every namespace; tokens come from the TokenRequest API (auto-created token Secrets are legacy, pre-1.24)
EndpointSlice | Service, Pod | Populate backend endpoints for Service VIP
Namespace | Namespace | Finalize deletion: remove all namespaced objects
PersistentVolume | PV, PVC | Bind claims; delete/retain per reclaim policy (recycle is deprecated)

Leader election

Only one kube-controller-manager instance is active per cluster. Standby replicas compete for a coordination.k8s.io/Lease object; the scheduler and cloud-controller-manager use the same pattern. When the leader stops renewing its Lease, a standby takes over within seconds, and its controllers rebuild their caches by listing and watching the API server.
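
Lease-based leader election can be sketched with a single shared record and a clock. This is an illustration under stated assumptions: the Lease class, try_acquire_or_renew, the instance names, and the 15-second duration are hypothetical, and the real mechanism relies on the API server's optimistic concurrency so that two candidates cannot both win an update.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration: float = 15.0  # seconds; assumed value for illustration


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if `candidate` holds the lease after this attempt."""
    expired = lease.holder is None or now - lease.renew_time > lease.duration
    if lease.holder == candidate or expired:
        lease.holder = candidate   # the current holder renews, or a standby takes over
        lease.renew_time = now
        return True
    return False                   # someone else holds a still-valid lease
```

A standby simply retries on a timer; it becomes leader only after the current holder has missed renewals for longer than the lease duration.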

yaml — Deployment triggers ReplicaSet reconciliation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
terminal — controller observability
$ kubectl rollout status deployment/web
$ kubectl get rs -l app=web --show-labels
→ Two ReplicaSets during rollout: old + new
$ kubectl get lease -n kube-system | grep controller
$ kubectl scale deployment web --replicas=5
$ oc rollout status deployment/web
$ oc get deploymentconfig 2>/dev/null || oc get deploy web -o yaml
→ DeploymentConfig is deprecated as of OCP 4.14; prefer Deployment. Existing DeploymentConfigs are still reconciled by the OpenShift controller
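
The effect of maxSurge and maxUnavailable on a rollout can be worked out with a little arithmetic. In this sketch resolve and rollout_bounds are hypothetical helpers; the rounding follows the documented Deployment behaviour, where a percentage maxSurge rounds up and a percentage maxUnavailable rounds down.

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Resolve an absolute number or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas: int, max_surge, max_unavailable) -> dict:
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,            # old + new pods at any moment
        "min_available_pods": replicas - unavailable,  # floor on available pods
    }


# The web Deployment in this section: replicas 3, maxSurge 1, maxUnavailable 0
web_bounds = rollout_bounds(3, 1, 0)
# {'max_total_pods': 4, 'min_available_pods': 3}
```

With maxUnavailable set to 0 the rollout can only progress by surging, so the cluster needs headroom for one extra pod.
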
🔬 Under the Hood

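Controllers are level-triggered, not edge-triggered. If a controller misses an event, the informer's periodic resync re-queues every cached object, so the system heals without a perfectly reliable watch stream. The interval is configurable and varies by controller; kube-controller-manager's --min-resync-period defaults to 12 hours.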

🎯 Interview Tip

"Deployment vs ReplicaSet?" — ReplicaSet ensures N pods exist. Deployment is a higher-level controller that owns ReplicaSets and implements rolling updates by creating a new RS and scaling down the old one—users rarely create ReplicaSets directly.

kubelet

The kubelet is the node agent. It watches the API for pods bound to its node, instructs the container runtime via CRI, runs probes, reports status, and enforces pod lifecycle on actual hardware.

Pod lifecycle on the node

  1. kubelet accepts pod spec (API-assigned or static manifest)
  2. Create pod sandbox (network namespace, wired up via CNI)
  3. Pull images via CRI if not cached
  4. Start init containers sequentially, then app containers
  5. Run liveness/readiness/startup probes
  6. Restart failed containers per restartPolicy
  7. Report PodStatus phases: Pending → Running → Succeeded/Failed

CRI path: containerd / CRI-O → runc

The kubelet speaks CRI (gRPC)—not Docker directly since K8s 1.24 dropped dockershim. containerd (default on many distros) or CRI-O (OpenShift default) pulls images, creates sandboxes, and invokes runc to start OCI bundles.

flowchart LR
  API["API Server"] -->|"pod spec"| KL["kubelet"]
  KL -->|"CRI gRPC"| RT["containerd / CRI-O"]
  RT --> RUN["runc"]
  RUN --> C["containers"]
  RT --> CNI["CNI ADD"]
  CNI --> C
  KL -->|"status"| API

Health probes

Probe | Purpose | Failure action
Startup | Slow-starting apps (JVM warmup) | Kill container; liveness disabled until success
Liveness | Is the process deadlocked? | Restart container
Readiness | Can this instance receive traffic? | Remove from Service endpoints (no restart)
yaml — probes and resources
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: api:2.1.0
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 15
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          memory: 1Gi
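
The probe settings translate into time budgets that are worth computing before tuning them. The helper names below are hypothetical, and the liveness estimate assumes the Kubernetes default failureThreshold of 3, since the manifest above does not set one.

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Longest time the container may take to start before it is killed."""
    return initial_delay_seconds + failure_threshold * period_seconds


def liveness_detection_seconds(period_seconds: int, failure_threshold: int = 3) -> int:
    """Approximate time until a hung container is restarted."""
    return failure_threshold * period_seconds


# startupProbe above: failureThreshold 30, periodSeconds 10
startup = startup_budget_seconds(30, 10)   # 300 seconds
# livenessProbe above: periodSeconds 15, default failureThreshold
liveness = liveness_detection_seconds(15)  # 45 seconds
```

The startup probe gives a slow JVM five minutes to come up, while the liveness probe keeps detection of a deadlocked process under a minute once the app is running.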

Node status, GC, and static pods

The kubelet posts NodeStatus (capacity, allocatable, conditions like Ready, DiskPressure). It garbage-collects dead containers and unused images. Static pods (manifests in /etc/kubernetes/manifests/) are managed locally by the kubelet, which creates a read-only mirror pod in the API server for each one so they show up in kubectl get pods. This is how control plane components run on kubeadm clusters.

terminal — node and pod diagnostics
$ kubectl describe node worker-1 | grep -A10 Conditions
$ kubectl get pods -o wide --field-selector spec.nodeName=worker-1
$ crictl pods && crictl ps
$ journalctl -u kubelet -f
$ oc describe node worker-1
→ containerRuntimeVersion shows cri-o://1.28.x on RHCOS
$ oc debug node/worker-1 -- chroot /host crictl ps
$ oc adm node-logs worker-1 -u kubelet

OpenShift: CRI-O default

RHCOS nodes run CRI-O exclusively; there is no containerd option. Image pulls integrate with the internal registry and with ImageContentSourcePolicy (superseded by ImageDigestMirrorSet in newer releases) for disconnected installs. Debugging goes through oc debug node/... because direct SSH access to nodes is discouraged.

⚠️ Pitfall

A liveness probe hitting the same endpoint as readiness restarts pods under load when the app is merely slow—not dead. Use startup probes for JVM apps; keep liveness checks lightweight and distinct from readiness.

📦 Real World

ImagePullBackOff at the kubelet layer means registry auth, missing image, or rate limit—not a scheduler issue. Check crictl pull on the node and imagePullSecrets before chasing Deployment controllers.

kube-proxy

Services get a stable virtual IP (ClusterIP). kube-proxy programs node networking so traffic to that VIP reaches healthy pod backends. It implements Service abstraction—not pod-to-pod routing.

What kube-proxy does

Watches Service and EndpointSlice objects. For each Service, installs rules mapping ClusterIP:port → pod IPs (load-balanced). Also handles NodePort and externalIPs by binding host ports or accepting external traffic per mode.

Modes: iptables vs IPVS vs eBPF

Mode | Mechanism | Trade-offs
iptables (default) | Chain of NAT rules per Service | Simple; O(n) rules scale poorly; random backend selection
IPVS | Kernel IP Virtual Server | Better scalability; L4 load balancing algorithms (rr, lc, dh)
eBPF / Cilium | Bypass kube-proxy; Cilium replaces it with eBPF maps | Lower latency; unified policy + LB; requires Cilium dataplane
flowchart LR
  C["Client pod"] -->|"ClusterIP:80"| KP["kube-proxy rules\niptables/IPVS"]
  KP --> P1["Pod 10.0.1.5:8080"]
  KP --> P2["Pod 10.0.2.7:8080"]
  C2["Pod A"] -->|"direct pod IP"| CNI["CNI routing"]
  CNI --> P3["Pod B"]
  note["kube-proxy does NOT handle Pod A → Pod B"] --- CNI
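
What kube-proxy programs into the node can be pictured as a lookup table from a Service VIP to its ready backends. This sketch assumes simplified dictionaries for Services and EndpointSlices; build_rules, pick_backend, and the addresses are hypothetical, and the random choice mimics the backend selection of iptables mode.

```python
import random


def build_rules(services, endpoint_slices):
    """Map (clusterIP, port) -> list of "podIP:targetPort" for ready endpoints."""
    rules = {}
    for svc in services:
        backends = []
        for es in endpoint_slices:
            if es["service"] != svc["name"]:
                continue
            for ep in es["endpoints"]:
                if ep["ready"]:  # unready pods never receive Service traffic
                    backends += [f"{addr}:{es['port']}" for addr in ep["addresses"]]
        rules[(svc["cluster_ip"], svc["port"])] = backends
    return rules


def pick_backend(rules, cluster_ip, port, rng=random):
    """Choose one backend at random, or None if the VIP has no ready endpoints."""
    backends = rules.get((cluster_ip, port), [])
    return rng.choice(backends) if backends else None


services = [{"name": "web", "cluster_ip": "172.30.0.10", "port": 80}]
slices = [{
    "service": "web",
    "port": 8080,
    "endpoints": [
        {"addresses": ["10.128.0.12"], "ready": True},
        {"addresses": ["10.128.0.13"], "ready": False},
    ],
}]
rules = build_rules(services, slices)
```

An empty backend list for a VIP corresponds to the familiar symptom of a Service with no endpoints, which points at selectors or readiness rather than at the proxy.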

What kube-proxy does NOT do

Pod-to-pod communication is handled by the CNI plugin (routing, overlay, eBPF). kube-proxy only intercepts traffic destined for Service VIPs. DNS resolution (my-svc.namespace.svc.cluster.local) is handled by CoreDNS, which is also a separate component.

yaml — Service and EndpointSlice
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: web-abc
  labels:
    kubernetes.io/service-name: web
addressType: IPv4
ports:
  - port: 8080
    protocol: TCP
endpoints:
  - addresses: ["10.128.0.12"]
    conditions:
      ready: true
    targetRef:
      kind: Pod
      name: web-7f8d9-xk2lp
terminal — service dataplane
$ kubectl get svc web -o wide
$ kubectl get endpointslices -l kubernetes.io/service-name=web
$ kubectl get cm -n kube-system kube-proxy -o yaml | grep mode
$ iptables-save -t nat | grep 'default/web'
$ oc get svc web
$ oc get network/cluster -o yaml
→ OVN-Kubernetes on OCP; kube-proxy may be disabled when eBPF/OVN handles Services
$ oc exec -it debug-pod -- curl -s http://web.default.svc:80/healthz
🔴 OpenShift

OCP 4.x defaults to OVN-Kubernetes CNI with distributed service load balancing. Depending on cluster version and network configuration, kube-proxy may not be the active dataplane—check the Network cluster operator before debugging iptables rules.

🔬 Under the Hood

EndpointSlice controller populates backends; kube-proxy reacts. If pods are Running but Service has no endpoints, check selector labels and readiness probes—not kube-proxy itself.

OpenShift Control Plane Additions

OpenShift wraps upstream Kubernetes with operators that manage cluster lifecycle, node OS, platform services, and enterprise integrations. Understanding these is essential for OCP operations and CKA-adjacent SRE work.

Machine Config Operator (MCO)

RHCOS nodes are immutable—no yum install or SSH patching. MCO renders MachineConfig objects into Ignition configs, drains nodes, and reboots to apply kernel args, kubelet settings, registry certs, and chrony configuration.

Cluster Version Operator (CVO)

CVO drives cluster upgrades—OCP x.y.z → next z-stream or minor version. It coordinates image updates across control plane operators, waits for health, and surfaces progress via ClusterVersion status. Blocked upgrades often mean ClusterOperators are Degraded.

Operator Lifecycle Manager (OLM)

OLM installs and upgrades optional (add-on) operators from OperatorHub: CSV (ClusterServiceVersion) lifecycle, CRD ownership, and dependency resolution. The platform ClusterOperators themselves are managed by CVO, not OLM. Platform teams install cert-manager, Service Mesh, or custom operators through OLM.

Image Registry Operator

Provides a default internal registry (image-registry.openshift-image-registry.svc:5000) or integrates with external S3/GCS. Manages TLS, storage PVC, and routing for oc import-image workflows.

Authentication Operator

Configures OAuth server, identity providers (LDAP, HTPasswd, OIDC), and console login. Integrates with Kubernetes RBAC via OpenShift groups mapped to ClusterRoleBindings.

RHCOS (Red Hat CoreOS)

Minimal, immutable OS purpose-built for containers. Nodes join via Ignition on first boot; updates ship as OSTree images applied by MCO during upgrades—same mechanism as Fedora CoreOS, enterprise-hardened for OCP.

flowchart TB
  CVO["Cluster Version Operator"] --> CO["ClusterOperators\n30+ platform operators"]
  MCO["Machine Config Operator"] --> RHCOS["RHCOS nodes\nIgnition + OSTree"]
  OLM["Operator Lifecycle Manager"] --> OH["OperatorHub CSVs"]
  AUTH["Authentication Operator"] --> OAuth["OAuth / IdP"]
  REG["Image Registry Operator"] --> IR["Internal registry"]
  CO --> API["kube-apiserver"]
  RHCOS --> KL["kubelet + CRI-O"]
yaml — MachineConfig pool (conceptual)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
  maxUnavailable: 1
terminal — OCP control plane health
$ # vanilla K8s has no ClusterOperators — use component logs
$ kubectl get pods -n kube-system
$ oc get clusteroperators
$ oc get clusterversion version
$ oc get mcp
→ MASTER/WORKER pools show UPDATED/DEGRADED/UPDATING
$ oc get co authentication image-registry kube-apiserver -o yaml
$ oc adm upgrade --to=4.14.12
💡 Pro Tip

Start every OCP incident with oc get co. If kube-apiserver or etcd operator is Degraded, workload symptoms are downstream. Fix platform operators before debugging application Deployments.

⚖️ Trade-off

OCP's opinionated stack (CRI-O, OVN, SCC, internal registry) reduces integration toil but increases migration friction from vanilla K8s manifests. Design for portable core APIs; isolate OCP-specific Routes and SCC grants.