Multi-Cluster & Advanced OpenShift

Multi-Cluster Patterns

Organizations outgrow a single cluster when availability, data residency, or team autonomy require separate control planes. Multi-cluster is not “Kubernetes but bigger”—it is a deliberate architecture of hubs, spokes, and traffic patterns with explicit failover semantics.

Why multi-cluster?

High availability (HA) — survive AZ, region, or cloud-provider outages by running workloads in more than one cluster; DNS/global load balancing steers traffic.
Disaster recovery (DR) — warm or cold standby cluster with replicated state (Velero, DB replication, object storage); RPO/RTO defined by how often you sync and how fast you cut over.
Compliance & isolation — PCI/HIPAA/GDPR data stays in-region clusters; production vs non-prod never share etcd; blast radius contained per tenant or business unit.
Scale & specialization — etcd and API server limits (~5k nodes practical per cluster); GPU/edge/batch clusters tuned independently.

Federation vs management hub

Kubernetes federation (historically KubeFed; today often GitOps + policy engines) pushes the same configuration to many clusters—namespaces, RBAC, apps—while each cluster retains its own control plane. A management hub (ACM, Rancher, fleet controllers) adds inventory, lifecycle, observability, and policy enforcement on top of that sync model. Federation answers “deploy everywhere”; the hub answers “who owns what, is it compliant, and is it healthy?”

Active-active vs active-passive

Pattern	Behavior	Trade-offs
Active-active	Traffic served from multiple clusters simultaneously; global LB or DNS weighted routing	Requires shared or replicated state (sessions, DB writes), conflict handling, consistent config across clusters
Active-passive	Primary cluster serves traffic; secondary on standby until failover	Simpler data model; lower cost; failover drill and RTO testing are mandatory
Active-passive (warm)	Standby runs scaled-down replicas; scale up on failover	Balance of cost and RTO—common for stateless tiers with external DB

flowchart TB
  subgraph mgmt["Management hub cluster"]
    ACM["Red Hat ACM\npolicies + GitOps"]
    OBS["Federated observability\nmetrics / logs / alerts"]
    GIT["Git / Argo CD\nApplicationSet"]
  end
  subgraph prod["Production clusters"]
    C1["Cluster A\nus-east active"]
    C2["Cluster B\neu-west active"]
  end
  subgraph dr["DR / edge"]
    C3["Cluster C\nus-west passive"]
    C4["Cluster D\nedge HCP"]
  end
  GLB["Global LB / DNS\nactive-active or failover"]
  GIT --> ACM
  ACM -->|ManagedCluster CR| C1
  ACM --> C2
  ACM --> C3
  ACM --> C4
  OBS --> C1
  OBS --> C2
  OBS --> C3
  GLB --> C1
  GLB --> C2
  C1 -.->|Velero / async repl| C3

$ kubectl config get-contexts
$ kubectl config use-context prod-us-east
$ kubectl get nodes -o wide
$ kubectl config use-context prod-eu-west
$ kubectl get deploy -A --field-selector metadata.namespace=team-payments

$ oc config get-contexts
$ oc login --token=<token> --server=https://api.hub.example:6443
$ oc get managedcluster
$ oc get managedcluster prod-us-east -o jsonpath='{.status.conditions[?(@.type=="ManagedClusterConditionAvailable")].status}'

⚖️ Trade-off

One big cluster vs many small clusters: Fewer clusters mean simpler networking and lower ops overhead, but larger blast radius and harder compliance boundaries. Many clusters increase GitOps/policy complexity but isolate failures and let teams upgrade on different schedules.

⚠️ Pitfall

Active-active without solving data consistency causes split-brain writes and duplicate charges. Stateless tiers can go active-active early; stateful tiers usually stay active-passive until you have multi-primary DB or CRDT-style semantics.

🎯 Interview Tip

"When would you add a second cluster?" — Tie answer to HA/DR RTO/RPO, regulatory region, or etcd/API limits—not vanity. Mention hub-spoke governance (ACM), GitOps ApplicationSet for fan-out, and explicit active-active vs passive DR semantics.

📦 Real World

Retail platforms run active-active stateless checkout in two regions with a single primary PostgreSQL (RDS cross-region read replica for DR). Black Friday traffic uses global LB; DR cluster stays warm with HPA minReplicas=1 on critical Deployments.
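
A warm standby of this kind could be expressed as an HPA on the DR cluster. This is a minimal sketch: the Deployment name checkout, the namespace team-checkout, maxReplicas, and the CPU target are assumptions for illustration, not values from the platform described above.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout              # hypothetical Deployment name
  namespace: team-checkout    # hypothetical namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 1              # warm-standby floor while the cluster is passive
  maxReplicas: 20             # assumed ceiling once traffic fails over
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target
```

After failover the HPA scales out from the floor as load arrives; raising minReplicas ahead of a planned cutover shortens that ramp.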

Red Hat Advanced Cluster Management (ACM)

ACM is Red Hat's multicluster control plane for OpenShift and Kubernetes. Install it on a hub cluster; import or provision managed clusters as spokes. ACM centralizes lifecycle, policy, application delivery, and observability without merging etcd into one mega-cluster.

Hub + managed clusters

The hub runs MultiClusterHub (ACM operator) and stores ManagedCluster objects—one per imported spoke. Agents on each spoke (klusterlet) sync status upward and apply hub-directed configuration. Import via generated bootstrap script, auto-import on ROSA/ARO, or provision new clusters with Hive (below).

ManagedCluster — registration, labels (env, region, cost-center), availability conditions
ManagedClusterSet — group clusters for RBAC and policy targeting
Placement + PlacementRule — select clusters by labels for apps and policies
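
As a rough sketch of how these objects fit together, the pair below binds a cluster set into a namespace and selects its production clusters by label. The set name prod and the label environment=prod are assumptions; the Placement name prod-clusters is chosen to match the PlacementBinding shown later in this section. Placement is the current API, while PlacementRule is the older one it supersedes.

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta2
kind: ManagedClusterSetBinding
metadata:
  name: prod                  # must equal the ManagedClusterSet name
  namespace: policies
spec:
  clusterSet: prod            # hypothetical ManagedClusterSet
---
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: prod-clusters
  namespace: policies
spec:
  clusterSets:
    - prod                    # only sets bound into this namespace are eligible
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            environment: prod # assumed label applied at import
```

Without the binding, a Placement in the policies namespace selects nothing, which is what makes ManagedClusterSet useful as an RBAC boundary.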

Hive lifecycle

Hive (bundled with ACM) provisions OpenShift clusters via ClusterDeployment CRs—cloud credentials, base domain, install-config, and worker pool sizes declared in YAML. Hive creates the install pod, waits for success, and hands off a registered ManagedCluster. Destroy by deleting the ClusterDeployment (with protection annotations for prod).
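
A ClusterDeployment along these lines might look like the sketch below. All names (Secrets, base domain, ClusterImageSet) are placeholders, and the referenced Secrets and image set are assumed to exist already; worker pool sizing normally lives in a separate Hive MachinePool object that is not shown.

```yaml
apiVersion: hive.openshift.io/v1
kind: ClusterDeployment
metadata:
  name: prod-us-east
  namespace: prod-us-east               # hypothetical: one namespace per cluster
spec:
  clusterName: prod-us-east
  baseDomain: example.com               # placeholder base domain
  platform:
    aws:
      region: us-east-1
      credentialsSecretRef:
        name: aws-creds                 # placeholder Secret with cloud credentials
  provisioning:
    imageSetRef:
      name: openshift-release-example   # placeholder ClusterImageSet name
    installConfigSecretRef:
      name: prod-us-east-install-config # Secret holding install-config.yaml
  pullSecretRef:
    name: pull-secret                   # placeholder pull secret
```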

apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: require-network-policies
  namespace: policies
  annotations:
    policy.open-cluster-management.io/standards: NIST SP 800-53
spec:
  remediationAction: enforce          # inform | enforce
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: namespace-must-have-netpol
        spec:
          remediationAction: enforce
          severity: high
          namespaceSelector:
            include: ["team-*"]
          object-templates:
            - complianceType: musthave
              objectDefinition:
                apiVersion: networking.k8s.io/v1
                kind: NetworkPolicy
                metadata:
                  name: default-deny-ingress
                spec:
                  podSelector: {}
                  policyTypes:
                    - Ingress
---
apiVersion: policy.open-cluster-management.io/v1
kind: PlacementBinding
metadata:
  name: bind-netpol-policy
  namespace: policies
placementRef:
  name: prod-clusters
  kind: Placement
  apiGroup: cluster.open-cluster-management.io
subjects:
  - name: require-network-policies
    kind: Policy
    apiGroup: policy.open-cluster-management.io

Policy enforcement

ACM Policy objects wrap templates (ConfigurationPolicy, Gatekeeper constraints, CertificatePolicy, IamPolicy). Policies bind to clusters via PlacementBinding. remediationAction: inform reports violations in the console; enforce creates/remediates resources on spokes. Per-cluster compliance is aggregated in the status of the Policy on the hub.

ApplicationSet via ACM

ACM integrates with Argo CD (OpenShift GitOps) using multicluster ApplicationSets—generators read cluster labels from the hub and render one Argo Application per matching spoke. Push once to git; every prod cluster in region=us receives the same manifest with cluster-specific overlays.
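
The fan-out described above can be sketched with the ApplicationSet cluster decision resource generator. This assumes a Placement named prod-clusters exists in the openshift-gitops namespace, that the clusters are registered with Argo CD, and that a ConfigMap named acm-placement describes the PlacementDecision resource; the repository URL and overlay path are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments
  namespace: openshift-gitops
spec:
  generators:
    - clusterDecisionResource:
        configMapRef: acm-placement     # assumed ConfigMap pointing at PlacementDecision
        labelSelector:
          matchLabels:
            cluster.open-cluster-management.io/placement: prod-clusters
        requeueAfterSeconds: 180
  template:
    metadata:
      name: 'payments-{{name}}'         # one Application per selected cluster
    spec:
      project: default
      source:
        repoURL: https://git.example/platform/payments.git   # hypothetical repo
        targetRevision: main
        path: 'overlays/{{name}}'       # assumed per-cluster overlay layout
      destination:
        server: '{{server}}'
        namespace: team-payments
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Because the generator follows placement decisions, relabeling a cluster adds or removes its Application without touching the ApplicationSet.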

Federated observability

ACM observability addon deploys metrics collectors on spokes and forwards to Thanos/Prometheus on the hub, giving unified dashboards, alert routing, and SLO views across the fleet. Correlate with Red Hat Insights or your existing Grafana stack. Spoke telemetry is labeled with the ManagedCluster name, so dashboards can drill down per cluster.
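
Enabling the hub side usually comes down to one custom resource. The fragment below is a minimal sketch that assumes a Secret named thanos-object-storage, holding the object storage configuration under the key thanos.yaml, already exists in the observability namespace.

```yaml
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  observabilityAddonSpec: {}        # defaults for the spoke collectors
  storageConfig:
    metricObjectStorage:
      name: thanos-object-storage   # assumed Secret with bucket credentials
      key: thanos.yaml
```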

$ # ACM is OpenShift-first; vanilla K8s uses open-cluster-management community
$ kubectl get managedcluster
$ kubectl get policy -A
$ kubectl get configurationpolicy -A

$ oc get multiclusterhub -n open-cluster-management
$ oc get managedcluster
$ oc get clusterdeployment -A
$ oc get policy -n policies
$ oc get placement -A
$ oc get multiclusterobservability

🔴 OpenShift

Install ACM from OperatorHub on a dedicated hub cluster (not a production workload cluster). Red Hat recommends sizing the hub for 100+ managed clusters and keeping its etcd separate from spoke workloads. Use the multicluster engine operator (MCE) when you only need cluster lifecycle, or full ACM for policy/GitOps.

🔬 Under the Hood

The klusterlet on each spoke registers with the hub's ManagedCluster API. The policy propagator on the hub replicates Policy CRs into each managed cluster's namespace; the policy framework add-on on the spoke syncs them down, evaluates the templates, and reports compliance back into the Policy status. No central kube-apiserver proxy; each spoke retains its own API server for workload isolation.

⚙️ Config

Label managed clusters at import: environment=prod, region=us-east, pci=true. Placement rules select on these labels—avoid hardcoding cluster names in ApplicationSet generators.

🔒 Security

Hub credentials are crown jewels—protect hub RBAC, enable audit logging, and rotate import tokens. Use ManagedClusterSet RBAC so team A cannot view or policy-bind team B's clusters.

💡 Pro Tip

Start policies in inform mode for two release cycles—fix drift on spokes before switching to enforce. Sudden enforce on NetworkPolicy can drop legitimate traffic.

OpenShift Hosted Control Planes (HCP)

Traditional OpenShift runs the control plane (API server, etcd, controllers) on dedicated master nodes inside each cluster. Hosted control planes move that control plane into pods on a separate management / hosting cluster—workers remain on the data plane cluster, but masters are virtualized.

Control plane as pods on the management cluster

Each hosted cluster gets a namespace (or dedicated segment) on the management cluster containing etcd, kube-apiserver, kube-controller-manager, and OpenShift-specific operators as Deployments/StatefulSets. The HostedCluster CR defines the control plane; worker nodes are declared in NodePool CRs, join the hosted API server, and never run etcd locally. Blast radius: compromise of one worker does not expose etcd on the same node.

HyperShift

HyperShift is the upstream open-source project (maintained in the OpenShift organization, not a CNCF project) implementing hosted control planes for OpenShift. The HyperShift operator on the management cluster reconciles HostedCluster and NodePool objects, scaling worker pools independently of control plane count. Supported on AWS, Azure, GCP, bare metal (provider matrix evolves per OCP release).
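
To make the worker-pool side concrete, a NodePool for a hosted cluster named tenant-a might look roughly like this. It is a partial sketch: the instance type and replica count are assumptions, the release image is left as a placeholder, and AWS details such as subnet and instance profile are omitted.

```yaml
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: tenant-a-workers
  namespace: clusters             # same namespace as the HostedCluster object
spec:
  clusterName: tenant-a           # hypothetical HostedCluster name
  replicas: 3                     # scale workers without touching the control plane
  management:
    upgradeType: Replace          # roll nodes by replacement on upgrade
    autoRepair: true
  platform:
    type: AWS
    aws:
      instanceType: m5.xlarge     # assumed instance size
  release:
    image: <ocp-release-image>    # placeholder: pin to the release you run
```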

ROSA HCP

Red Hat OpenShift Service on AWS (ROSA) with HCP is the managed offering: Red Hat operates the management cluster; customers get fast cluster create/delete, per-cluster billing, and fewer EC2 instances (no 3+ dedicated masters per cluster). Ideal for SaaS multi-tenancy, ephemeral preview environments, and edge-style footprints.

Cost & speed benefits

Dimension	Traditional OCP	Hosted control planes
Control plane cost	3+ large instances per cluster 24/7	Shared management pool amortized across many hosted clusters
Cluster create time	~45–60 min full install	Often ~10–15 min—control plane pods schedule immediately
Density	One control plane per cluster	Hundreds of hosted clusters on one management cluster (within limits)
Upgrade model	CVO per cluster	Management operator rolls control plane versions; NodePools can stagger workers

flowchart TB
  subgraph mgmt["Management cluster (HyperShift)"]
    HSO["HyperShift operator"]
    subgraph hc["Namespace: hosted-cluster-1"]
      ETCD["etcd StatefulSet"]
      API["kube-apiserver"]
      CPO["Cluster operators\nas pods"]
    end
  end
  subgraph workers["Data plane — hosted cluster 1"]
    NP["NodePool workers"]
    WL["User workloads"]
  end
  HSO --> hc
  API --> NP
  NP --> WL

$ # HyperShift CRs live on management cluster — use hosted kubeconfig for workloads
$ kubectl get hostedcluster -A
$ kubectl get nodepool -A
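$ kubectl get pods -n clusters-hosted-cluster-1

$ oc get hostedcluster -A
$ oc get nodepool -A
$ # hcp CLI; pull secret, cloud credentials, base domain, and region flags omitted
$ hcp create cluster aws --name tenant-a --node-pool-replicas 3
$ hcp create kubeconfig --name tenant-a --namespace clusters > tenant-a.kubeconfig
$ # ROSA HCP: rosa CLI (cluster name, STS, and subnet flags omitted)
$ rosa create cluster --hosted-cp --region us-east-1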

⚖️ Trade-off

HCP vs traditional: HCP wins on density and provisioning speed; traditional wins when you need full isolation of control plane hardware, air-gapped installs, or providers without HCP support. Management cluster failure affects many hosted clusters—design HA management tier first.

⚠️ Pitfall

Under-provisioned management clusters cause API latency across all hosted clusters. Monitor etcd and apiserver SLOs on the management plane separately from tenant workload metrics.

📦 Real World

SaaS vendors on ROSA HCP spin up one hosted cluster per enterprise customer—delete the HostedCluster when the trial ends without orphaned master EC2 instances.

🎯 Interview Tip

Explain HCP as "control plane virtualization": control planes run as pods on a shared management cluster instead of on dedicated master nodes, much as VMs share a host instead of each owning bare metal. Mention HyperShift operator, NodePool worker scaling, and why ROSA HCP reduces per-cluster infrastructure cost.

Cluster API (CAPI)

Cluster API is a Kubernetes-native project for declarative cluster lifecycle—create, scale, upgrade, and delete clusters using CRDs and controllers, the same reconciliation model as Deployments for pods. Infrastructure providers plug in AWS, Azure, GCP, vSphere, and bare metal.

Declarative cluster lifecycle

CAPI separates concerns into core, bootstrap, control plane, and infrastructure providers:

Cluster — links control plane + infrastructure; sets Kubernetes version
Machine — one node; created/deleted by controllers
MachineDeployment — desired worker count, rollout strategy (like Deployment → ReplicaSet → Pod)
MachineSet — immutable worker template snapshot
KubeadmControlPlane — managed control plane nodes on vanilla K8s installs
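
How a Cluster ties these pieces together can be sketched as follows. The object names and pod CIDR are assumptions; note that without ClusterClass topology the Kubernetes version is carried by the KubeadmControlPlane and MachineDeployment objects that the Cluster references, not by the Cluster spec itself.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-us-east
  namespace: cluster-prod
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]    # assumed pod CIDR
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: prod-us-east-control-plane    # hypothetical control plane object
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: prod-us-east                  # hypothetical provider object
```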

Providers: AWS / Azure / GCP / vSphere

Provider	Infrastructure CRDs	Notes
CAPA (AWS)	AWSCluster, AWSMachine	EC2 instances, ELB for API; common on EKS-adjacent self-managed clusters
CAPZ (Azure)	AzureCluster, AzureMachine	VMSS, Azure LB; integrates with Azure AD workload identity patterns
CAPG (GCP)	GCPCluster, GCPMachine	GCE instances, GCP load balancers
CAPV (vSphere)	VSphereCluster, VSphereMachine	On-prem standard—VM templates, vCenter credentials via Secret

MachineDeployment

Scale workers by editing spec.replicas on MachineDeployment. Rolling upgrades change the MachineTemplate reference—controller creates new Machines, drains old nodes, respects maxSurge / maxUnavailable. Same mental model as Deployment rollouts, but nodes instead of pods.

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: prod-workers
  namespace: cluster-prod
spec:
  clusterName: prod-us-east
  replicas: 5
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: prod-us-east
      pool: workers
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: prod-us-east
        pool: workers
    spec:
      clusterName: prod-us-east
      version: v1.29.4
      bootstrap:
        configRef:                 # v1beta1 object references take apiVersion, not apiGroup
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: prod-workers-bootstrap
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachineTemplate
        name: prod-workers-infra
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

GitOps compatible

Store Cluster, MachineDeployment, and provider templates in git; Argo CD or Flux on the management cluster reconciles them. ClusterClass (CAPI v1beta1) defines reusable templates—platform teams expose a small set of approved cluster shapes; app teams request via PR. Hive/ACM overlap: ACM uses Hive for OpenShift; CAPI is cloud-neutral and upstream-K8s-first.

$ clusterctl init --infrastructure aws
$ kubectl get clusters -A
$ kubectl get machinedeployment -A
$ kubectl get machines -A
$ clusterctl move --to-kubeconfig target-mgmt.kubeconfig
→ migrate CAPI objects to a new management cluster

$ # OpenShift uses Hive/ACM for OCP lifecycle; CAPI common for upstream K8s on OCP management
$ oc get clusters.cluster.x-k8s.io -A
$ oc get machinedeployment -A
$ oc describe cluster prod-us-east -n cluster-prod

🔬 Under the Hood

CAPI controllers run on the management cluster, not on the workload cluster being created. The bootstrap provider (kubeadm) generates cloud-init/ignition; the infra provider creates VMs and joins them. clusterctl installs provider components and handles version compatibility matrices.

⚙️ Config

Pin provider versions in clusterctl.yaml and test upgrades on a sandbox management cluster first. Use MachineHealthCheck to auto-remediate unhealthy nodes: it replaces Machines whose nodes stay unhealthy, the same self-healing idea as liveness probes restarting failed containers.
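
A MachineHealthCheck for a worker pool could be sketched like this. The timeouts and the maxUnhealthy threshold are assumptions to tune per environment; the selector assumes worker Machines carry the label pool: workers.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: prod-workers-unhealthy
  namespace: cluster-prod
spec:
  clusterName: prod-us-east
  maxUnhealthy: 40%            # stop remediating if too many nodes fail at once
  nodeStartupTimeout: 10m      # assumed grace period for new nodes to join
  selector:
    matchLabels:
      pool: workers
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 5m
    - type: Ready
      status: "False"
      timeout: 5m
```

The maxUnhealthy guard matters: during a zone outage it keeps the controller from deleting most of the pool in one pass.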

⚠️ Pitfall

Deleting a Cluster CR without finalizer awareness can orphan cloud VMs that keep costing money. Always verify that kubectl get awscluster (or the provider equivalent) no longer returns the object; use cloud tags for cleanup automation.

💡 Pro Tip

For OpenShift on AWS, many teams choose ROSA/ROSA HCP over self-managed CAPI+OCP install—CAPI shines when you need upstream Kubernetes uniformity across clouds with one GitOps repo.

Service Mesh (Istio / OpenShift Service Mesh)

A service mesh adds L7 traffic management, security, and observability via data-plane proxies (Envoy) colocated with workloads. Control plane (istiod / OSSM operator) pushes config; sidecars intercept east-west and ingress traffic without app code changes.

Sidecar injection & Envoy

Enable injection per namespace (istio-injection=enabled or OSSM labels). The mutating webhook adds an Envoy sidecar container to each pod—iptables/eBPF redirects traffic through Envoy. Sidecars handle retries, timeouts, mTLS, and telemetry export. Cost: ~50–100MB RAM per sidecar; plan node capacity accordingly.

VirtualService / DestinationRule / Gateway

VirtualService — route rules: match URI/headers, split traffic by weight (canary), retries, timeouts, fault injection
DestinationRule — policies applied after routing: subsets (v1/v2), load balancing (LEAST_REQUEST), connection pool, outlier detection
Gateway — north-south entry (often with OpenShift Route or cloud LB); binds external ports to internal VirtualServices

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: payments-api
  namespace: team-payments
spec:
  host: payments-api.team-payments.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: payments-api
  namespace: team-payments
spec:
  hosts:
    - payments-api.team-payments.svc.cluster.local
  http:
    - route:
        - destination:
            host: payments-api
            subset: v1
          weight: 90
        - destination:
            host: payments-api
            subset: v2
          weight: 10
      retries:
        attempts: 3
        perTryTimeout: 2s
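
To round out the three resource types, a Gateway for north-south traffic might look like the sketch below. The external hostname, the TLS Secret name, and the ingress gateway selector label are assumptions; on OpenShift the gateway is typically fronted by a Route.

```yaml
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: payments-gateway
  namespace: team-payments
spec:
  selector:
    istio: ingressgateway            # assumed label on the ingress gateway pods
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: payments-tls # hypothetical TLS Secret
      hosts:
        - payments.example.com       # hypothetical external host
```

A VirtualService attaches to it by listing payments-gateway under spec.gateways and the external host under spec.hosts.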

mTLS

Istio/OSSM issues workload certificates via the control plane CA. PeerAuthentication sets mesh-wide or per-namespace STRICT mTLS, under which plaintext pod-to-pod traffic is rejected. On the client side, trafficPolicy.tls.mode: ISTIO_MUTUAL in a DestinationRule makes sidecars present their workload certificates. Start with PERMISSIVE during migration, then enforce STRICT.
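
A namespace-scoped policy for the migration phase could be as small as the fragment below. The apiVersion assumes an Istio release that serves security.istio.io/v1 (1.22 or later); older releases use v1beta1 with the same fields.

```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default                # "default" with no selector applies namespace-wide
  namespace: team-payments
spec:
  mtls:
    mode: PERMISSIVE           # accept both plaintext and mTLS while migrating
```

Once every client in the namespace runs a sidecar, changing mode to STRICT is the only edit needed; applying the same object in the mesh root namespace makes it mesh-wide.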

OSSM vs upstream Istio (Maistra)

OpenShift Service Mesh (OSSM) is Red Hat's supported distribution built on upstream Istio, historically via the Maistra fork (OpenShift 4.x integration layer). OSSM adds:

Operator lifecycle tied to OCP versions and Red Hat support
Integration with OpenShift Routes, SCCs, and platform monitoring
Curated Istio version + CVE backports between upstream releases
Upstream Istio direct install possible but unsupported on OCP for production

ServiceMeshMemberRoll

On OpenShift, the control plane lives in istio-system (or openshift-operators namespace depending on version). Tenant namespaces join the mesh via ServiceMeshMemberRoll—lists namespaces allowed to use the shared control plane and receive sidecar injection.

apiVersion: maistra.io/v1
kind: ServiceMeshMemberRoll
metadata:
  name: default
  namespace: istio-system
spec:
  members:
    - team-payments
    - team-checkout
    - team-identity
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    istio-injection: enabled

Kiali + Jaeger bundled

OSSM ships optional addons: Kiali (service graph, health, config validation) and Jaeger (distributed tracing for mesh spans). Enable via ServiceMeshControlPlane / Istio CR addon components. Traces complement Prometheus RED metrics from Envoy—see retry storms and upstream 503 sources in one view.

flowchart LR
  subgraph cp["Control plane"]
    ISTIOD["istiod / OSSM operator"]
    KIALI["Kiali"]
    JAEG["Jaeger"]
  end
  subgraph ns["Namespace: team-payments"]
    P1["payments-api\n+ Envoy sidecar"]
    P2["ledger-svc\n+ Envoy sidecar"]
  end
  GW["Gateway / Route"]
  ISTIOD --> P1
  ISTIOD --> P2
  GW --> P1
  P1 -->|mTLS| P2
  P1 --> JAEG
  P2 --> JAEG
  KIALI --> P1

$ kubectl get ns --show-labels | grep istio-injection
$ kubectl label namespace team-payments istio-injection=enabled --overwrite
$ kubectl get virtualservice,destinationrule,gateway -n team-payments
$ kubectl exec deploy/payments-api -n team-payments -c istio-proxy -- pilot-agent request GET config_dump
$ istioctl analyze -n team-payments

$ oc get smcp -n istio-system
$ oc get servicemeshmemberroll -n istio-system -o yaml
$ oc get servicemeshmember -A
$ oc get route -n istio-system kiali
$ oc get route -n istio-system jaeger
$ oc adm policy add-scc-to-user anyuid -z default -n team-payments
→ only if sidecar UID conflicts; prefer restricted SCC + proper securityContext

🔴 OpenShift

Install OSSM via OperatorHub (Kiali Operator, Red Hat OpenShift Service Mesh Operator, Jaeger Operator). Use the ServiceMeshControlPlane v2 CR rather than a manual istioctl install on production OCP. Add namespaces to ServiceMeshMemberRoll before enabling injection.

🔒 Security

Move to PeerAuthentication: STRICT after validating all clients use sidecars or mesh gateways. AuthorizationPolicy CRs enforce L7 allow/deny (JWT claims, paths)—network policy alone is insufficient for HTTP semantics.
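
As an illustration of an L7 rule, the policy below allows only one caller identity to POST to a single path on payments-api; with an ALLOW policy in place, other requests to the selected workload are denied. The service account checkout, the label app: payments-api, and the path are hypothetical, and the principal check relies on mTLS being active between the workloads.

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: payments-api-allow-checkout
  namespace: team-payments
spec:
  selector:
    matchLabels:
      app: payments-api              # assumed workload label
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/team-checkout/sa/checkout   # hypothetical caller
      to:
        - operation:
            methods: ["POST"]
            paths: ["/v1/charges"]   # hypothetical path
```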

⚠️ Pitfall

Headless Services, raw TCP databases, and Jobs that ignore sidecar shutdown can hang with Istio sidecars. Use holdApplicationUntilProxyStarts and sidecar exit hooks, or exclude ports via annotation traffic.sidecar.istio.io/excludeOutboundPorts.
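
Both mitigations are set as pod annotations. The Deployment below is a sketch only: the image is a placeholder, and port 5432 assumes a PostgreSQL database that should bypass the sidecar.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ledger-svc
  namespace: team-payments
spec:
  selector:
    matchLabels:
      app: ledger-svc
  template:
    metadata:
      labels:
        app: ledger-svc
      annotations:
        # Start the app container only after Envoy is ready
        proxy.istio.io/config: |
          holdApplicationUntilProxyStarts: true
        # Send database traffic around the sidecar (assumed PostgreSQL port)
        traffic.sidecar.istio.io/excludeOutboundPorts: "5432"
    spec:
      containers:
        - name: ledger-svc
          image: registry.example/ledger-svc:1.0.0   # placeholder image
```

Traffic on excluded ports loses mesh mTLS and telemetry, so protect that path another way, for example with database-native TLS.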

⚖️ Trade-off

Mesh vs ingress + network policy: Mesh adds mTLS, L7 routing, and deep telemetry at CPU/RAM cost. Simple north-south APIs may need only Ingress/Route + NetworkPolicy; mesh pays off with many east-west microservices and canary requirements.

🎯 Interview Tip

Draw data plane (Envoy sidecar) vs control plane (istiod). Explain VirtualService vs DestinationRule split. On OCP, mention ServiceMeshMemberRoll and why OSSM/Maistra exists (support matrix, SCC, Routes).

💡 Pro Tip

Run istioctl analyze (or Kiali validation) in CI on mesh YAML before merge; misconfigured VirtualServices fail silently and surface as unexpected 404/503 at runtime.