Multi-Cluster & Advanced OpenShift
One cluster is a starting point. Production platforms span regions, tenants, and compliance boundaries, each with its own control plane, upgrade cadence, and blast radius. This chapter covers why and how to operate many clusters: federation patterns, Red Hat ACM governance, HyperShift hosted control planes, Cluster API provisioning, and service mesh for cross-cluster traffic and zero-trust networking.
Multi-Cluster Patterns
Organizations outgrow a single cluster when availability, data residency, or team autonomy require separate control planes. Multi-cluster is not “Kubernetes but bigger”—it is a deliberate architecture of hubs, spokes, and traffic patterns with explicit failover semantics.
Why multi-cluster?
- High availability (HA) — survive AZ, region, or cloud-provider outages by running workloads in more than one cluster; DNS/global load balancing steers traffic.
- Disaster recovery (DR) — warm or cold standby cluster with replicated state (Velero, DB replication, object storage); RPO/RTO defined by how often you sync and how fast you cut over.
- Compliance & isolation — PCI/HIPAA/GDPR data stays in-region clusters; production vs non-prod never share etcd; blast radius contained per tenant or business unit.
- Scale & specialization — etcd and API server limits (~5k nodes practical per cluster); GPU/edge/batch clusters tuned independently.
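To make the RPO point concrete, here is a minimal sketch of a Velero Schedule. It assumes Velero runs in a namespace named velero (OADP on OpenShift typically uses openshift-adp) and that a hypothetical team-payments namespace is what needs protecting. The backup interval bounds the achievable RPO for the Kubernetes objects it covers; database contents usually need their own replication.

```yaml
# Sketch only: names and the hourly interval are illustrative assumptions.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: payments-hourly
  namespace: velero
spec:
  schedule: "0 * * * *"        # cron syntax: one backup per hour, so RPO is at most ~1h
  template:
    includedNamespaces:
      - team-payments
    ttl: 72h0m0s               # keep each backup for three days
```

Restoring these backups on the standby cluster on a regular basis is what turns the schedule into a tested RTO rather than an assumed one.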
Federation vs management hub
Kubernetes federation (historically KubeFed; today often GitOps + policy engines) pushes the same configuration to many clusters—namespaces, RBAC, apps—while each cluster retains its own control plane. A management hub (ACM, Rancher, fleet controllers) adds inventory, lifecycle, observability, and policy enforcement on top of that sync model. Federation answers “deploy everywhere”; the hub answers “who owns what, is it compliant, and is it healthy?”
Active-active vs active-passive
| Pattern | Behavior | Trade-offs |
|---|---|---|
| Active-active | Traffic served from multiple clusters simultaneously; global LB or DNS weighted routing | Requires shared or replicated state (sessions, DB writes), conflict handling, consistent config across clusters |
| Active-passive | Primary cluster serves traffic; secondary on standby until failover | Simpler data model; lower cost; failover drill and RTO testing are mandatory |
| Active-passive (warm) | Standby runs scaled-down replicas; scale up on failover | Balance of cost and RTO—common for stateless tiers with external DB |
flowchart TB
subgraph mgmt["Management hub cluster"]
ACM["Red Hat ACM\npolicies + GitOps"]
OBS["Federated observability\nmetrics / logs / alerts"]
GIT["Git / Argo CD\nApplicationSet"]
end
subgraph prod["Production clusters"]
C1["Cluster A\nus-east active"]
C2["Cluster B\neu-west active"]
end
subgraph dr["DR / edge"]
C3["Cluster C\nus-west passive"]
C4["Cluster D\nedge HCP"]
end
GLB["Global LB / DNS\nactive-active or failover"]
GIT --> ACM
ACM -->|ManagedCluster CR| C1
ACM --> C2
ACM --> C3
ACM --> C4
OBS --> C1
OBS --> C2
OBS --> C3
GLB --> C1
GLB --> C2
C1 -.->|Velero / async repl| C3
$ kubectl config get-contexts
$ kubectl config use-context prod-us-east
$ kubectl get nodes -o wide
$ kubectl config use-context prod-eu-west
$ kubectl get deploy -A --field-selector metadata.namespace=team-payments

$ oc config get-contexts
$ oc login --token=<token> --server=https://api.hub.example:6443
$ oc get managedcluster
$ oc get managedcluster prod-us-east -o jsonpath='{.status.conditions[?(@.type=="ManagedClusterConditionAvailable")].status}'
One big cluster vs many small clusters: Fewer clusters mean simpler networking and lower ops overhead, but larger blast radius and harder compliance boundaries. Many clusters increase GitOps/policy complexity but isolate failures and let teams upgrade on different schedules.
Active-active without solving data consistency causes split-brain writes and duplicate charges. Stateless tiers can go active-active early; stateful tiers usually stay active-passive until you have multi-primary DB or CRDT-style semantics.
"When would you add a second cluster?" — Tie answer to HA/DR RTO/RPO, regulatory region, or etcd/API limits—not vanity. Mention hub-spoke governance (ACM), GitOps ApplicationSet for fan-out, and explicit active-active vs passive DR semantics.
Retail platforms run active-active stateless checkout in two regions with a single primary PostgreSQL (RDS cross-region read replica for DR). Black Friday traffic uses global LB; DR cluster stays warm with HPA minReplicas=1 on critical Deployments.
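The warm-standby idea can be sketched with a HorizontalPodAutoscaler on the DR cluster. The Deployment name, namespace, and thresholds below are hypothetical; the point is only that minReplicas: 1 keeps one pod running so that failover is a scale-up rather than a cold start.

```yaml
# Sketch for the DR cluster; names and numbers are illustrative assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout
  namespace: team-checkout
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 1               # warm standby: one replica stays up at all times
  maxReplicas: 20              # headroom for full production load after failover
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The DR cluster also needs enough node capacity (or a cluster autoscaler) to honor maxReplicas once traffic arrives.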
Red Hat Advanced Cluster Management (ACM)
ACM is Red Hat's multicluster control plane for OpenShift and Kubernetes. Install it on a hub cluster; import or provision managed clusters as spokes. ACM centralizes lifecycle, policy, application delivery, and observability without merging etcd into one mega-cluster.
Hub + managed clusters
The hub runs MultiClusterHub (ACM operator) and stores ManagedCluster objects—one per imported spoke. Agents on each spoke (klusterlet) sync status upward and apply hub-directed configuration. Import via generated bootstrap script, auto-import on ROSA/ARO, or provision new clusters with Hive (below).
- ManagedCluster: registration, labels (env, region, cost-center), availability conditions
- ManagedClusterSet: group clusters for RBAC and policy targeting
- Placement (successor to the deprecated PlacementRule): select clusters by labels for apps and policies
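As a rough sketch of how these objects fit together, the following binds a hypothetical cluster set named prod to the policies namespace and defines a Placement called prod-clusters that selects members labeled environment=prod. The set name and label values are assumptions for illustration.

```yaml
# Sketch only: "prod" and the label values are illustrative assumptions.
apiVersion: cluster.open-cluster-management.io/v1beta2
kind: ManagedClusterSetBinding
metadata:
  name: prod                   # must match the ManagedClusterSet name
  namespace: policies
spec:
  clusterSet: prod
---
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: prod-clusters
  namespace: policies
spec:
  clusterSets:
    - prod                     # only sets bound to this namespace are eligible
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            environment: prod
```

A Placement can only see cluster sets that have been bound into its namespace, which is what makes ManagedClusterSet a usable RBAC boundary.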
Hive lifecycle
Hive (bundled with ACM) provisions OpenShift clusters via ClusterDeployment CRs—cloud credentials, base domain, install-config, and worker pool sizes declared in YAML. Hive creates the install pod, waits for success, and hands off a registered ManagedCluster. Destroy by deleting the ClusterDeployment (with protection annotations for prod).
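A minimal ClusterDeployment might look like the sketch below. All names, the base domain, and the referenced Secrets and ClusterImageSet are hypothetical, and the protected-delete annotation should be checked against the Hive version in use.

```yaml
# Sketch only: every name here is an illustrative assumption.
apiVersion: hive.openshift.io/v1
kind: ClusterDeployment
metadata:
  name: prod-us-east
  namespace: prod-us-east
  annotations:
    hive.openshift.io/protected-delete: "true"   # assumption: verify for your Hive release
spec:
  clusterName: prod-us-east
  baseDomain: example.com
  platform:
    aws:
      region: us-east-1
      credentialsSecretRef:
        name: aws-creds                  # Secret holding cloud credentials
  provisioning:
    imageSetRef:
      name: openshift-image-set          # hypothetical ClusterImageSet
    installConfigSecretRef:
      name: prod-us-east-install-config  # Secret containing install-config.yaml
  pullSecretRef:
    name: pull-secret
```

Worker pool sizes come from the install-config or from separate Hive MachinePool objects that reference this ClusterDeployment.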
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: require-network-policies
  namespace: policies
  annotations:
    policy.open-cluster-management.io/standards: NIST SP 800-53
spec:
  remediationAction: enforce # inform | enforce
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: namespace-must-have-netpol
        spec:
          remediationAction: enforce
          severity: high
          namespaceSelector:
            include: ["team-*"]
          object-templates:
            - complianceType: musthave
              objectDefinition:
                apiVersion: networking.k8s.io/v1
                kind: NetworkPolicy
                metadata:
                  name: default-deny-ingress
                spec:
                  podSelector: {}
                  policyTypes:
                    - Ingress
---
apiVersion: policy.open-cluster-management.io/v1
kind: PlacementBinding
metadata:
  name: bind-netpol-policy
  namespace: policies
placementRef:
  name: prod-clusters
  kind: Placement
  apiGroup: cluster.open-cluster-management.io
subjects:
  - name: require-network-policies
    kind: Policy
    apiGroup: policy.open-cluster-management.io
Policy enforcement
ACM Policy objects wrap templates (ConfigurationPolicy, Gatekeeper constraints, CertificatePolicy, IamPolicy). Policies bind to clusters via PlacementBinding. remediationAction: inform reports violations in the console; enforce creates or remediates resources on spokes. The status of the Policy on the hub aggregates compliance per cluster.
ApplicationSet via ACM
ACM integrates with Argo CD (OpenShift GitOps) using multicluster ApplicationSets—generators read cluster labels from the hub and render one Argo Application per matching spoke. Push once to git; every prod cluster in region=us receives the same manifest with cluster-specific overlays.
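The fan-out described here can be sketched with the ApplicationSet clusterDecisionResource generator. This assumes a Placement named prod-clusters, the acm-placement ConfigMap that the ACM GitOps integration documents, and clusters already registered with Argo CD through a GitOpsCluster resource; the repository URL and paths are hypothetical.

```yaml
# Sketch only: repo URL, paths, and names are illustrative assumptions.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments
  namespace: openshift-gitops
spec:
  generators:
    - clusterDecisionResource:
        configMapRef: acm-placement      # tells Argo CD how to read PlacementDecisions
        labelSelector:
          matchLabels:
            cluster.open-cluster-management.io/placement: prod-clusters
        requeueAfterSeconds: 180
  template:
    metadata:
      name: 'payments-{{name}}'          # one Application per selected cluster
    spec:
      project: default
      source:
        repoURL: https://git.example.com/platform/payments.git
        targetRevision: main
        path: overlays/prod
      destination:
        server: '{{server}}'
        namespace: team-payments
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Because the generator follows placement decisions, adding the right label to a newly imported cluster is enough for it to receive the application.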
Federated observability
ACM observability addon deploys metrics collectors on spokes and forwards to Thanos/Prometheus on the hub, giving unified dashboards, alert routing, and SLO views across the fleet. Correlate with Red Hat Insights or your existing Grafana stack. Spoke telemetry stays label-scoped by ManagedCluster name for drill-down.
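Enabling the addon is typically a single custom resource on the hub. The sketch below assumes an object-storage Secret named thanos-object-storage (with a thanos.yaml key) already exists in the observability namespace; both names are assumptions.

```yaml
# Sketch only: the Secret name and key are illustrative assumptions.
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  observabilityAddonSpec: {}             # defaults for the collectors deployed to spokes
  storageConfig:
    metricObjectStorage:
      name: thanos-object-storage        # Secret with the Thanos object store config
      key: thanos.yaml
```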
# ACM is OpenShift-first; vanilla K8s uses open-cluster-management community
$ kubectl get managedcluster
$ kubectl get policy -A
$ kubectl get configurationpolicy -A

$ oc get multiclusterhub -n open-cluster-management
$ oc get managedcluster
$ oc get clusterdeployment -A
$ oc get policy -n policies
$ oc get placement -A
$ oc get multiclusterobservability
Install ACM from OperatorHub on a dedicated hub cluster (not a production workload cluster). Size the hub for the fleet you expect to manage (for example, 100+ managed clusters) and keep its etcd free of spoke workloads. The multicluster engine operator on its own covers cluster lifecycle; full ACM adds policy, GitOps, and observability.
The klusterlet on each spoke registers with the hub and keeps its ManagedCluster status current. The policy framework propagates hub Policy CRs to the selected spokes, where local controllers evaluate the templates, and compliance status flows back to the hub Policy object. There is no central kube-apiserver proxy; each spoke retains its own API server for workload isolation.
Label managed clusters at import: environment=prod, region=us-east, pci=true. Placement rules select on these labels—avoid hardcoding cluster names in ApplicationSet generators.
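For illustration, a labeled ManagedCluster might look like the sketch below. The cluster name and the cluster set value are hypothetical; the three custom labels are the ones suggested above.

```yaml
# Sketch only: cluster name and set membership are illustrative assumptions.
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: prod-us-east
  labels:
    environment: prod
    region: us-east
    pci: "true"                                      # label values must be strings
    cluster.open-cluster-management.io/clusterset: prod
spec:
  hubAcceptsClient: true
```

Labels can also be added after import with oc label managedcluster prod-us-east pci=true, which takes effect on the next placement evaluation.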
Hub credentials are crown jewels—protect hub RBAC, enable audit logging, and rotate import tokens. Use ManagedClusterSet RBAC so team A cannot view or policy-bind team B's clusters.
Start policies in inform mode for two release cycles—fix drift on spokes before switching to enforce. Sudden enforce on NetworkPolicy can drop legitimate traffic.
OpenShift Hosted Control Planes (HCP)
Traditional OpenShift runs the control plane (API server, etcd, controllers) on dedicated master nodes inside each cluster. Hosted control planes move that control plane into pods on a separate management / hosting cluster—workers remain on the data plane cluster, but masters are virtualized.
Control plane as pods on the management cluster
Each hosted cluster gets a namespace (or dedicated segment) on the management cluster containing etcd, kube-apiserver, kube-controller-manager, and OpenShift-specific operators as Deployments/StatefulSets. Worker nodes are declared through NodePool objects that reference the HostedCluster; they never run etcd locally. Blast radius: compromise of one worker does not expose etcd on the same node.
HyperShift
HyperShift is the upstream open source project implementing hosted control planes for OpenShift. The HyperShift operator on the management cluster reconciles HostedCluster and NodePool objects, scaling worker pools independently of control plane count. Supported on AWS, Azure, GCP, bare metal (provider matrix evolves per OCP release).
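A NodePool for a hosted cluster could be sketched as follows. The hosted cluster name tenant-a, the clusters namespace, the instance type, and the release image tag are all assumptions; real NodePools on AWS also carry subnet and instance profile settings that are omitted here.

```yaml
# Sketch only: names, instance type, and release tag are illustrative assumptions.
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: tenant-a-workers
  namespace: clusters
spec:
  clusterName: tenant-a                  # HostedCluster in the same namespace
  replicas: 3                            # scale workers without touching the control plane
  management:
    upgradeType: Replace                 # roll nodes by replacing instances
    autoRepair: true
  platform:
    type: AWS
    aws:
      instanceType: m5.xlarge
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.0-x86_64   # hypothetical version
```

Changing replicas on this object is the hosted-cluster equivalent of resizing a worker MachineSet on a traditional installation.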
ROSA HCP
Red Hat OpenShift Service on AWS (ROSA) with HCP is the managed offering: Red Hat operates the management cluster; customers get fast cluster create/delete, per-cluster billing, and fewer EC2 instances (no 3+ dedicated masters per cluster). Ideal for SaaS multi-tenancy, ephemeral preview environments, and edge-style footprints.
Cost & speed benefits
| Dimension | Traditional OCP | Hosted control planes |
|---|---|---|
| Control plane cost | 3+ large instances per cluster 24/7 | Shared management pool amortized across many hosted clusters |
| Cluster create time | ~45–60 min full install | Often ~10–15 min—control plane pods schedule immediately |
| Density | One control plane per cluster | Hundreds of hosted clusters on one management cluster (within limits) |
| Upgrade model | CVO per cluster | Management operator rolls control plane versions; NodePools can stagger workers |
flowchart TB
subgraph mgmt["Management cluster (HyperShift)"]
HSO["HyperShift operator"]
subgraph hc["Namespace: hosted-cluster-1"]
ETCD["etcd StatefulSet"]
API["kube-apiserver"]
CPO["Cluster operators\nas pods"]
end
end
subgraph workers["Data plane — hosted cluster 1"]
NP["NodePool workers"]
WL["User workloads"]
end
HSO --> hc
API --> NP
NP --> WL
# HyperShift CRs live on management cluster; use hosted kubeconfig for workloads
$ kubectl get hostedcluster -A
$ kubectl get nodepool -A
$ kubectl get pods -n clusters-hosted-cluster-1

$ oc get hostedcluster -A
$ oc get nodepool -A
# hypershift CLI; real invocations also need pull secret, credentials, and base domain flags
$ hypershift create cluster aws --name tenant-a --node-pool-replicas 3
$ hypershift create kubeconfig --name tenant-a > tenant-a.kubeconfig
# ROSA HCP uses the rosa CLI
$ rosa create cluster --hosted-cp --region us-east-1
HCP vs traditional: HCP wins on density and provisioning speed; traditional wins when you need full isolation of control plane hardware, air-gapped installs, or providers without HCP support. Management cluster failure affects many hosted clusters—design HA management tier first.
Under-provisioned management clusters cause API latency across all hosted clusters. Monitor etcd and apiserver SLOs on the management plane separately from tenant workload metrics.
SaaS vendors on ROSA HCP spin up one hosted cluster per enterprise customer—delete the HostedCluster when the trial ends without orphaned master EC2 instances.
Explain HCP as "control plane virtualization"—compare to nested virtualization vs bare metal masters. Mention HyperShift operator, NodePool worker scaling, and why ROSA HCP reduces per-cluster infrastructure cost.
Cluster API (CAPI)
Cluster API is a Kubernetes-native project for declarative cluster lifecycle—create, scale, upgrade, and delete clusters using CRDs and controllers, the same reconciliation model as Deployments for pods. Infrastructure providers plug in AWS, Azure, GCP, vSphere, and bare metal.
Declarative cluster lifecycle
CAPI separates concerns into core, bootstrap, control plane, and infrastructure providers:
- Cluster: links control plane + infrastructure; carries the Kubernetes version itself only when a ClusterClass topology is used
- Machine: one node; created/deleted by controllers
- MachineDeployment: desired worker count, worker version, rollout strategy (like Deployment → ReplicaSet → Pod)
- MachineSet: immutable worker template snapshot
- KubeadmControlPlane: managed control plane nodes on vanilla K8s installs; sets the control plane version
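To show how the Cluster object ties these pieces together, here is a minimal sketch for an AWS-backed cluster. The object names and pod CIDR are hypothetical, and the provider API versions assume a recent CAPA release.

```yaml
# Sketch only: names and CIDR are illustrative assumptions.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-us-east
  namespace: cluster-prod
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:                       # who manages the control plane nodes
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: prod-us-east-control-plane
  infrastructureRef:                     # provider-specific cluster infrastructure
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: prod-us-east
```

The core Cluster stays provider-neutral; swapping AWSCluster for another provider's kind is what moves the same shape to a different cloud.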
Providers: AWS / Azure / GCP / vSphere
| Provider | Infrastructure CRDs | Notes |
|---|---|---|
| CAPA (AWS) | AWSCluster, AWSMachine | EC2 instances, ELB for API; common on EKS-adjacent self-managed clusters |
| CAPZ (Azure) | AzureCluster, AzureMachine | VMSS, Azure LB; integrates with Azure AD workload identity patterns |
| CAPG (GCP) | GCPCluster, GCPMachine | GCE instances, GCP load balancers |
| CAPV (vSphere) | VSphereCluster, VSphereMachine | On-prem standard—VM templates, vCenter credentials via Secret |
MachineDeployment
Scale workers by editing spec.replicas on MachineDeployment. Rolling upgrades change the MachineTemplate reference—controller creates new Machines, drains old nodes, respects maxSurge / maxUnavailable. Same mental model as Deployment rollouts, but nodes instead of pods.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: prod-workers
  namespace: cluster-prod
spec:
  clusterName: prod-us-east
  replicas: 5
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: prod-us-east
      pool: workers
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: prod-us-east
        pool: workers
    spec:
      clusterName: prod-us-east
      version: v1.29.4
      bootstrap:
        configRef:
          name: prod-workers-bootstrap
          kind: KubeadmConfigTemplate
          # v1beta1 object references take apiVersion rather than apiGroup
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
      infrastructureRef:
        name: prod-workers-infra
        kind: AWSMachineTemplate
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
GitOps compatible
Store Cluster, MachineDeployment, and provider templates in git; Argo CD or Flux on the management cluster reconciles them. ClusterClass (CAPI v1beta1) defines reusable templates—platform teams expose a small set of approved cluster shapes; app teams request via PR. Hive/ACM overlap: ACM uses Hive for OpenShift; CAPI is cloud-neutral and upstream-K8s-first.
$ clusterctl init --infrastructure aws
$ kubectl get clusters -A
$ kubectl get machinedeployment -A
$ kubectl get machines -A
# migrate CAPI objects to a new management cluster
$ clusterctl move --to-kubeconfig target-mgmt.kubeconfig

# OpenShift uses Hive/ACM for OCP lifecycle; CAPI common for upstream K8s on OCP management
$ oc get clusters.cluster.x-k8s.io -A
$ oc get machinedeployment -A
$ oc describe cluster prod-us-east -n cluster-prod
CAPI controllers run on the management cluster, not on the workload cluster being created. The bootstrap provider (kubeadm) generates cloud-init/ignition; the infra provider creates VMs and joins them. clusterctl installs provider components and handles version compatibility matrices.
Pin provider versions in clusterctl.yaml and test upgrades on a sandbox management cluster first. Use MachineHealthCheck to auto-remediate unhealthy nodes; it is the node-level analogue of a liveness probe restarting a failed container.
Deleting a Cluster CR without finalizer awareness can orphan cloud VMs that keep costing money. Always verify that kubectl get awscluster (or the provider equivalent) no longer returns the object; use cloud tags for cleanup automation.
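A MachineHealthCheck for a worker pool could be sketched like this. The cluster name, label selector, and timeouts are assumptions chosen to match the worker naming used elsewhere in this chapter.

```yaml
# Sketch only: names and timeouts are illustrative assumptions.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: prod-workers-unhealthy
  namespace: cluster-prod
spec:
  clusterName: prod-us-east
  maxUnhealthy: 40%                      # stop remediating if too many nodes fail at once
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      pool: workers
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 5m
    - type: Ready
      status: Unknown
      timeout: 5m
```

The maxUnhealthy guard matters: without it, a network partition that makes every node look unhealthy could trigger replacement of the whole pool.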
For OpenShift on AWS, many teams choose ROSA/ROSA HCP over self-managed CAPI+OCP install—CAPI shines when you need upstream Kubernetes uniformity across clouds with one GitOps repo.
Service Mesh (Istio / OpenShift Service Mesh)
A service mesh adds L7 traffic management, security, and observability via data-plane proxies (Envoy) colocated with workloads. Control plane (istiod / OSSM operator) pushes config; sidecars intercept east-west and ingress traffic without app code changes.
Sidecar injection & Envoy
Enable injection per namespace (istio-injection=enabled or OSSM labels). The mutating webhook adds an Envoy sidecar container to each pod—iptables/eBPF redirects traffic through Envoy. Sidecars handle retries, timeouts, mTLS, and telemetry export. Cost: ~50–100MB RAM per sidecar; plan node capacity accordingly.
VirtualService / DestinationRule / Gateway
- VirtualService: route rules that match URI/headers, split traffic by weight (canary), and set retries, timeouts, fault injection
- DestinationRule: policies applied after routing, such as subsets (v1/v2), load balancing (LEAST_REQUEST), connection pool, outlier detection
- Gateway: north-south entry (often with OpenShift Route or cloud LB); binds external ports to internal VirtualServices
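As a sketch of the north-south piece, the Gateway below terminates TLS for a hypothetical hostname on the default ingress gateway pods. The hostname, the TLS Secret name, and the istio: ingressgateway selector are assumptions that depend on how the ingress gateway was installed.

```yaml
# Sketch only: hostname, secret, and selector are illustrative assumptions.
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: payments-gateway
  namespace: team-payments
spec:
  selector:
    istio: ingressgateway                # which gateway pods serve this config
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE                     # terminate TLS at the gateway
        credentialName: payments-tls     # TLS Secret readable by the gateway
      hosts:
        - payments.example.com
```

A VirtualService attaches to it by listing payments-gateway under its gateways field and the same hostname under hosts.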
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: payments-api
  namespace: team-payments
spec:
  host: payments-api.team-payments.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: payments-api
  namespace: team-payments
spec:
  hosts:
    - payments-api.team-payments.svc.cluster.local
  http:
    - route:
        - destination:
            host: payments-api
            subset: v1
          weight: 90
        - destination:
            host: payments-api
            subset: v2
          weight: 10
      retries:
        attempts: 3
        perTryTimeout: 2s
mTLS
Istio/OSSM issues workload certificates via the control plane CA. PeerAuthentication sets mesh-wide or namespace STRICT mTLS—plaintext pod-to-pod traffic rejected. DestinationRule.tls.mode: ISTIO_MUTUAL ensures client sidecars present valid certs. Start with PERMISSIVE during migration, then enforce STRICT.
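The PERMISSIVE-then-STRICT migration comes down to one field. This sketch applies STRICT to a single namespace; placing the same object in the control plane's root namespace would make it mesh-wide. It uses the v1beta1 API for broad compatibility (newer Istio releases also serve v1).

```yaml
# Sketch only: start with mode PERMISSIVE, then switch to STRICT once all clients have sidecars.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: team-payments
spec:
  mtls:
    mode: STRICT                         # reject plaintext traffic to workloads in this namespace
```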
OSSM vs upstream Istio (Maistra)
OpenShift Service Mesh (OSSM) is Red Hat's supported distribution built on upstream Istio, historically via the Maistra fork (OpenShift 4.x integration layer). OSSM adds:
- Operator lifecycle tied to OCP versions and Red Hat support
- Integration with OpenShift Routes, SCCs, and platform monitoring
- Curated Istio version + CVE backports between upstream releases
- A supported installation path: installing upstream Istio directly is possible but unsupported on OCP for production
ServiceMeshMemberRoll
On OpenShift, the control plane typically lives in istio-system, while the operators themselves run in openshift-operators. Tenant namespaces join the mesh via ServiceMeshMemberRoll, which lists the namespaces allowed to use the shared control plane and receive sidecar injection.
apiVersion: maistra.io/v1
kind: ServiceMeshMemberRoll
metadata:
  name: default
  namespace: istio-system
spec:
  members:
    - team-payments
    - team-checkout
    - team-identity
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    istio-injection: enabled
Kiali + Jaeger bundled
OSSM ships optional addons: Kiali (service graph, health, config validation) and Jaeger (distributed tracing for mesh spans). Enable via ServiceMeshControlPlane / Istio CR addon components. Traces complement Prometheus RED metrics from Envoy—see retry storms and upstream 503 sources in one view.
flowchart LR
subgraph cp["Control plane"]
ISTIOD["istiod / OSSM operator"]
KIALI["Kiali"]
JAEG["Jaeger"]
end
subgraph ns["Namespace: team-payments"]
P1["payments-api\n+ Envoy sidecar"]
P2["ledger-svc\n+ Envoy sidecar"]
end
GW["Gateway / Route"]
ISTIOD --> P1
ISTIOD --> P2
GW --> P1
P1 -->|mTLS| P2
P1 --> JAEG
P2 --> JAEG
KIALI --> P1
$ kubectl get ns --show-labels | grep istio-injection
$ kubectl label namespace team-payments istio-injection=enabled --overwrite
$ kubectl get virtualservice,destinationrule,gateway -n team-payments
$ kubectl exec deploy/payments-api -n team-payments -c istio-proxy -- pilot-agent request GET config_dump
$ istioctl analyze -n team-payments

$ oc get smcp -n istio-system
$ oc get servicemeshmemberroll -n istio-system -o yaml
$ oc get servicemeshmember -A
$ oc get route -n istio-system kiali
$ oc get route -n istio-system jaeger
# only if sidecar UID conflicts; prefer restricted SCC + proper securityContext
$ oc adm policy add-scc-to-user anyuid -z default -n team-payments
Install OSSM via OperatorHub (the Red Hat OpenShift Service Mesh Operator, plus the Kiali and distributed tracing operators). Use a ServiceMeshControlPlane v2 CR rather than a manual istioctl install on production OCP. Add namespaces to the ServiceMeshMemberRoll before enabling injection; OSSM 2.x generally expects workloads to opt in with sidecar.istio.io/inject: "true" on the pod template rather than relying on a namespace label alone.
Move to PeerAuthentication: STRICT after validating all clients use sidecars or mesh gateways. AuthorizationPolicy CRs enforce L7 allow/deny (JWT claims, paths)—network policy alone is insufficient for HTTP semantics.
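An L7 allow rule can be sketched as below. It assumes ledger-svc pods carry the label app: ledger-svc and that payments-api runs under a service account of the same name; the path prefix is hypothetical.

```yaml
# Sketch only: labels, service account, and paths are illustrative assumptions.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ledger-allow-payments
  namespace: team-payments
spec:
  selector:
    matchLabels:
      app: ledger-svc                    # policy applies to these workloads
  action: ALLOW
  rules:
    - from:
        - source:
            # mTLS identity of the caller (SPIFFE-style principal)
            principals: ["cluster.local/ns/team-payments/sa/payments-api"]
      to:
        - operation:
            methods: ["GET", "POST"]
            paths: ["/ledger/*"]
```

Once any ALLOW policy selects a workload, requests that match no rule are denied, so roll this out after confirming the caller list is complete. Principal matching depends on mTLS being in effect between the two services.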
Headless Services, raw TCP databases, and Jobs that ignore sidecar shutdown can hang with Istio sidecars. Use holdApplicationUntilProxyStarts and sidecar exit hooks, or exclude ports via annotation traffic.sidecar.istio.io/excludeOutboundPorts.
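One way to apply these mitigations to a batch Job is sketched below. The image, the export binary, and the database port are hypothetical, and the shutdown call assumes curl is present in the image and that the sidecar agent exposes its quitquitquit endpoint on port 15020.

```yaml
# Sketch only: image, command, and port are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: ledger-export
  namespace: team-payments
spec:
  template:
    metadata:
      annotations:
        # delay the app container until Envoy is ready
        proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
        # let database traffic bypass the sidecar
        traffic.sidecar.istio.io/excludeOutboundPorts: "5432"
    spec:
      restartPolicy: Never
      containers:
        - name: export
          image: registry.example.com/ledger-export:1.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              /app/export; rc=$?
              # ask the sidecar to exit so the Job pod can complete
              curl -fsS -X POST http://localhost:15020/quitquitquit
              exit $rc
```

Without the final shutdown call, the pod stays Running after the export finishes because the sidecar container never exits. Clusters that run the proxy as a native sidecar container may not need this workaround.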
Mesh vs ingress + network policy: Mesh adds mTLS, L7 routing, and deep telemetry at CPU/RAM cost. Simple north-south APIs may need only Ingress/Route + NetworkPolicy; mesh pays off with many east-west microservices and canary requirements.
Draw data plane (Envoy sidecar) vs control plane (istiod). Explain VirtualService vs DestinationRule split. On OCP, mention ServiceMeshMemberRoll and why OSSM/Maistra exists (support matrix, SCC, Routes).
Run istioctl analyze (or Kiali validation) in CI on mesh YAML before merge; misconfigured VirtualServices fail silently with unexpected 404/503 at runtime.