Multi-Cluster Patterns

A single cluster is simpler to operate. Multiple clusters add complexity — more control planes, more kubeconfigs, cross-cluster networking to solve. But beyond a certain scale or compliance requirement, multiple clusters become necessary. This guide covers when to split, the common patterns, and the tooling that makes fleet management tractable.

Why Multiple Clusters

| Driver | Why a separate cluster |
| --- | --- |
| Environment isolation | Dev, staging, and prod run on separate clusters, so a misconfigured RBAC rule in dev doesn't touch prod. |
| Blast radius | A cluster-level incident (etcd corruption, bad upgrade) affects only that cluster's workloads. |
| Compliance / data residency | GDPR or other regulatory requirements mandate that workloads and data stay in a specific region or jurisdiction. |
| Multi-region availability | Run active clusters in us-east-1 and eu-west-1; route traffic to the healthy region during outages. |
| Team autonomy | Large orgs give platform teams their own cluster to avoid shared control plane contention. |
| Specialised hardware | GPU clusters for ML teams; ARM clusters for cost savings; FIPS clusters for defence workloads. |

Common Topologies

[Figure: Multi-cluster topologies, hub-and-spoke vs. active-active]

Hub-and-Spoke (GitOps): one management cluster running ArgoCD, Flux, or Rancher deploys to many member clusters (prod-us, prod-eu, staging) via GitOps. It gives a single pane for deploys; the management cluster pushes manifests to the member clusters.

Active-Active (Multi-Region): multiple independent clusters (one in us-east-1, one in eu-west-1) sit behind a global load balancer such as AWS Route 53 or Cloudflare, for HA and low-latency routing. Both serve traffic; health-check failover routes to the healthy region if one goes down.

Fleet Management

| Tool | Model | Best for |
| --- | --- | --- |
| Cluster API (CAPI) | Declarative cluster lifecycle: create, upgrade, and delete clusters via Kubernetes CRDs. | Homogeneous self-managed fleets. Infrastructure-as-code for clusters. |
| ArgoCD ApplicationSet | One ApplicationSet generates an ArgoCD Application per cluster from a template. | GitOps deployment to many clusters simultaneously. |
| Flux + Cluster API | Flux manages manifests; CAPI manages the clusters themselves. | Full lifecycle GitOps, with both cluster and app config in git. |
| Rancher / Rancher MCM | UI + API for multi-cluster management, policies, and monitoring. | Ops teams needing a UI; mixed cloud environments. |
ArgoCD ApplicationSet — deploy to all prod clusters
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: myapp
  namespace: argocd
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: production        # targets all clusters labelled environment=production
  template:
    metadata:
      name: "myapp-{{name}}"            # {{name}} = cluster name
    spec:
      project: default
      source:
        repoURL: https://github.com/myorg/gitops
        targetRevision: main
        path: apps/myapp/overlays/production
      destination:
        server: "{{server}}"            # {{server}} = cluster API endpoint
        namespace: myapp
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
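
The clusters generator selects on the labels of the Secrets that Argo CD uses to register member clusters, so the selector only matches clusters whose Secret carries the label. A minimal sketch of such a Secret follows. The cluster name prod-us comes from the topology above; the Secret name and server URL are placeholders assumed for illustration, and the credentials are deliberately left out.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-us                        # hypothetical Secret name
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster    # marks this Secret as a cluster registration
    environment: production                    # the label the ApplicationSet selector matches
type: Opaque
stringData:
  name: prod-us                                # exposed to the template as {{name}}
  server: https://prod-us.example.com:6443     # placeholder; exposed as {{server}}
  # Credentials (for example bearerToken or caData) are omitted from this sketch.
  config: |
    {
      "tlsClientConfig": { "insecure": false }
    }
```

Adding or removing the environment label on a cluster Secret is then enough to bring that cluster into or out of the generated set of Applications.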

GitOps Multi-Cluster

The standard repo structure for multi-cluster GitOps: one folder per cluster or per environment, with a base and per-cluster overlays.

gitops-repo/
├── apps/
│   └── myapp/
│       ├── base/             ← shared manifests
│       │   ├── deployment.yaml
│       │   └── service.yaml
│       └── overlays/
│           ├── production-us/
│           │   └── kustomization.yaml  ← us-specific patches (replicas, resources)
│           └── production-eu/
│               └── kustomization.yaml  ← eu-specific patches
└── clusters/
    ├── prod-us/              ← ArgoCD/Flux cluster config
    └── prod-eu/
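
As a sketch of what one of the overlay files in this layout might contain, the fragment below shows a possible production-us kustomization.yaml. It assumes that base/ also holds its own kustomization.yaml listing deployment.yaml and service.yaml (not shown in the tree), and that the Deployment is named myapp; the replica count is purely illustrative.

```yaml
# apps/myapp/overlays/production-us/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base          # shared manifests; base needs its own kustomization.yaml
replicas:
  - name: myapp         # assumed Deployment name in base
    count: 6            # illustrative us-specific replica count
```

With per-cluster overlays like these, whatever deploys the app has to point each cluster at its own overlay directory, for example by deriving the path from a cluster name or label rather than hard-coding a single overlay.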

Cross-Cluster Service Discovery

Kubernetes DNS (svc.cluster.local) is cluster-scoped — it cannot resolve services in other clusters. Options for cross-cluster service communication:

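| Approach | How it works |
| --- | --- |
| External DNS + LoadBalancer | Each cluster exposes services via LoadBalancer; DNS resolves to the other cluster's LB IP. Simple but adds latency and LB cost. |
| Submariner | Connects pod and Service networks across clusters through encrypted tunnels (IPsec by default). Exported Services resolve cross-cluster under svc.clusterset.local; svc.cluster.local stays cluster-local. |
| Istio multi-cluster | East-west gateway between clusters. Service mesh spans multiple clusters; mTLS transparent to apps. |
| Multi-Cluster Services API (ServiceExport / ServiceImport) | Kubernetes SIG Multicluster API: export a Service from one cluster and import it into others. It is an API separate from KubeFed and needs an implementation, such as Submariner, installed in the clusters. |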
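
To make the ServiceExport approach concrete, here is a minimal sketch of exporting the myapp Service. It assumes a Multi-Cluster Services implementation is already installed in the participating clusters and that a Service named myapp exists in the myapp namespace.

```yaml
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: myapp        # must match the name of the Service being exported
  namespace: myapp   # must match the Service's namespace
```

Under that API, clients in the other clusters would then reach the Service at myapp.myapp.svc.clusterset.local, while myapp.myapp.svc.cluster.local keeps resolving to the local cluster only.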

Traffic Splitting & Failover

AWS Route 53 — weighted multi-cluster routing
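# Route 70% of traffic to the us-east-1 cluster, 30% to eu-west-1.
# ZONE_ID and the ALB DNS names are placeholders. The HostedZoneId inside each
# AliasTarget is the load balancer's regional zone ID, not your own hosted zone.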
aws route53 change-resource-record-sets --hosted-zone-id ZONE_ID \
  --change-batch '{
    "Changes": [
      {"Action":"UPSERT","ResourceRecordSet":{
        "Name":"api.example.com","Type":"A",
        "SetIdentifier":"us-east-1",
        "Weight":70,
        "AliasTarget":{"DNSName":"us-east-1-alb.amazonaws.com","EvaluateTargetHealth":true,"HostedZoneId":"Z35SXDOTRQ7X7K"}
      }},
      {"Action":"UPSERT","ResourceRecordSet":{
        "Name":"api.example.com","Type":"A",
        "SetIdentifier":"eu-west-1",
        "Weight":30,
        "AliasTarget":{"DNSName":"eu-west-1-alb.amazonaws.com","EvaluateTargetHealth":true,"HostedZoneId":"Z32O12XQLNTSW2"}
      }}
    ]
  }'

kubeconfig & Context Management

# Merge multiple kubeconfigs. The redirect overwrites ~/.kube/config, so back it up first.
cp ~/.kube/config ~/.kube/config.bak
KUBECONFIG=~/.kube/prod-us.yaml:~/.kube/prod-eu.yaml:~/.kube/staging.yaml \
  kubectl config view --flatten > ~/.kube/config

# List all contexts
kubectl config get-contexts

# Switch context
kubectl config use-context prod-us

# Run a command against a specific context without switching
kubectl --context=prod-eu get pods -n myapp

# Install kubectx for fast switching
brew install kubectx
kubectx prod-eu    # switch cluster
kubens myapp       # switch namespace
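
For orientation, the merged file these commands operate on looks roughly like the sketch below. Each context is a named pairing of a cluster, a user, and an optional default namespace. The server URLs and user names are placeholders assumed for illustration, and credentials are omitted.

```yaml
apiVersion: v1
kind: Config
current-context: prod-us                         # set by `kubectl config use-context` or kubectx
clusters:
  - name: prod-us
    cluster:
      server: https://prod-us.example.com:6443   # placeholder endpoint
  - name: prod-eu
    cluster:
      server: https://prod-eu.example.com:6443   # placeholder endpoint
users:
  - name: prod-us-admin                          # hypothetical user; credentials omitted
    user: {}
  - name: prod-eu-admin                          # hypothetical user; credentials omitted
    user: {}
contexts:
  - name: prod-us
    context:
      cluster: prod-us
      user: prod-us-admin
      namespace: myapp                           # default namespace; the field kubens changes
  - name: prod-eu
    context:
      cluster: prod-eu
      user: prod-eu-admin
```

Passing --context on a single command selects one of these entries for that invocation without touching current-context.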