
Backup & Disaster Recovery


A Kubernetes cluster has two things to protect: the control plane state (etcd — all API objects, Secrets, ConfigMaps, RBAC, CRDs) and the workload state (persistent volume data, application-level databases). Losing either without a recovery plan means starting from scratch. Most teams don't discover their DR plan is broken until they need it.

What to Back Up

Kubernetes backup targets — control plane vs. workload data

CONTROL PLANE (etcd)

• All K8s API objects
• Secrets & ConfigMaps
• RBAC roles & bindings
• CRDs and CR instances
• Service accounts

Tool: etcdctl snapshot or managed-cluster backup API

WORKLOAD DATA

• PersistentVolume contents
• Database files (Postgres, MySQL)
• Object storage (MinIO, S3)
• Stateful app data

Tool: Velero + volume snapshots, or app-level dumps

With GitOps: K8s manifests are already in git — you only need etcd backup for runtime state (Secrets, lease objects) and Velero for PV data.

etcd holds all Kubernetes API state. Velero backs up workload manifests and volume snapshots. Both are needed for full recovery.

etcd Snapshots

For self-managed clusters, etcd snapshots are the control plane backup. On managed clusters (EKS, GKE, AKS), the provider operates etcd and handles its durability, but you typically get no direct access to its snapshots, so you cannot use them to restore a specific object. For object-level recovery on managed clusters, rely on Velero or GitOps.

etcd snapshot — take and verify

# Take a snapshot (run on a control-plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

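# Verify the snapshot (use the file name the save command produced;
# on etcd 3.5+ `etcdutl snapshot status` replaces this deprecated etcdctl form)
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-20240115-100000.db \
  --write-out=table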

# Output:
# HASH       REVISION  TOTAL KEYS  TOTAL SIZE
# abc12345   87432     1203        4.2 MB
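Timestamped snapshots accumulate until the backup disk fills, so a scheduled snapshot job usually prunes old files after a successful save. The following is a minimal sketch, not a definitive implementation. It assumes the `etcd-YYYYmmdd-HHMMSS.db` naming from the save command above, so that lexical order equals chronological order. `prune_snapshots` and the retention count are hypothetical names and values chosen for illustration.

```shell
# Keep only the newest N snapshots in a directory.
# $1 = snapshot directory, $2 = number of snapshots to keep
prune_snapshots() {
  snap_dir=$1
  keep=$2
  # Glob results are sorted, so the oldest snapshot comes first.
  set -- "$snap_dir"/etcd-*.db
  # If nothing matched, the pattern stays literal and there is nothing to prune.
  [ -e "$1" ] || return 0
  # Delete from the oldest end until only $keep files remain.
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"
    shift
  done
}

# Example (hypothetical retention of 14 snapshots), run after a successful save:
#   prune_snapshots /backup 14
```

Copy snapshots off the node as well; a snapshot stored only on the control-plane node is lost together with that node.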

etcd restore — recover a control plane

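# Stop kube-apiserver (move its static pod manifest out)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# Restore the snapshot to a new data directory
# (on etcd 3.5+ `etcdutl snapshot restore` replaces this deprecated etcdctl form)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-20240115-100000.db \
  --data-dir=/var/lib/etcd-restore \
  --name=master-1 \
  --initial-cluster=master-1=https://192.168.1.10:2380 \
  --initial-advertise-peer-urls=https://192.168.1.10:2380

# Point the etcd static pod at the restored data dir.
# Edit /etc/kubernetes/manifests/etcd.yaml and change the hostPath of the
# etcd data volume (named etcd-data on kubeadm clusters) to:
#   path: /var/lib/etcd-restore
# Leave the --data-dir flag alone: it is a path inside the container,
# and the container only sees what the hostPath volume mounts there.
# The kubelet restarts etcd when the manifest changes.

# Move kube-apiserver manifest back
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/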

⚠️

etcd restore is destructive

Restoring an etcd snapshot rolls back ALL cluster state to the snapshot point — including Deployments, Secrets, RBAC, and CRDs. Any changes made after the snapshot are lost. Always test restore in a non-production cluster before you need it in production.

Velero — Workload Backup

Velero backs up Kubernetes API objects and PersistentVolume data to object storage (S3, GCS, Azure Blob). It can restore to the same cluster or a different one — making it a migration tool as well.

install Velero with AWS S3 backend

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups-mycluster \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero    # AWS credentials file

scheduled backup — nightly, keep 30 days

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-full
  namespace: velero
spec:
  schedule: "0 2 * * *"          # 02:00 UTC daily
  template:
    ttl: 720h                     # 30 days
    storageLocation: default
    volumeSnapshotLocations:
    - default
    includedNamespaces:
    - production
    - staging
    snapshotVolumes: true         # snapshot PVs with CSI or cloud snapshots
    defaultVolumesToFsBackup: false  # use volume snapshots, not file-level backup
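The same schedule can also be created from the CLI. The sketch below assumes the `velero` client is installed and pointed at the cluster. `days_to_ttl` and `create_nightly_schedule` are hypothetical helper names, added only to make the relationship between "30 days" and `720h` explicit.

```shell
# Convert a retention period in days to the hour-based duration used for ttl.
# $1 = retention in days
days_to_ttl() {
  echo "$(( $1 * 24 ))h"
}

# Create the nightly schedule shown in the manifest above.
# Defined as a function so that sourcing this file does not call velero.
create_nightly_schedule() {
  velero schedule create nightly-full \
    --schedule "0 2 * * *" \
    --ttl "$(days_to_ttl 30)" \
    --include-namespaces production,staging \
    --snapshot-volumes
}
```

Keeping the Schedule as a manifest in git is usually preferable in a GitOps setup; the CLI form is handy for one-off or test clusters.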

Velero backup and restore commands

# Take an on-demand backup
velero backup create pre-upgrade-backup \
  --include-namespaces production \
  --snapshot-volumes

# Check backup status
velero backup describe pre-upgrade-backup
velero backup logs pre-upgrade-backup

# List available backups
velero backup get

# Restore the production namespace into a different namespace (production-restored)
velero restore create --from-backup pre-upgrade-backup \
  --include-namespaces production \
  --namespace-mappings production:production-restored

Restore Procedures

Different failure scenarios require different recovery approaches:

Scenario	Recovery approach
Accidental resource deletion	`velero restore` a recent backup. For GitOps clusters, `kubectl apply` the manifest from git.
Namespace data corruption	`velero restore` specific namespace from last known-good backup.
etcd corruption (self-managed)	Restore etcd snapshot. Restart kube-apiserver. Verify cluster state.
Control plane node lost	Replace node, rejoin to etcd cluster, or restore from snapshot on a new node.
Full cluster loss	Provision new cluster → restore Velero backup → update DNS. RTO depends on cluster provisioning time.

RTO & RPO

Set realistic targets before an incident, not during one:

• RPO (Recovery Point Objective): how much data loss is acceptable? A 6-hour backup schedule means up to 6 hours of data can be lost. For databases, shrink that window by using continuous WAL archiving alongside volume snapshots.
• RTO (Recovery Time Objective): how long can you be down? Cluster provisioning on EKS takes about 15 minutes. Add Velero restore time, which depends on volume size. Full recovery from total loss is rarely under 30 minutes.
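Both targets reduce to simple arithmetic, which is worth writing down before an incident. The sketch below is a rough estimate under stated assumptions: backups are evenly spaced, restore time is dominated by copying volume data, and the throughput figure is something you measure in a drill, not a number from this article. The function names are hypothetical.

```shell
# Worst-case RPO in hours for evenly spaced backups.
# $1 = backups per day (assumed to divide 24 evenly)
worst_case_rpo_hours() {
  echo "$(( 24 / $1 ))"
}

# Rough RTO in minutes: cluster provisioning plus volume restore time.
# $1 = provisioning time in minutes
# $2 = total volume data in GiB
# $3 = measured restore throughput in MiB/s (an assumption you must supply)
estimate_rto_minutes() {
  restore_seconds=$(( $2 * 1024 / $3 ))
  restore_minutes=$(( (restore_seconds + 59) / 60 ))   # round up to whole minutes
  echo "$(( $1 + restore_minutes ))"
}
```

For example, `worst_case_rpo_hours 4` gives 6, matching the 6-hour schedule above, and `estimate_rto_minutes 15 500 100` gives 101: 15 minutes of provisioning plus about 86 minutes to restore 500 GiB at 100 MiB/s.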

💡

The fastest DR is not needing DR

Multi-region active-active with a global load balancer and database replication has near-zero RTO: a region fails and traffic shifts automatically. This costs more than backups, but it is the right approach for truly critical services. For lower-SLA tiers, backups alone are usually enough. Keep backups even with active-active, because replication copies deletions and corruption to every region.

GitOps as DR

If all cluster manifests live in git and you manage secrets with External Secrets Operator (ESO) or SOPS, rebuilding a cluster from scratch takes four steps:

1. Provision a new cluster (15 min on managed).
2. Install Flux or ArgoCD (2 min).
3. Point it at the git repo; all workloads reconcile automatically (5–15 min).
4. Restore PV data from Velero backup (time depends on data volume).

This is why GitOps dramatically simplifies DR: git is the runbook and the recovery tool simultaneously.
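The rebuild steps can be captured as a script that prints its commands by default, so the runbook can be reviewed and rehearsed without touching anything. This is a sketch under several assumptions: EKS provisioned with `eksctl`, Flux bootstrapped from a GitHub repository (which needs a `GITHUB_TOKEN` in the environment), a `clusters/<name>` path layout, and Velero installed by the GitOps repo before the restore runs. `run` and `rebuild_cluster` are hypothetical names.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 = cluster name, $2 = GitHub owner, $3 = repository, $4 = Velero backup name
rebuild_cluster() {
  # Step 1: provision a new cluster (region is an assumed example value).
  run eksctl create cluster --name "$1" --region us-east-1
  # Steps 2 and 3: install Flux and point it at the git repo.
  run flux bootstrap github --owner "$2" --repository "$3" --path "clusters/$1"
  # Step 4: restore volume data from the chosen backup.
  # Velero skips objects that already exist, so check in a drill that
  # empty PVCs created by the GitOps tool do not block the data restore.
  run velero restore create --from-backup "$4"
}
```

Calling `rebuild_cluster dr-test my-org fleet-infra nightly-full-20240115020000` prints the three commands; setting `DRY_RUN=0` executes them.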

Testing Recovery

An untested backup is not a backup. Quarterly recovery drills should be mandatory:

# Quarterly DR drill checklist:
# 1. Spin up a test cluster
# 2. Restore latest Velero backup to test cluster
# 3. Verify all deployments come up (kubectl get pods -A)
# 4. Spot-check application data in restored PVs
# 5. Measure actual RTO — compare to your RTO target
# 6. Document any gaps and fix before next quarter

# Automate backup verification
# (latest-backup is a placeholder; use a real name from `velero backup get`)
velero backup describe latest-backup --details | grep -E "Phase|Errors"
# Expected: Phase: Completed, Errors: 0
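A grep still needs a person to read the result. For a cron job or drill script, the same check can return a pass/fail exit code. The following is a minimal sketch, assuming the describe output contains `Phase:` and `Errors:` lines as in the expected output above; `backup_ok` is a hypothetical helper name.

```shell
# Read `velero backup describe` output on stdin.
# Exit 0 only if the backup phase is Completed and the error count is 0.
backup_ok() {
  awk '
    # Use the first Phase: and Errors: lines, which describe the backup itself.
    $1 == "Phase:"  && !have_phase  { phase = $2;  have_phase = 1 }
    $1 == "Errors:" && !have_errors { errors = $2; have_errors = 1 }
    END { exit !(have_phase && have_errors && phase == "Completed" && errors + 0 == 0) }
  '
}

# Example usage:
#   velero backup describe "$name" --details | backup_ok || echo "backup $name needs attention"
```

A missing `Phase:` or `Errors:` line counts as a failure, so an empty or truncated describe output does not pass silently.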

kubectl Commands

# Check Velero backup storage location
velero backup-location get

# Check all schedules
velero schedule get

# Check Velero pod health
kubectl get pods -n velero

# List recent backups with status
velero backup get --output table

# Describe a restore operation
velero restore describe my-restore --details

# For etcd: check etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key