Backup & Disaster Recovery
A Kubernetes cluster has two things to protect: the control plane state (etcd — all API objects, Secrets, ConfigMaps, RBAC, CRDs) and the workload state (persistent volume data, application-level databases). Losing either without a recovery plan means starting from scratch. Most teams don't discover their DR plan is broken until they need it.
What to Back Up
etcd Snapshots
For self-managed clusters, etcd snapshots are the control plane backup. On managed clusters (EKS, GKE, AKS), the provider operates and backs up etcd automatically, but that protects the provider's control plane as a whole: recovery is "restore the whole cluster", not "restore a specific object", so object-level backups are still needed.
# Take a snapshot (run on a control-plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
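# Verify the snapshot (since etcd 3.5 this subcommand is deprecated in favor of etcdutl snapshot status)
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-20240115-1000.db \
  --write-out=table
# Output:
# +----------+----------+------------+------------+
# |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
# +----------+----------+------------+------------+
# | abc12345 |    87432 |       1203 |     4.2 MB |
# +----------+----------+------------+------------+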
# To restore: first stop kube-apiserver on every control-plane node (move its static pod manifest out)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
# Restore the snapshot to a new data directory
# (since etcd 3.5 this subcommand is deprecated in favor of etcdutl snapshot restore)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-20240115-1000.db \
  --data-dir=/var/lib/etcd-restore \
  --name=master-1 \
  --initial-cluster=master-1=https://192.168.1.10:2380 \
  --initial-advertise-peer-urls=https://192.168.1.10:2380
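# Update the etcd static pod to use the restored data dir
# Edit /etc/kubernetes/manifests/etcd.yaml (kubeadm layout assumed):
#   - set --data-dir=/var/lib/etcd-restore
#   - point the etcd-data hostPath volume and its mountPath at /var/lib/etcd-restore,
#     otherwise the container will not see the restored files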
# Move kube-apiserver manifest back
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
Restoring an etcd snapshot rolls back ALL cluster state to the snapshot point — including Deployments, Secrets, RBAC, and CRDs. Any changes made after the snapshot are lost. Always test restore in a non-production cluster before you need it in production.
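Snapshots taken on a schedule accumulate on disk, so some retention step is usually needed. The following is a minimal sketch, assuming snapshots are named etcd-<timestamp>.db as produced by the save command above and all live in one directory. The function name prune_snapshots and its arguments are hypothetical and not part of etcdctl.

```shell
# prune_snapshots DIR KEEP
# Delete all but the KEEP most recent etcd-*.db files in DIR.
# Relies on the timestamp in the file name sorting chronologically.
prune_snapshots() {
    dir=$1
    keep=$2
    # Emit existing snapshot paths, newest first, then skip the first $keep.
    for f in "$dir"/etcd-*.db; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
        rm -f -- "$old"
        echo "removed $old"
    done
}
```

For example, prune_snapshots /backup 14 would keep the fourteen newest snapshots. Copying snapshots off the node before pruning is a separate concern this sketch does not cover.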
Velero — Workload Backup
Velero backs up Kubernetes API objects and PersistentVolume data to object storage (S3, GCS, Azure Blob). It can restore to the same cluster or a different one — making it a migration tool as well.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups-mycluster \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero   # AWS credentials file
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-full
  namespace: velero
spec:
  schedule: "0 2 * * *"                # 02:00 UTC daily
  template:
    ttl: 720h                          # 30 days
    storageLocation: default
    volumeSnapshotLocations:
      - default
    includedNamespaces:
      - production
      - staging
    snapshotVolumes: true              # snapshot PVs with CSI or cloud snapshots
    defaultVolumesToFsBackup: false    # use volume snapshots, not file-level backup
# Take an on-demand backup
velero backup create pre-upgrade-backup \
  --include-namespaces production \
  --snapshot-volumes
# Check backup status
velero backup describe pre-upgrade-backup
velero backup logs pre-upgrade-backup
# List available backups
velero backup get
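# Restore the production namespace under a different name
# (to restore into a different cluster, run this against a Velero install in that cluster)
velero restore create --from-backup pre-upgrade-backup \
  --include-namespaces production \
  --namespace-mappings production:production-restored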
Restore Procedures
Different failure scenarios require different recovery approaches:
| Scenario | Recovery approach |
|---|---|
| Accidental resource deletion | Restore the deleted resources from a recent backup with velero restore create. For GitOps clusters, kubectl apply the manifest from git. |
| Namespace data corruption | Restore only the affected namespace from the last known-good backup (velero restore create --include-namespaces). |
| etcd corruption (self-managed) | Restore etcd snapshot. Restart kube-apiserver. Verify cluster state. |
| Control plane node lost | Replace node, rejoin to etcd cluster, or restore from snapshot on a new node. |
| Full cluster loss | Provision new cluster → restore Velero backup → update DNS. RTO depends on cluster provisioning time. |
RTO & RPO
Set realistic targets before an incident, not during one:
- RPO (Recovery Point Objective) — how much data loss is acceptable? A 6-hour backup schedule means up to 6 hours of data can be lost. For databases, use continuous WAL archiving alongside volume snapshots.
- RTO (Recovery Time Objective) — how long can you be down? Cluster provisioning on EKS takes ~15 minutes. Add Velero restore time (depends on volume size). Full recovery from total loss is rarely under 30 minutes.
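The RPO bullet can be turned into a simple check: compare the age of the newest backup against the target. This is a rough sketch under stated assumptions. The function name rpo_ok is hypothetical, and timestamps are passed in as Unix epoch seconds so that no date parsing is needed (how you obtain the last backup time is left open).

```shell
# rpo_ok LAST_BACKUP_EPOCH NOW_EPOCH RPO_HOURS
# Succeeds (exit 0) if the newest backup is no older than the RPO.
rpo_ok() {
    age_seconds=$(( $2 - $1 ))
    limit_seconds=$(( $3 * 3600 ))
    [ "$age_seconds" -le "$limit_seconds" ]
}

# Example: the last backup finished 5 hours ago and the RPO is 6 hours.
now=$(date +%s)
if rpo_ok "$((now - 5 * 3600))" "$now" 6; then
    echo "within RPO"
else
    echo "RPO exceeded"
fi
```

A check like this could run from a monitoring job so that a silently failing schedule is noticed before the backup is needed.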
Multi-region active-active with a global load balancer and database replication has near-zero RTO: when a region fails, traffic shifts automatically. This costs more than backups but is the right approach for truly critical services. Backup-and-restore on its own suits lower-SLA tiers.
GitOps as DR
If all cluster manifests live in git and secrets are handled with External Secrets Operator (ESO) or SOPS, rebuilding a cluster from scratch takes four steps:
- Provision a new cluster (15 min on managed).
- Install Flux or ArgoCD (2 min).
- Point it at the git repo — all workloads reconcile automatically (5–15 min).
- Restore PV data from Velero backup (time depends on data volume).
This is why GitOps dramatically simplifies DR: git is the runbook and the recovery tool simultaneously.
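The step timings above add up to a rough RTO estimate. The sketch below assumes the steps run one after another and that PV restore time scales linearly with data size. The function name estimate_rto and the throughput figure are hypothetical, so measure your real restore throughput in a drill.

```shell
# estimate_rto DATA_GB GB_PER_MIN
# Prints best-case and worst-case recovery time in minutes.
estimate_rto() {
    data_gb=$1
    gb_per_min=$2          # assumed restore throughput, not a measured value
    provision=15           # provision a managed cluster
    install=2              # install Flux or ArgoCD
    reconcile_min=5        # workloads reconcile, best case
    reconcile_max=15       # workloads reconcile, worst case
    # Ceiling division so a partial minute counts as a full one.
    restore=$(( (data_gb + gb_per_min - 1) / gb_per_min ))
    echo "best=$(( provision + install + reconcile_min + restore ))"
    echo "worst=$(( provision + install + reconcile_max + restore ))"
}

# 500 GB of PV data at an assumed 10 GB/min
estimate_rto 500 10
# prints:
# best=72
# worst=82
```

With these numbers the volume restore dominates, which is why the actual figure should come from a drill rather than from an estimate.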
Testing Recovery
An untested backup is not a backup. Quarterly recovery drills should be mandatory:
# Quarterly DR drill checklist:
# 1. Spin up a test cluster
# 2. Restore latest Velero backup to test cluster
# 3. Verify all deployments come up (kubectl get pods -A)
# 4. Spot-check application data in restored PVs
# 5. Measure actual RTO — compare to your RTO target
# 6. Document any gaps and fix before next quarter
# Automate backup verification
velero backup describe latest-backup --details | grep -E "Phase|Errors"
# Expected: Phase: Completed, Errors: 0
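The grep above still needs a human to read the result. A small wrapper can turn it into an exit code for cron or CI. This is a sketch, assuming the describe output contains "Phase:" and "Errors:" lines as the expected output above suggests. The function name verify_backup is hypothetical, and a missing "Errors:" line is treated as a failure.

```shell
# verify_backup NAME
# Succeeds only if velero reports Phase: Completed and Errors: 0 for the backup.
verify_backup() {
    details=$(velero backup describe "$1" --details) || return 1
    # Take the second field of the first line whose first field matches.
    phase=$(printf '%s\n' "$details" | awk '$1 == "Phase:" { print $2; exit }')
    errors=$(printf '%s\n' "$details" | awk '$1 == "Errors:" { print $2; exit }')
    if [ "$phase" = "Completed" ] && [ "$errors" = "0" ]; then
        echo "backup $1 OK"
    else
        echo "backup $1 FAILED (phase=${phase:-unknown}, errors=${errors:-unknown})" >&2
        return 1
    fi
}
```

Usage would look like verify_backup pre-upgrade-backup || exit 1 at the top of an upgrade script, so the upgrade does not start without a good backup.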
Useful Commands
# Check Velero backup storage location
velero backup-location get
# Check all schedules
velero schedule get
# Check Velero pod health
kubectl get pods -n velero
# List recent backups with status
velero backup get --output table
# Describe a restore operation
velero restore describe my-restore --details
# For etcd: check etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key