Storage Patterns for Stateful Apps
Running stateless apps on Kubernetes is straightforward. Running stateful workloads — databases, message queues, distributed caches — requires careful thought about storage identity, replication, availability, and recovery. This guide covers the patterns that actually work in production, and the cases where "just use managed" is the right answer.
StatefulSet + PVC Templates
A StatefulSet's volumeClaimTemplates field provisions a dedicated PVC for each pod, named <template-name>-<pod-name> (for example, data-postgres-0). Deployment replicas all reference the same PVC, whereas each StatefulSet pod owns private storage, which is essential for databases where every replica keeps its own data directory.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres            # headless Service for stable DNS
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:            # one PVC created per pod
    - metadata:
        name: data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: aws-gp3-retain   # use Retain policy for prod data!
        resources:
          requests:
            storage: 100Gi
By default, PVCs created by volumeClaimTemplates are intentionally left in place when you delete or scale down the StatefulSet, so Kubernetes won't accidentally delete your database. You must delete the PVCs manually after confirming the data is no longer needed. To remove everything: scale to 0 first, then delete the StatefulSet, then delete the PVCs.
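The naming rule and the teardown order can be sketched as two small shell helpers. This is a minimal sketch, not a definitive procedure: sts_pvc_names and teardown_statefulset are hypothetical names, and the KUBECTL variable is an assumption added here so the commands can be previewed without touching a cluster.

```shell
# Hypothetical helpers; KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Print the PVC names a StatefulSet creates: <template>-<statefulset>-<ordinal>.
# Usage: sts_pvc_names TEMPLATE STATEFULSET REPLICAS
sts_pvc_names() {
  pvc_i=0
  while [ "$pvc_i" -lt "$3" ]; do
    echo "$1-$2-$pvc_i"
    pvc_i=$((pvc_i + 1))
  done
}

# Scale to 0, delete the StatefulSet, then delete each PVC, in that order.
# Usage: teardown_statefulset TEMPLATE STATEFULSET REPLICAS NAMESPACE
teardown_statefulset() {
  $KUBECTL scale statefulset "$2" --replicas=0 -n "$4"
  $KUBECTL delete statefulset "$2" -n "$4"
  for pvc_name in $(sts_pvc_names "$1" "$2" "$3"); do
    $KUBECTL delete pvc "$pvc_name" -n "$4"
  done
}
```

Setting KUBECTL=echo and then calling teardown_statefulset data postgres 3 production prints the commands instead of running them. With a Retain reclaim policy, deleting a PVC still leaves the PersistentVolume and the underlying disk in place until they are removed separately.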
RWO Limitations
Most cloud block storage (EBS, GCP PD, Azure Disk) supports only ReadWriteOnce: the volume can be attached to one node at a time. This constrains StatefulSets: if a node fails and the pod is rescheduled, the new node cannot attach the disk until the old node releases it. It typically takes 6–10 minutes for the control plane to detect the node failure and force-detach the volume.
Strategies to reduce RWO attachment delays:
- Set terminationGracePeriodSeconds: 30 (the Kubernetes default) or higher, so the pod has time to flush and release the volume cleanly during a graceful shutdown such as a node drain
- Use a PodDisruptionBudget to limit voluntary disruptions during upgrades
- For Redis or Kafka, prefer multi-AZ block storage or switch to a distributed storage layer (Longhorn, Ceph) that supports RWX or has its own replication
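As an illustration of the PodDisruptionBudget strategy, the helper below prints a policy/v1 manifest that can be piped into kubectl apply. It is a sketch under assumptions: pdb_manifest is a hypothetical function, the name postgres-pdb is made up, and the app label matches the postgres StatefulSet used elsewhere in this guide.

```shell
# Print a PodDisruptionBudget manifest (sketch; all names are illustrative).
# Usage: pdb_manifest NAME NAMESPACE APP_LABEL MAX_UNAVAILABLE
pdb_manifest() {
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $1
  namespace: $2
spec:
  maxUnavailable: $4
  selector:
    matchLabels:
      app: $3
EOF
}

# Example (needs cluster access):
#   pdb_manifest postgres-pdb production postgres 1 | kubectl apply -f -
```

With maxUnavailable: 1, a node drain evicts at most one matching pod at a time, so only one RWO volume is detaching and reattaching at any moment. A PodDisruptionBudget does not help with involuntary disruptions such as a node crash.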
Data Replication Patterns
Where replication lives depends on the technology stack:
| Pattern | Who replicates | Example | Trade-off |
|---|---|---|---|
| Application-level replication | The database itself | PostgreSQL streaming replication, Redis Sentinel, Kafka ISR | Battle-tested; storage can be plain RWO blocks |
| Storage-level replication | The distributed storage layer | Ceph/Rook, Longhorn, Portworx | Works for any app; adds storage cluster overhead |
| Cloud-native snapshots | Cloud provider | EBS snapshots, GCP disk snapshots | Point-in-time; not real-time HA |
Volume Expansion
You can grow PVCs on a StorageClass with allowVolumeExpansion: true. The volumeClaimTemplates field of an existing StatefulSet is immutable, so patch the PVCs individually, then re-create the StatefulSet with the updated template (the template only affects PVCs created for new pods).
# Expand all data PVCs in a StatefulSet with 3 replicas
for i in 0 1 2; do
  kubectl patch pvc "data-postgres-$i" -n production \
    -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
done
# Update the volumeClaimTemplate so future pods (scale-up) also get 200Gi.
# The field is immutable, so delete the StatefulSet with --cascade=orphan
# (pods keep running; PVCs are untouched) and re-create it with the new template.
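The re-creation step could look like the following sketch. recreate_statefulset is a hypothetical helper, the manifest filename in the example is an assumption, and KUBECTL is an override added here so the commands can be previewed with KUBECTL=echo.

```shell
# Hypothetical helper; KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Usage: recreate_statefulset NAME NAMESPACE MANIFEST
recreate_statefulset() {
  # Delete only the StatefulSet object; --cascade=orphan leaves its pods running.
  $KUBECTL delete statefulset "$1" -n "$2" --cascade=orphan
  # Re-create it from a manifest whose volumeClaimTemplates request the new size.
  $KUBECTL apply -f "$3" -n "$2"
}

# Example (filename is illustrative):
#   recreate_statefulset postgres production postgres-statefulset.yaml
```

The re-created StatefulSet adopts the orphaned pods that match its selector, so the database keeps serving while the template is swapped.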
Volume Snapshots
Kubernetes volume snapshots (GA in 1.20) let you take point-in-time copies of a PVC using the storage provider's snapshot mechanism. Kubernetes does not quiesce the application, so a snapshot is crash-consistent unless the application flushes or pauses writes first; application-consistent snapshots are the app's responsibility.
# Create a snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap-2024-01
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-vsc   # VolumeSnapshotClass for EBS
  source:
    persistentVolumeClaimName: data-postgres-0
---
# Restore from snapshot: create a new PVC pre-populated with snapshot data
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-restored
  namespace: production
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: aws-gp3
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: postgres-snap-2024-01
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
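A snapshot is only usable as a dataSource once its status reports readyToUse: true. One way to block until then is sketched below. It assumes kubectl 1.23 or newer (for the jsonpath form of kubectl wait); wait_for_snapshot is a hypothetical helper and the 300s default timeout is arbitrary.

```shell
# Hypothetical helper; KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Block until the VolumeSnapshot reports status.readyToUse=true.
# Usage: wait_for_snapshot NAME NAMESPACE [TIMEOUT]
wait_for_snapshot() {
  $KUBECTL wait "volumesnapshot/$1" -n "$2" \
    --for=jsonpath='{.status.readyToUse}'=true \
    --timeout="${3:-300s}"
}

# Example:
#   wait_for_snapshot postgres-snap-2024-01 production && \
#     kubectl apply -f restore-pvc.yaml   # filename is illustrative
```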
Backup with Velero
Velero is a widely used open-source backup tool for Kubernetes. It backs up Kubernetes resource manifests (to object storage) and optionally volume data (via filesystem backup or volume snapshots).
# Install Velero with AWS backend
# (./credentials-velero is an assumed path to an AWS credentials file;
#  use --no-secret instead when Velero gets credentials from an IAM role)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10 \
  --bucket my-velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1
# Create a backup of the production namespace (resources + volume snapshots)
velero backup create "prod-backup-$(date +%F)" \
  --include-namespaces production \
  --snapshot-volumes
# Restore from a backup
velero restore create --from-backup prod-backup-2024-01-15
# Schedule daily backups at 02:00
velero schedule create daily-prod \
  --schedule="0 2 * * *" \
  --include-namespaces production \
  --ttl 720h   # keep for 30 days
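A backup is only as good as its last tested restore. The sketch below restores a backup into a scratch namespace so the production namespace is untouched. restore_drill, the drill- name prefix, and the restore-drill namespace are hypothetical; the VELERO variable is an override added here so the commands can be previewed with VELERO=echo.

```shell
# Hypothetical helper; VELERO can be overridden (e.g. VELERO=echo) for a dry run.
VELERO="${VELERO:-velero}"

# Restore BACKUP, remapping SOURCE_NS to TARGET_NS, then show the restore status.
# Usage: restore_drill BACKUP SOURCE_NS TARGET_NS
restore_drill() {
  $VELERO restore create "drill-$1" \
    --from-backup "$1" \
    --namespace-mappings "$2:$3"
  $VELERO restore describe "drill-$1"
}

# Example:
#   restore_drill prod-backup-2024-01-15 production restore-drill
```

After the restore completes, checking that the database in the scratch namespace starts and contains the expected data is the part that actually validates the backup.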
Managed vs Self-Hosted
Before running a database on Kubernetes, honestly assess the operational cost:
| | Self-hosted on K8s | Managed service |
|---|---|---|
| HA setup | You configure replication, failover, and fencing | Automatic |
| Upgrades | You manage rolling upgrades across replicas | Automatic or one-click |
| Backups | You run Velero or custom jobs, test restores | Automatic, point-in-time restore included |
| Encryption | You configure TLS and etcd Secret encryption | Automatic, usually with KMS integration |
| Cost | Cluster compute; no license fee | Service premium (typically 2–3× raw compute) |
| Expertise needed | Deep DB + Kubernetes knowledge required | SQL/API only |
Self-hosting makes sense when: you need to run in a private network without cloud egress, you're already using a Kubernetes operator (Zalando postgres-operator, Strimzi Kafka, CloudNativePG) that handles HA and upgrades, or cost at scale makes the managed service premium prohibitive. For most teams starting out, managed databases buy back engineering time that compounds over years.
kubectl Commands
# List all PVCs created by a StatefulSet's volumeClaimTemplates
kubectl get pvc -n production -l app=postgres
# Check which node a StatefulSet pod is running on (for AZ awareness)
kubectl get pods -n production -o wide -l app=postgres
# Describe a PVC to see capacity, StorageClass, bound PV
kubectl describe pvc data-postgres-0 -n production
# Scale a StatefulSet down (PVCs are preserved)
kubectl scale statefulset postgres --replicas=0 -n production
# View the volumes the control plane reports as attached to a node
kubectl get node node-1 -o jsonpath='{.status.volumesAttached[*].name}'
# Force-detach a stuck RWO volume (node offline) — use with care
kubectl delete volumeattachment <attachment-name>
# List VolumeSnapshots
kubectl get volumesnapshot -n production