Storage Patterns for Stateful Apps

● Advanced ⏱ 20 min read

Running stateless apps on Kubernetes is straightforward. Running stateful workloads — databases, message queues, distributed caches — requires careful thought about storage identity, replication, availability, and recovery. This guide covers the patterns that actually work in production, and the cases where "just use managed" is the right answer.

StatefulSet + PVC Templates

A StatefulSet's volumeClaimTemplates field automatically provisions a dedicated PVC for each pod, named <template-name>-<pod-name> (for example, data-postgres-0). Unlike the pods of a Deployment, which all reference the same PVC, each StatefulSet pod owns private storage. That is essential for databases, where each replica has its own data directory.

statefulset-with-pvc.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres          # headless Service for stable DNS
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
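        env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        - name: POSTGRES_PASSWORD      # the official image refuses to initialize without it
          valueFrom:
            secretKeyRef:
              name: postgres-credentials   # hypothetical Secret; create it separately
              key: password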
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:          # one PVC created per pod
  - metadata:
      name: data
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: aws-gp3-retain   # use Retain policy for prod data!
      resources:
        requests:
          storage: 100Gi
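
The serviceName field in the manifest above points at a headless Service, which has to exist for the pods to get stable DNS names. A minimal sketch of that Service follows; the port number and port name are assumptions based on PostgreSQL's default, not something the original manifest specifies.

```yaml
# Headless Service backing the StatefulSet's stable network identity.
apiVersion: v1
kind: Service
metadata:
  name: postgres              # must match spec.serviceName in the StatefulSet
  namespace: production
spec:
  clusterIP: None             # "None" makes the Service headless: DNS resolves to pod IPs
  selector:
    app: postgres             # same label as the StatefulSet pod template
  ports:
  - name: postgres            # assumed port name
    port: 5432                # assumed: PostgreSQL's default port
```

With this in place, and assuming the default cluster domain, each pod is reachable at a name of the form postgres-0.postgres.production.svc.cluster.local.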

StatefulSet: per-pod PVCs with stable identity

Pod	Role	PVC	Size	Backing volume
postgres-0	primary	data-postgres-0	100 GiB	EBS vol-aaa
postgres-1	replica	data-postgres-1	100 GiB	EBS vol-bbb
postgres-2	replica	data-postgres-2	100 GiB	EBS vol-ccc

PVCs survive pod deletion: if postgres-0 crashes and restarts, it binds to the same data-postgres-0 PVC.

Each StatefulSet pod owns its PVC. Pod identity (ordinal) stays stable across restarts, so the pod always reattaches to the same volume.

⚠️

Deleting a StatefulSet does not delete its PVCs

By default, PVCs created by volumeClaimTemplates are left in place when you delete the StatefulSet, so Kubernetes won't accidentally delete your database. You must delete the PVCs manually after confirming the data is no longer needed. To remove everything cleanly: scale to 0 first, then delete the StatefulSet, then delete the PVCs.
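
Recent Kubernetes releases also expose this behavior as an explicit StatefulSet field, persistentVolumeClaimRetentionPolicy. The fragment below is a sketch that spells out the default (retain everything); check that your cluster version supports the field before relying on it.

```yaml
# Fragment of the postgres StatefulSet: only the retention policy is shown.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain   # default; Delete would remove PVCs when the StatefulSet is deleted
    whenScaled: Retain    # default; Delete would remove PVCs of pods removed by a scale-down
  # serviceName, replicas, selector, template and volumeClaimTemplates stay as in the full manifest
```

Keeping both values at Retain is the safer choice for production data; Delete is mainly useful for disposable test environments.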

RWO Limitations

Most cloud block storage (EBS, GCP PD, Azure Disk) supports only ReadWriteOnce: the volume can be attached read-write to one node at a time. This creates a constraint for StatefulSets: if a node fails and the pod is rescheduled, the volume cannot attach to the new node until it is detached from the old one. With an unreachable node this typically takes 6–10 minutes, because Kubernetes first has to declare the node failed and evict the pod, and the attach/detach controller then waits about 6 minutes by default before force-detaching the volume.

Strategies to reduce RWO attachment delays:

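Set terminationGracePeriodSeconds high enough for the database to flush and shut down cleanly (30 seconds is the Kubernetes default), so that during drains and planned node shutdowns the volume is unmounted and released without a forced detach
Use a PodDisruptionBudget to limit voluntary disruptions during upgrades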
For Redis or Kafka, prefer multi-AZ block storage or switch to a distributed storage layer (Longhorn, Ceph) that supports RWX or has its own replication
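
As an illustration of the PodDisruptionBudget strategy, the sketch below allows at most one postgres pod to be evicted at a time during voluntary disruptions such as node drains. The object name is hypothetical; the selector reuses the app: postgres label from the StatefulSet.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb          # hypothetical name
  namespace: production
spec:
  maxUnavailable: 1           # a drain may evict only one postgres pod at a time
  selector:
    matchLabels:
      app: postgres
```

A PodDisruptionBudget only constrains voluntary evictions; it does nothing for an unplanned node failure.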

Data Replication Patterns

Where replication lives depends on the technology stack:

Pattern	Who replicates	Example	Trade-off
Application-level replication	The database itself	PostgreSQL streaming replication, Redis Sentinel, Kafka ISR	Battle-tested; storage can be plain RWO blocks
Storage-level replication	The distributed storage layer	Ceph/Rook, Longhorn, Portworx	Works for any app; adds storage cluster overhead
Cloud-native snapshots	Cloud provider	EBS snapshots, GCP disk snapshots	Point-in-time; not real-time HA

Volume Expansion

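You can grow (but not shrink) PVCs whose StorageClass sets allowVolumeExpansion: true. A StatefulSet's volumeClaimTemplates field is immutable, so you can't edit the size there. Instead, patch each PVC individually, then delete the StatefulSet with --cascade=orphan (which leaves its pods and PVCs in place) and re-create it with the larger size, so that pods added by a later scale-up get the new size too.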

# Expand all data PVCs in a StatefulSet with 3 replicas
for i in 0 1 2; do
  kubectl patch pvc data-postgres-$i -n production \
    -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
done

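# Update the volumeClaimTemplate so future pods (scale-up) also get 200Gi.
# volumeClaimTemplates is immutable, so delete the StatefulSet object only
# (--cascade=orphan keeps pods and PVCs) and re-apply it with storage: 200Gi.
kubectl delete statefulset postgres -n production --cascade=orphan
kubectl apply -f statefulset-with-pvc.yaml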
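
Expansion only succeeds if the StorageClass permits it. The following is a sketch of what the aws-gp3-retain class used by the StatefulSet might look like, assuming the AWS EBS CSI driver is installed; the provisioner, parameters, and binding mode are illustrative assumptions rather than values from the original manifests.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: aws-gp3-retain
provisioner: ebs.csi.aws.com              # assumption: AWS EBS CSI driver
parameters:
  type: gp3
reclaimPolicy: Retain                     # keep the EBS volume when the PVC is deleted
allowVolumeExpansion: true                # without this, growing a PVC is rejected
volumeBindingMode: WaitForFirstConsumer   # provision the volume in the zone where the pod lands
```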

Volume Snapshots

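Kubernetes volume snapshots (GA in 1.20) let you take point-in-time copies of a PVC using the storage provider's snapshot mechanism. They require a CSI driver that supports snapshots, plus the snapshot CRDs and snapshot controller installed in the cluster. Kubernetes itself does not quiesce the application: a snapshot is crash-consistent by default, and making it application-consistent (for example, by flushing or pausing writes first) is up to the app.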

volume-snapshot.yaml

# Create a snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap-2024-01
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-vsc   # VolumeSnapshotClass for EBS
  source:
    persistentVolumeClaimName: data-postgres-0
---
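# Restore from snapshot: create a new PVC pre-populated with snapshot data.
# A StatefulSet only adopts PVCs named <template-name>-<pod-name>, so use a name
# such as data-postgres-0 instead if a StatefulSet pod should mount the restored claim.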
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-restored
  namespace: production
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: aws-gp3
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: postgres-snap-2024-01
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
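
The VolumeSnapshot above refers to a VolumeSnapshotClass named csi-aws-vsc, which has to be defined separately. A minimal sketch, assuming the AWS EBS CSI driver; the deletion policy shown is one reasonable choice, not a requirement.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc
driver: ebs.csi.aws.com      # assumption: AWS EBS CSI driver
deletionPolicy: Retain       # keep the EBS snapshot even if the VolumeSnapshot object is deleted
```

With deletionPolicy: Delete, removing the VolumeSnapshot object would also remove the underlying cloud snapshot.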

Backup with Velero

Velero is a widely used open-source backup tool for Kubernetes. It backs up Kubernetes resource manifests to object storage and, optionally, volume data via volume snapshots or file-system backup.

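# Install Velero with AWS backend
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero   # AWS credentials file; the path is an assumption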

# Create a backup of the production namespace (resources + volume snapshots)
velero backup create prod-backup-$(date +%F) \
  --include-namespaces production \
  --snapshot-volumes

# Restore from a backup
velero restore create --from-backup prod-backup-2024-01-15

# Schedule daily backups at 02:00
velero schedule create daily-prod \
  --schedule="0 2 * * *" \
  --include-namespaces production \
  --ttl 720h   # keep for 30 days
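
If you manage cluster configuration declaratively, the same daily schedule can be expressed as a Velero Schedule resource. This is a sketch that assumes Velero runs in its default velero namespace.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-prod
  namespace: velero            # assumption: Velero's default install namespace
spec:
  schedule: "0 2 * * *"        # daily at 02:00
  template:
    includedNamespaces:
    - production
    snapshotVolumes: true      # take volume snapshots along with the manifests
    ttl: 720h0m0s              # keep each backup for 30 days
```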

Managed vs Self-Hosted

Before running a database on Kubernetes, honestly assess the operational cost:

	Self-hosted on K8s	Managed service
HA setup	You configure replication, failover, and fencing	Automatic
Upgrades	You manage rolling upgrades across replicas	Automatic or one-click
Backups	You run Velero or custom jobs, test restores	Automatic, point-in-time restore included
Encryption	You configure TLS and etcd Secret encryption	Automatic, usually with KMS integration
Cost	Cluster compute; no license fee	Service premium (typically 2–3× raw compute)
Expertise needed	Deep DB + Kubernetes knowledge required	SQL/API only

💡

When to self-host on Kubernetes

Self-hosting makes sense when: you need to run in a private network without cloud egress, you're already using a Kubernetes operator (Zalando postgres-operator, Strimzi Kafka, CloudNativePG) that handles HA and upgrades, or cost at scale makes the managed service premium prohibitive. For most teams starting out, managed databases buy back engineering time that compounds over years.
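
To give a sense of what an operator takes off your hands, here is a minimal sketch of a three-instance PostgreSQL cluster declared through CloudNativePG. It assumes the operator and its CRDs are already installed; the cluster name is hypothetical and the storage class reuses aws-gp3-retain from earlier in this guide.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres               # hypothetical name
  namespace: production
spec:
  instances: 3                 # one primary plus two replicas, managed by the operator
  storage:
    storageClass: aws-gp3-retain
    size: 100Gi
```

The operator creates the pods and PVCs itself and handles replication and failover, so you do not write the StatefulSet by hand.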

kubectl Commands

# List all PVCs created by a StatefulSet's volumeClaimTemplates
kubectl get pvc -n production -l app=postgres

# Check which node a StatefulSet pod is running on (for AZ awareness)
kubectl get pods -n production -o wide -l app=postgres

# Describe a PVC to see capacity, StorageClass, bound PV
kubectl describe pvc data-postgres-0 -n production

# Scale a StatefulSet down (PVCs are preserved)
kubectl scale statefulset postgres --replicas=0 -n production

# View volumes attached to a node
kubectl get node node-1 -o jsonpath='{.status.volumesAttached[*].name}'

# List VolumeAttachment objects (cluster-scoped) to find a stuck attachment
kubectl get volumeattachment

# Force-detach a stuck RWO volume (node offline); use with care
kubectl delete volumeattachment <attachment-name>

# List VolumeSnapshots
kubectl get volumesnapshot -n production