Horizontal Pod Autoscaler (HPA)
The Horizontal Pod Autoscaler automatically adjusts the number of pod replicas in a Deployment (or StatefulSet, ReplicaSet) based on observed metrics — CPU utilisation, memory usage, or custom application metrics. When traffic spikes, the HPA adds replicas. When it drops, it scales back down. It does this in a continuous control loop, targeting a configured utilisation percentage.
What Is the HPA?
"Horizontal" scaling means adding more pods (vs. "vertical" scaling, which means giving each pod more CPU/memory). The HPA reads metrics from the Metrics API, computes the desired replica count, and updates the target resource's spec.replicas. The Deployment controller then does the actual scaling.
The HPA targets a percentage of requested resources, not an absolute value. If your pod requests 500m CPU and you set a target of 50%, the HPA tries to keep average CPU usage per pod at 250m by scaling in or out.
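The arithmetic above can be checked with a small shell sketch. The function names `target_millicores` and `utilisation_pct` are hypothetical helpers invented for illustration, not part of kubectl or any Kubernetes tooling, and integer division is assumed to be precise enough.

```shell
# Convert a utilisation target (percent of the request) into an
# absolute per-pod CPU figure in millicores.
target_millicores() {
  request_m=$1    # CPU request per pod in millicores (500 for 500m)
  target_pct=$2   # HPA target utilisation in percent
  echo $(( request_m * target_pct / 100 ))
}

# Express observed usage as a percentage of the request, which is the
# figure the HPA compares against its target.
utilisation_pct() {
  usage_m=$1      # observed CPU usage per pod in millicores
  request_m=$2    # CPU request per pod in millicores
  echo $(( usage_m * 100 / request_m ))
}

target_millicores 500 50   # prints 250
utilisation_pct 400 500    # prints 80
```

A pod using 400m against a 500m request sits at 80% utilisation, above a 50% target, so the HPA would add replicas.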
How It Works
The HPA controller runs a reconciliation loop every 15 seconds by default (configurable with the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period). Each iteration:
- Query the Metrics API for current metric values across all pods.
- Calculate the desired replica count:
  desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))
- Clamp to minReplicas…maxReplicas.
- Apply scale-up/scale-down stabilisation windows to prevent flapping.
- Update the target resource's spec.replicas if the value changed.
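The calculate-and-clamp steps can be sketched in shell as follows. `desired_replicas` is a hypothetical helper, not a Kubernetes command. It uses integer arithmetic for the ceiling and deliberately leaves out details of the real controller such as the tolerance band around the target (10% by default) and the handling of pods that are not ready or have no metrics.

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC MIN MAX
# Metric values are whole numbers, e.g. utilisation percentages.
desired_replicas() {
  current=$1; metric=$2; target=$3; min=$4; max=$5

  # Integer ceiling of current * metric / target.
  desired=$(( (current * metric + target - 1) / target ))

  # Clamp to the minReplicas..maxReplicas range.
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi

  echo "$desired"
}

desired_replicas 4 90 60 2 10    # prints 6  (4 x 90/60 = 6)
desired_replicas 3 70 60 2 10    # prints 4  (3 x 70/60 = 3.5, rounded up)
desired_replicas 8 120 60 2 10   # prints 10 (16 clamped to maxReplicas)
desired_replicas 4 20 60 2 10    # prints 2  (1.33 rounds up to 2, the minimum)
```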
Prerequisites
The HPA reads resource metrics (CPU, memory) from the Metrics Server, a cluster add-on that collects resource usage from the kubelet on each node. Without it, the HPA cannot scale on CPU or memory.
# Check if metrics-server is running
kubectl get deployment metrics-server -n kube-system
# Install metrics-server (if missing)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify it's working
kubectl top nodes
kubectl top pods
Metrics Server provides lightweight, real-time resource metrics for the HPA and kubectl top. It does not store history. Prometheus stores historical metrics and can feed custom/external metrics to the HPA via an adapter. For basic CPU/memory autoscaling, Metrics Server is all you need.
HPA YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # target 60% of CPU request
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70   # target 70% of memory request
autoscaling/v1 is still served but supports only a CPU utilisation target. autoscaling/v2 supports multiple metrics, memory, custom metrics, and the behavior block for controlling scale-up/down rates. Prefer v2 for new manifests.
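When an HPA lists several metrics, as the manifest above does, the controller computes a desired replica count for each metric and uses the largest. A rough shell sketch of that selection, where `max_of` is a hypothetical helper and the utilisation figures are made up for illustration:

```shell
# Print the largest of the integer arguments.
max_of() {
  largest=$1
  shift
  for n in "$@"; do
    if [ "$n" -gt "$largest" ]; then largest=$n; fi
  done
  echo "$largest"
}

current=4

# CPU at 45% against a 60% target: ceil(4 x 45/60) = 3
cpu_desired=$(( (current * 45 + 60 - 1) / 60 ))

# Memory at 84% against a 70% target: ceil(4 x 84/70) = 5
mem_desired=$(( (current * 84 + 70 - 1) / 70 ))

max_of "$cpu_desired" "$mem_desired"   # prints 5
```

Here memory drives the decision even though CPU alone would allow a scale-down.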
CPU & Memory Scaling
The HPA measures utilisation as a percentage of the container's resource request. This means resource requests must be set — without a CPU request, the HPA cannot calculate utilisation and will refuse to scale on that metric.
# HPA will NOT work if requests are missing:
containers:
  - name: api
    image: myapp:1.0
    # ← no resources block = HPA cannot scale on CPU/memory
Memory scaling has a subtlety: unlike CPU, memory usage often does not drop when load decreases, because many runtimes hold on to memory they have already allocated. Scaling on memory can therefore misbehave or flap: the HPA scales up on high usage, but per-pod usage stays high afterwards, since each new pod brings its own baseline footprint and existing pods rarely release memory. Use conservative memory targets (70–80%) and generous stabilisation windows.
Scale Behavior
The behavior block gives fine-grained control over how fast the HPA scales in each direction.
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # scale up immediately
      policies:
        - type: Percent
          value: 100                    # double replicas max per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min of low usage before scaling down
      policies:
        - type: Pods
          value: 1                      # remove at most 1 pod per minute
          periodSeconds: 60
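To see what these two policies allow within one period, here is a simplified shell sketch. `scale_up_limit` and `scale_down_limit` are hypothetical helpers. The sketch assumes the replica count did not otherwise change during the period, whereas the real controller tracks the count at the start of each period.

```shell
# Highest replica count a Percent scale-up policy allows in one period.
scale_up_limit() {
  current=$1; percent=$2
  # current + ceil(current * percent / 100)
  echo $(( current + (current * percent + 99) / 100 ))
}

# Lowest replica count a Pods scale-down policy allows in one period,
# never going below minReplicas.
scale_down_limit() {
  current=$1; pods=$2; min=$3
  limit=$(( current - pods ))
  if [ "$limit" -lt "$min" ]; then limit=$min; fi
  echo "$limit"
}

scale_up_limit 4 100     # prints 8: a 100% policy lets 4 replicas become 8
scale_down_limit 8 1 2   # prints 7: at most 1 pod removed per period
scale_down_limit 2 1 2   # prints 2: already at minReplicas
```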
| Parameter | Default | Effect |
|---|---|---|
| stabilizationWindowSeconds (scale down) | 300s | HPA won't scale down until metric has been below threshold for this long. Prevents thrashing. |
| stabilizationWindowSeconds (scale up) | 0s | HPA scales up immediately. Increase if you want to dampen temporary spikes. |
| policies[].type: Pods | — | Max N pods changed per period. |
| policies[].type: Percent | — | Max N% of current replicas changed per period. |
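The scale-down stabilisation window works by remembering the replica counts the controller computed during the window and acting on the highest of them. A minimal shell sketch, where `stabilised_scale_down` is a hypothetical helper and the recommendation values are invented:

```shell
# stabilised_scale_down RECOMMENDATION...
# Arguments are the desired replica counts computed during the
# stabilisation window, oldest first. The highest one wins, so a single
# low reading cannot trigger a scale-down on its own.
stabilised_scale_down() {
  chosen=$1
  shift
  for rec in "$@"; do
    if [ "$rec" -gt "$chosen" ]; then chosen=$rec; fi
  done
  echo "$chosen"
}

stabilised_scale_down 8 6 7 3   # prints 8: a recent high keeps replicas up
stabilised_scale_down 3 3 3 3   # prints 3: load stayed low for the whole window
```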
Custom & External Metrics
Beyond CPU and memory, the HPA can scale on custom metrics (application-emitted, e.g. requests-per-second from Prometheus) or external metrics (outside the cluster, e.g. queue depth from SQS). Both require a metrics adapter (e.g. prometheus-adapter or KEDA) to expose them through the Custom Metrics API or, for external metrics, the External Metrics API.
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"   # 100 req/s per pod
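With an AverageValue target, the replica formula reduces to the total metric across all pods divided by the per-pod target, rounded up. A small shell sketch, using the hypothetical helper `desired_for_average_value` and made-up traffic figures:

```shell
# desired_for_average_value TOTAL_METRIC TARGET_PER_POD
# Integer ceiling of total / per-pod target.
desired_for_average_value() {
  total=$1; per_pod=$2
  echo $(( (total + per_pod - 1) / per_pod ))
}

desired_for_average_value 450 100   # prints 5: 450 req/s at 100 req/s per pod
desired_for_average_value 400 100   # prints 4
```

The result is still clamped to minReplicas and maxReplicas afterwards.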
Why Your HPA Isn't Scaling
The most common reasons an HPA fails to scale:
| Symptom | Cause | Fix |
|---|---|---|
| unknown for current metrics | Metrics Server not installed or not ready | Install metrics-server; check kubectl top pods |
| Metrics show but replicas don't change | Metric below threshold; still in stabilisation window | Check kubectl describe hpa — look at "Conditions" |
| HPA ignores CPU metric | Pod has no CPU request set | Add resources.requests.cpu to container spec |
| Replicas stuck at maxReplicas | Load exceeds what N pods can handle | Increase maxReplicas, or scale the nodes (Cluster Autoscaler) |
| Rapid scale-up/scale-down cycling | Stabilisation window too short or target too tight | Increase behavior.scaleDown.stabilizationWindowSeconds |
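Most of these checks come down to reading the HPA's status conditions. The wrapper below is a sketch for printing them one per line. `hpa_conditions` is a hypothetical function name, and the namespace default of `default` is an assumption. It relies only on `kubectl get` with a jsonpath template over `.status.conditions`.

```shell
# hpa_conditions NAME [NAMESPACE]
# Prints one line per condition: type, status, reason (tab-separated).
hpa_conditions() {
  name=$1
  namespace=${2:-default}   # assumed default namespace
  kubectl get hpa "$name" -n "$namespace" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}
```

Running `hpa_conditions api-hpa` lists conditions such as AbleToScale, ScalingActive, and ScalingLimited. A ScalingActive status of False usually points at a metrics problem, and ScalingLimited set to True means the replica count is pinned at minReplicas or maxReplicas.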
kubectl Commands
# Apply HPA from manifest
kubectl apply -f hpa.yaml
# Create HPA imperatively
kubectl autoscale deployment api --min=2 --max=10 --cpu-percent=60
# Check HPA status
kubectl get hpa
kubectl describe hpa api-hpa
# Watch HPA decisions in real time
kubectl get hpa api-hpa -w
# Check current resource usage
kubectl top pods -l app=api