Horizontal Pod Autoscaler (HPA)

The Horizontal Pod Autoscaler automatically adjusts the number of pod replicas in a Deployment (or StatefulSet or ReplicaSet) based on observed metrics such as CPU utilisation, memory usage, or custom application metrics. When traffic spikes, the HPA adds replicas; when traffic drops, it scales back down. It does this in a continuous control loop that steers each metric towards a configured target.

What Is the HPA?

"Horizontal" scaling means adding more pods (vs. "vertical" scaling, which means giving each pod more CPU/memory). The HPA reads metrics from the Metrics API, computes the desired replica count, and updates the target resource's spec.replicas. The Deployment controller then does the actual scaling.

With a Utilization target, the HPA aims for a percentage of the requested resource rather than an absolute value. If your pod requests 500m CPU and you set a target of 50%, the HPA tries to keep average CPU usage per pod at 250m by scaling in or out.
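
To make the percentage concrete, here is a minimal Python sketch of the arithmetic. It assumes every pod carries the same CPU request (as pods of one Deployment normally do); the function names and the sample usage figures are illustrative and not part of any Kubernetes API.

```python
def target_usage_millicores(request_m: int, target_percent: int) -> float:
    """Absolute per-pod usage that a utilisation target corresponds to."""
    return request_m * target_percent / 100


def average_utilisation(usages_m: list[int], request_m: int) -> float:
    """Mean usage across pods, as a percentage of the per-pod request."""
    return 100 * sum(usages_m) / (len(usages_m) * request_m)


# A 500m request with a 50% target means roughly 250m per pod.
print(target_usage_millicores(500, 50))            # 250.0

# Three pods (made-up readings) using 400m, 380m and 420m of a 500m request.
print(average_utilisation([400, 380, 420], 500))   # 80.0
```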

How It Works

The HPA controller runs a reconciliation loop every 15 seconds (configurable). Each iteration:

1. Query the Metrics API for current metric values across all pods.
2. Calculate the desired replica count: desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))
3. Clamp to minReplicas…maxReplicas.
4. Apply scale-up/scale-down stabilisation windows to prevent flapping.
5. Update the target resource's spec.replicas if the value changed.

HPA scale-up example — target 50% CPU

current replicas	3
avg CPU per pod	80% of request
target CPU	50% of request
formula	ceil(3 × 80/50) = ceil(4.8) = 5
result	Scale up to 5 replicas

HPA formula: desired = ceil(currentReplicas × currentMetric/targetMetric)
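
The calculation and clamping steps can be sketched in a few lines of Python. This is a simplified model, not the controller's actual code: the function name is made up, and the `tolerance` parameter reflects the upstream default of 0.1 (the controller skips scaling when the metric ratio is within 10% of 1.0), which this article does not otherwise cover.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """One simplified HPA iteration: formula, tolerance check, then clamp."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current
    else:
        desired = math.ceil(current * ratio)
    # Clamp to the configured minReplicas..maxReplicas range.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(3, 80, 50, 2, 10))    # 5  (the worked example above)
print(desired_replicas(4, 52, 50, 2, 10))    # 4  (within tolerance, no change)
print(desired_replicas(8, 100, 50, 2, 10))   # 10 (16 wanted, clamped to max)
```

Stabilisation windows and rate policies are deliberately left out here; they act on the value this function returns.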

Prerequisites

The HPA reads resource metrics (CPU, memory) from the Metrics Server, a cluster add-on that collects resource usage from the kubelet on each node. Without it, the HPA cannot function for CPU/memory scaling.

# Check if metrics-server is running
kubectl get deployment metrics-server -n kube-system

# Install metrics-server (if missing)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify it's working
kubectl top nodes
kubectl top pods

💡

Metrics Server vs Prometheus

Metrics Server provides lightweight, real-time resource metrics for the HPA and kubectl top. It does not store history. Prometheus stores historical metrics and can feed custom/external metrics to the HPA via an adapter. For basic CPU/memory autoscaling, Metrics Server is all you need.

HPA YAML

hpa.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60    # target 60% of CPU request
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70    # target 70% of memory request
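
When several metrics are listed, as in the manifest above, the HPA evaluates each one separately and acts on the largest resulting replica count. The Python sketch below illustrates that rule with made-up utilisation readings; the function names are hypothetical, and tolerance and min/max clamping are omitted for brevity.

```python
import math


def desired_for_metric(current_replicas: int, current_util: float,
                       target_util: float) -> int:
    """Replica count one metric would ask for on its own."""
    return math.ceil(current_replicas * current_util / target_util)


def desired_across_metrics(current_replicas: int, observed: dict,
                           targets: dict) -> int:
    """The HPA takes the highest proposal across all configured metrics."""
    return max(
        desired_for_metric(current_replicas, observed[name], targets[name])
        for name in targets
    )


targets = {"cpu": 60, "memory": 70}    # from the manifest above
observed = {"cpu": 90, "memory": 63}   # made-up readings, % of request

# CPU alone asks for ceil(4 * 90/60) = 6, memory for ceil(4 * 63/70) = 4.
print(desired_across_metrics(4, observed, targets))   # 6
```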

⚠️

Use autoscaling/v2, not v1

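autoscaling/v1 is still served but only supports a CPU utilisation target. autoscaling/v2 (stable since Kubernetes 1.23) supports multiple metrics, memory, custom metrics, and the behavior block for controlling scale-up/down rates. The beta versions v2beta1 and v2beta2 are the ones that were deprecated and removed. Use v2 for new manifests.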

CPU & Memory Scaling

The HPA measures utilisation as a percentage of the container's resource request. This means resource requests must be set — without a CPU request, the HPA cannot calculate utilisation and will refuse to scale on that metric.

# HPA will NOT work if requests are missing:
containers:
- name: api
  image: myapp:1.0
  # ← no resources block = HPA cannot scale on CPU/memory

Memory scaling has a subtlety: unlike CPU, memory usage often does not drop when load decreases, because many runtimes hold on to memory they have already allocated. Adding replicas therefore may not lower per-pod memory utilisation: each existing pod keeps its footprint, the average stays above target, and the HPA can keep scaling up towards maxReplicas or oscillate around the threshold. Use conservative memory targets (70–80%) and generous stabilisation windows.

Scale Behavior

The behavior block gives fine-grained control over how fast the HPA scales in each direction.

hpa.yaml — controlled scale-down

spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # scale up immediately
      policies:
      - type: Percent
        value: 100                     # double replicas max per minute
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min of low usage before scaling down
      policies:
      - type: Pods
        value: 1                        # remove at most 1 pod per minute
        periodSeconds: 60

Parameter	Default	Effect
`stabilizationWindowSeconds` (scale down)	300s	HPA acts on the highest replica recommendation seen during this window, so it only scales down once usage has stayed low for the whole window. Prevents thrashing.
`stabilizationWindowSeconds` (scale up)	0s	HPA scales up immediately. Increase if you want to dampen temporary spikes.
`policies[].type: Pods`	—	Max N pods changed per period.
`policies[].type: Percent`	—	Max N% of current replicas changed per period.
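
The effect of these settings can be approximated in Python. This is a rough sketch with hypothetical function names: it treats each policy in isolation, assumes no other replica changes happened earlier in the same period, and ignores how multiple policies are combined.

```python
import math


def limit_scale_up(current: int, desired: int, percent: int) -> int:
    """Percent policy: add at most `percent` of current replicas per period."""
    ceiling = current + math.ceil(current * percent / 100)
    return min(desired, ceiling)


def limit_scale_down(current: int, desired: int, pods: int) -> int:
    """Pods policy: remove at most `pods` replicas per period."""
    floor = current - pods
    return max(desired, floor)


def stabilised_scale_down(recent_recommendations: list[int]) -> int:
    """Scale-down stabilisation: use the highest recommendation in the window."""
    return max(recent_recommendations)


# With the manifest above: scale-up may at most double, scale-down drops 1 pod.
print(limit_scale_up(3, 10, percent=100))    # 6
print(limit_scale_down(10, 4, pods=1))       # 9

# Recommendations of 4, 7 and 5 within the window: the HPA holds at 7.
print(stabilised_scale_down([4, 7, 5]))      # 7
```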

Custom & External Metrics

Beyond CPU and memory, the HPA can scale on custom metrics (application-emitted, e.g. requests-per-second from Prometheus) or external metrics (outside the cluster, e.g. queue depth from SQS). Both require a metrics adapter (e.g. prometheus-adapter or KEDA) that serves them through the Custom Metrics API (custom.metrics.k8s.io) or the External Metrics API (external.metrics.k8s.io) respectively.

Scale on requests-per-second (custom metric via prometheus-adapter)

metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"   # 100 req/s per pod

Why Your HPA Isn't Scaling

The most common reasons an HPA fails to scale:

Symptom	Cause	Fix
`unknown` for current metrics	Metrics Server not installed or not ready	Install metrics-server; check `kubectl top pods`
Metrics show but replicas don't change	Metric below threshold; still in stabilisation window	Check `kubectl describe hpa` — look at "Conditions"
HPA ignores CPU metric	Pod has no CPU request set	Add `resources.requests.cpu` to container spec
Replicas stuck at maxReplicas	Load exceeds what N pods can handle	Increase maxReplicas, or scale the nodes (Cluster Autoscaler)
Rapid scale-up/scale-down cycling	Stabilisation window too short or target too tight	Increase `behavior.scaleDown.stabilizationWindowSeconds`
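
The "Conditions" section can also be checked programmatically. The Python sketch below assumes you have loaded the output of `kubectl get hpa api-hpa -o json` into a dict; the sample object is hand-written for illustration, and the helper name is hypothetical. It relies on the three standard condition types (AbleToScale, ScalingActive, ScalingLimited).

```python
def unhealthy_conditions(hpa: dict) -> list[str]:
    """Return a short description of each HPA condition that needs attention."""
    problems = []
    for cond in hpa.get("status", {}).get("conditions", []):
        kind = cond.get("type")
        status = cond.get("status")
        # ScalingLimited is the odd one out: "True" means the HPA is being
        # held back by minReplicas/maxReplicas or a rate policy.
        expected = "False" if kind == "ScalingLimited" else "True"
        if status != expected:
            reason = cond.get("reason", "no reason given")
            problems.append(f"{kind}={status} ({reason})")
    return problems


# Hand-written sample resembling an HPA whose metrics cannot be fetched.
sample = {
    "status": {
        "conditions": [
            {"type": "AbleToScale", "status": "True", "reason": "ReadyForNewScale"},
            {"type": "ScalingActive", "status": "False", "reason": "FailedGetResourceMetric"},
            {"type": "ScalingLimited", "status": "False", "reason": "DesiredWithinRange"},
        ]
    }
}

print(unhealthy_conditions(sample))
# ['ScalingActive=False (FailedGetResourceMetric)']
```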

kubectl Commands

# Apply HPA from manifest
kubectl apply -f hpa.yaml

# Create HPA imperatively (the HPA is named after the target, here "api", unless you pass --name)
kubectl autoscale deployment api --min=2 --max=10 --cpu-percent=60

# Check HPA status
kubectl get hpa
kubectl describe hpa api-hpa

# Watch HPA decisions in real time
kubectl get hpa api-hpa -w

# Check current resource usage
kubectl top pods -l app=api