Horizontal Pod Autoscaler (HPA)

The Horizontal Pod Autoscaler automatically adjusts the number of pod replicas in a Deployment (or StatefulSet, ReplicaSet) based on observed metrics — CPU utilisation, memory usage, or custom application metrics. When traffic spikes, the HPA adds replicas. When it drops, it scales back down. It does this in a continuous control loop, targeting a configured utilisation percentage.

What Is the HPA?

"Horizontal" scaling means adding more pods (vs. "vertical" scaling, which means giving each pod more CPU/memory). The HPA reads metrics from the Metrics API, computes the desired replica count, and updates the target resource's spec.replicas. The Deployment controller then does the actual scaling.

With a Utilization target, the HPA aims for a percentage of the requested resources, not an absolute value. If your pod requests 500m CPU and you set a target of 50%, the HPA tries to keep average CPU usage per pod at 250m by scaling in or out.
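
To make the arithmetic concrete, here is a minimal Python sketch of how a utilisation target relates to the request. The function names are hypothetical, and the sketch assumes every pod has the same CPU request.

```python
def target_usage_millicores(request_millicores, target_utilisation_pct):
    """Absolute per-pod usage that corresponds to a utilisation target."""
    return request_millicores * target_utilisation_pct / 100


def average_utilisation_pct(usages_millicores, request_millicores):
    """Average usage across pods, as a percentage of the per-pod request."""
    total_requested = len(usages_millicores) * request_millicores
    return 100 * sum(usages_millicores) / total_requested


print(target_usage_millicores(500, 50))                # 250.0
print(average_utilisation_pct([300, 200, 400], 500))   # 60.0
```

In the second call the three pods average 300m against a 500m request, so the HPA would see 60% and scale out if the target were 50%.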

How It Works

The HPA controller runs a reconciliation loop every 15 seconds by default (configurable with the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period). Each iteration:

  1. Query the Metrics API for current metric values across all pods.
  2. Calculate the desired replica count: desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))
  3. Clamp the result to the range [minReplicas, maxReplicas].
  4. Apply scale-up/scale-down stabilisation windows to prevent flapping.
  5. Update the target resource's spec.replicas if the value changed.

HPA scale-up example (target 50% CPU)
  current replicas: 3
  avg CPU per pod: 80% of request
  target CPU: 50% of request
  formula: ceil(3 × 80/50) = ceil(4.8) = 5
  result: scale up to 5 replicas
HPA formula: desired = ceil(currentReplicas × currentMetric / targetMetric)
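
The formula and the clamping step above can be sketched in a few lines of Python. This is an illustration of steps 2 and 3 only, not the controller's actual code: the function name is hypothetical, and the stabilisation window and the controller's other safeguards are left out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Apply the HPA formula, then clamp to [min_replicas, max_replicas]."""
    # Multiply before dividing so exact integer results are not
    # nudged upward by floating-point rounding before ceil().
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# The worked example: 3 replicas at 80% against a 50% target.
print(desired_replicas(3, 80, 50, min_replicas=2, max_replicas=10))  # 5
```

With the same bounds, a quiet period (3 replicas at 10%) yields a raw value of 1, which is clamped up to the minimum of 2.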

Prerequisites

The HPA reads resource metrics (CPU, memory) from the Metrics Server, a cluster add-on that aggregates resource usage from kubelet. Without it, the HPA cannot function for CPU/memory scaling.

# Check if metrics-server is running
kubectl get deployment metrics-server -n kube-system

# Install metrics-server (if missing)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify it's working
kubectl top nodes
kubectl top pods

💡 Metrics Server vs Prometheus

Metrics Server provides lightweight, real-time resource metrics for the HPA and kubectl top. It does not store history. Prometheus stores historical metrics and can feed custom/external metrics to the HPA via an adapter. For basic CPU/memory autoscaling, Metrics Server is all you need.

HPA YAML

hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60    # target 60% of CPU request
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70    # target 70% of memory request
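
When a manifest lists several metrics, as this one does, the HPA evaluates each metric separately and uses the largest proposed replica count. A minimal Python sketch of that rule follows; the function name and the observed utilisation figures are made up for illustration, and only the 60% and 70% targets come from the manifest.

```python
import math


def desired_across_metrics(current_replicas, metrics):
    """metrics maps a name to (current_utilisation_pct, target_utilisation_pct)."""
    proposals = {
        name: math.ceil(current_replicas * current / target)
        for name, (current, target) in metrics.items()
    }
    # The largest proposal wins, so the busiest metric drives scaling.
    return max(proposals.values()), proposals


desired, proposals = desired_across_metrics(
    4, {"cpu": (90, 60), "memory": (35, 70)}
)
print(proposals)  # {'cpu': 6, 'memory': 2}
print(desired)    # 6
```

Here CPU is over its target while memory is well under, so CPU decides the outcome and the Deployment grows to 6 replicas.
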

⚠️ Use autoscaling/v2, not v1

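autoscaling/v1 is still served, but it only supports a CPU utilisation target. autoscaling/v2 (stable since Kubernetes 1.23) supports multiple metrics, memory, custom metrics, and the behavior block for controlling scale-up/down rates. The older autoscaling/v2beta1 and autoscaling/v2beta2 versions have been removed from current Kubernetes releases, so use v2 for new manifests.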

CPU & Memory Scaling

The HPA measures utilisation as a percentage of the container's resource request. This means resource requests must be set on every container in the pod. Without a CPU request, the HPA cannot calculate utilisation and will not scale on that metric.

# HPA will NOT work if requests are missing:
containers:
- name: api
  image: myapp:1.0
  # ← no resources block = HPA cannot scale on CPU/memory
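
A quick way to catch this before applying an HPA is to scan the pod template for containers without a request. The Python sketch below works on a pod spec that has already been loaded into a dict (for example from YAML or JSON); the helper name and the sample spec are hypothetical.

```python
def containers_missing_request(pod_spec, resource="cpu"):
    """Return the names of containers that set no request for `resource`."""
    missing = []
    for container in pod_spec.get("containers", []):
        requests = (container.get("resources") or {}).get("requests") or {}
        if resource not in requests:
            missing.append(container["name"])
    return missing


sample_spec = {
    "containers": [
        {"name": "api", "image": "myapp:1.0"},  # no resources block
        {"name": "sidecar", "image": "proxy:1.0",
         "resources": {"requests": {"cpu": "100m", "memory": "64Mi"}}},
    ]
}
print(containers_missing_request(sample_spec))            # ['api']
print(containers_missing_request(sample_spec, "memory"))  # ['api']
```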

Memory scaling has a subtlety: unlike CPU, memory usage often does not drop when load decreases, because many runtimes and caches hold on to memory they have already allocated. Adding replicas also does not necessarily lower per-pod usage, since every new pod carries its own baseline footprint. The HPA can therefore scale up on memory and stay scaled up, or oscillate when usage hovers near the target. Use conservative memory targets (70–80%) and generous stabilisation windows.
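
The effect is easy to see with a toy model. The Python sketch below assumes each pod uses a fixed baseline amount of memory plus an equal share of a load-dependent amount; the model and every number in it are invented for illustration, not measured from a real workload.

```python
def avg_memory_utilisation_pct(replicas, baseline_mib, load_mib, request_mib):
    """Per-pod memory as a percentage of the request, under the toy model."""
    per_pod_mib = baseline_mib + load_mib / replicas
    return 100 * per_pod_mib / request_mib


# Baseline 300Mi per pod, 400Mi of load-dependent memory, 400Mi request.
for replicas in (2, 5, 10):
    print(replicas, avg_memory_utilisation_pct(replicas, 300, 400, 400))
# 2 125.0
# 5 95.0
# 10 85.0
```

Because the baseline alone is 75% of the request, a 70% target can never be reached in this model, and the HPA would climb to maxReplicas and stay there. Raising the memory request or the target is the fix, not adding pods.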

Scale Behavior

The behavior block gives fine-grained control over how fast the HPA scales in each direction.

hpa.yaml — controlled scale-down
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # scale up immediately
      policies:
      - type: Percent
        value: 100                     # double replicas max per minute
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min of low usage before scaling down
      policies:
      - type: Pods
        value: 1                        # remove at most 1 pod per minute
        periodSeconds: 60

Parameter | Default | Effect
stabilizationWindowSeconds (scale down) | 300s | HPA won't scale down until metric has been below threshold for this long. Prevents thrashing.
stabilizationWindowSeconds (scale up) | 0s | HPA scales up immediately. Increase if you want to dampen temporary spikes.
policies[].type: Pods | - | Max N pods changed per period.
policies[].type: Percent | - | Max N% of current replicas changed per period.
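
The following Python sketch shows how a single policy caps one scaling step and how the scale-down stabilisation window picks a value. It is a simplification under stated assumptions: the function names are hypothetical, it considers one policy at a time, it assumes no replicas have already been added or removed in the current period, and the rounding direction for Percent is an assumption of this sketch.

```python
import math


def limit_scale_up(current, desired, policy_type, value):
    """Cap a scale-up at what one Pods or Percent policy allows per period."""
    if policy_type == "Pods":
        ceiling = current + value
    elif policy_type == "Percent":
        ceiling = current + math.ceil(current * value / 100)
    else:
        raise ValueError(f"unknown policy type: {policy_type}")
    return min(desired, ceiling)


def limit_scale_down(current, desired, policy_type, value):
    """Cap a scale-down at what one Pods or Percent policy allows per period."""
    if policy_type == "Pods":
        floor = current - value
    elif policy_type == "Percent":
        floor = math.floor(current * (100 - value) / 100)
    else:
        raise ValueError(f"unknown policy type: {policy_type}")
    return max(desired, floor)


def stabilised_scale_down(recommendations_in_window):
    """Scale-down uses the highest recommendation seen inside the window."""
    return max(recommendations_in_window)


print(limit_scale_up(4, 12, "Percent", 100))  # 8
print(limit_scale_down(8, 3, "Pods", 1))      # 7
print(stabilised_scale_down([6, 4, 3]))       # 6
```

With the manifest above, a Deployment at 4 replicas can at most double to 8 in one minute, while a Deployment at 8 replicas loses at most one pod per minute, and only after the recommendation has stayed low for the whole 300-second window.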

Custom & External Metrics

Beyond CPU and memory, the HPA can scale on custom metrics (application-emitted, e.g. requests-per-second from Prometheus) or external metrics (outside the cluster, e.g. queue depth from SQS). Both require a metrics adapter (e.g. prometheus-adapter or KEDA) that exposes them through the Custom Metrics API (custom.metrics.k8s.io) or the External Metrics API (external.metrics.k8s.io).

Scale on requests-per-second (custom metric via prometheus-adapter)
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"   # 100 req/s per pod
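
With an AverageValue target there is no request to compare against: the HPA divides the metric total across pods by the target. A minimal Python sketch follows, with a hypothetical function name and made-up per-pod readings.

```python
import math


def desired_for_average_value(per_pod_values, target_average):
    """Replicas needed so the per-pod average falls to the target."""
    return math.ceil(sum(per_pod_values) / target_average)


# Four pods serving about 180 req/s each, target 100 req/s per pod.
print(desired_for_average_value([180, 170, 190, 180], 100))  # 8
```

The total of 720 req/s needs 7.2 pods at 100 req/s each, which rounds up to 8. This is the same formula as for CPU, written in terms of the total instead of the average.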

Why Your HPA Isn't Scaling

The most common reasons an HPA fails to scale:

Symptom | Cause | Fix
unknown for current metrics | Metrics Server not installed or not ready | Install metrics-server; check kubectl top pods
Metrics show but replicas don't change | Metric below threshold; still in stabilisation window | Check kubectl describe hpa and look at "Conditions"
HPA ignores CPU metric | Pod has no CPU request set | Add resources.requests.cpu to container spec
Replicas stuck at maxReplicas | Load exceeds what N pods can handle | Increase maxReplicas, or scale the nodes (Cluster Autoscaler)
Rapid scale-up/scale-down cycling | Stabilisation window too short or target too tight | Increase behavior.scaleDown.stabilizationWindowSeconds
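
The "Conditions" shown by kubectl describe hpa are also available in the object's status.conditions list, so the check can be scripted. The Python sketch below takes an HPA that has already been loaded as a dict (for example from kubectl get hpa api-hpa -o json) and lists the conditions worth a look. The helper name and the hand-written sample object are illustrative, not real cluster output.

```python
# Status each condition type has when the HPA is working and unconstrained.
HEALTHY_STATUS = {
    "AbleToScale": "True",
    "ScalingActive": "True",
    "ScalingLimited": "False",
}


def conditions_to_check(hpa):
    """Return a summary of every condition that differs from the healthy status."""
    findings = []
    for cond in hpa.get("status", {}).get("conditions", []):
        healthy = HEALTHY_STATUS.get(cond.get("type"))
        if healthy is not None and cond.get("status") != healthy:
            reason = cond.get("reason", "no reason given")
            findings.append(f"{cond['type']}={cond['status']} ({reason})")
    return findings


sample_hpa = {
    "status": {
        "conditions": [
            {"type": "AbleToScale", "status": "True",
             "reason": "SucceededGetScale"},
            {"type": "ScalingActive", "status": "False",
             "reason": "FailedGetResourceMetric"},
        ]
    }
}
print(conditions_to_check(sample_hpa))
# ['ScalingActive=False (FailedGetResourceMetric)']
```

ScalingLimited=True is not an error by itself. It means the result was capped by minReplicas or maxReplicas, which matches the "stuck at maxReplicas" row above.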

kubectl Commands

# Apply HPA from manifest
kubectl apply -f hpa.yaml

# Create HPA imperatively (the HPA is named after the Deployment, here "api")
kubectl autoscale deployment api --min=2 --max=10 --cpu-percent=60

# Check HPA status
kubectl get hpa
kubectl describe hpa api-hpa

# Watch HPA decisions in real time
kubectl get hpa api-hpa -w

# Check current resource usage
kubectl top pods -l app=api