Node Pools & Node Management
Not all workloads belong on the same hardware. ML training jobs need GPUs. Latency-sensitive services need dedicated CPU. Batch jobs can tolerate spot interruptions. Node pools let you carve the cluster into segments with different machine types, taints, and autoscaling rules — and route workloads to the right segment via labels, taints, and affinity rules.
Node Pools
A node pool (called a node group on EKS, node pool on GKE/AKS) is a set of nodes with identical configuration: machine type, OS image, disk size, labels, and taints. Managed clusters manage node pool lifecycle — adding, draining, and replacing nodes.
| Pool | Machine type | Taint | Workloads |
|---|---|---|---|
| system | 2 CPU / 8 GB | CriticalAddonsOnly | CoreDNS, metrics-server, autoscaler |
| general | 4 CPU / 16 GB | — | Most application workloads |
| memory | 8 CPU / 64 GB | workload=memory-optimized | Redis, in-memory caches, JVM heaps |
| gpu | 8 CPU / 32 GB + GPU | nvidia.com/gpu=present | ML inference, GPU workloads |
| spot | 4 CPU / 16 GB (spot) | spot=true | Batch jobs, CI runners, non-critical workers |
Node Labels & Selectors
Labels are the low-friction way to steer workloads. Managed clusters auto-label nodes with cloud-provider metadata. Add custom labels at pool creation or manually.
# Label a node
kubectl label node node-1 workload-type=memory-optimized
# Remove a label
kubectl label node node-1 workload-type-
# Show node labels
kubectl get nodes --show-labels
# Select nodes with a label in a pod spec
spec:
  nodeSelector:
    workload-type: memory-optimized
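To make the matching rule concrete, here is a minimal Python sketch of how a nodeSelector is evaluated: every key/value pair in the selector must be present in the node's labels. The node names and label sets are made up for illustration, and this is a simplified model rather than the scheduler's actual code.

```python
def matches_node_selector(node_labels: dict, node_selector: dict) -> bool:
    """Return True if every selector pair is present on the node (logical AND)."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical nodes and their labels
nodes = {
    "node-1": {"workload-type": "memory-optimized",
               "topology.kubernetes.io/zone": "us-central1-a"},
    "node-2": {"workload-type": "general"},
}

selector = {"workload-type": "memory-optimized"}

# Only nodes carrying all selector labels are eligible for the pod
eligible = [name for name, labels in nodes.items()
            if matches_node_selector(labels, selector)]
print(eligible)
```

An empty selector matches every node, which is why pods without a nodeSelector can land anywhere that taints allow.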
Taints & Tolerations
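Labels are opt-in (pods choose nodes). Taints are opt-out: they repel pods unless the pod has a matching toleration. Use taints to dedicate nodes to specific workloads. A toleration only permits scheduling onto a tainted node; pair it with a nodeSelector or node affinity if the pods must actually land there.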
# Taint a node (NoSchedule — new pods without toleration won't land here)
kubectl taint nodes node-1 gpu=present:NoSchedule
# NoExecute — also evicts existing pods without toleration
kubectl taint nodes node-1 gpu=present:NoExecute
# Remove a taint
kubectl taint nodes node-1 gpu=present:NoSchedule-
# Toleration in pod spec
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
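The matching rule between a taint and a toleration can be sketched in a few lines of Python. This is a simplified model written for illustration (the class and function names are not a real client API): a toleration matches when the effect agrees (or is left empty), the key agrees (or is left empty), and either the operator is `Exists` or the values are equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" matches any taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    # "Equal": values must match; an empty key is only meaningful with "Exists"
    return bool(tol.key) and tol.value == taint.value


def can_schedule(taints, tolerations) -> bool:
    """A pod may be placed if every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


gpu_taint = Taint("gpu", "present", "NoSchedule")
gpu_toleration = Toleration(key="gpu", operator="Equal",
                            value="present", effect="NoSchedule")

print(can_schedule([gpu_taint], [gpu_toleration]))  # pod with the toleration
print(can_schedule([gpu_taint], []))                # pod without it
```

Note that the toleration above names the `NoSchedule` effect only, so it would not match the same taint applied with `NoExecute`.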
Node Affinity
Node affinity expresses scheduling preferences more precisely than nodeSelector. Use requiredDuringSchedulingIgnoredDuringExecution for hard rules and preferredDuringSchedulingIgnoredDuringExecution for soft preferences.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: Exists   # must have a GPU
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-central1-a"]   # prefer zone A, but not required
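The difference between the hard rule and the soft preference can be illustrated with a small Python model of the manifest above. It assumes three hypothetical nodes and covers only the `In`, `NotIn`, `Exists`, and `DoesNotExist` operators. The real scheduler normalizes these scores and combines them with other scoring plugins, so treat this as a sketch of the idea only.

```python
def expr_matches(labels: dict, expr: dict) -> bool:
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def term_matches(labels: dict, term: dict) -> bool:
    # All expressions inside one term must match (AND)
    return all(expr_matches(labels, e) for e in term["matchExpressions"])


def passes_required(labels: dict, terms: list) -> bool:
    # nodeSelectorTerms are alternatives (OR)
    return any(term_matches(labels, t) for t in terms)


def preference_score(labels: dict, preferred: list) -> int:
    # Sum the weights of every preference the node satisfies
    return sum(p["weight"] for p in preferred if term_matches(labels, p["preference"]))


REQUIRED_TERMS = [{"matchExpressions": [
    {"key": "cloud.google.com/gke-accelerator", "operator": "Exists"}]}]
PREFERRED = [{"weight": 80, "preference": {"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In",
     "values": ["us-central1-a"]}]}}]

# Hypothetical nodes
nodes = {
    "gpu-a": {"cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
              "topology.kubernetes.io/zone": "us-central1-a"},
    "gpu-b": {"cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
              "topology.kubernetes.io/zone": "us-central1-b"},
    "cpu-a": {"topology.kubernetes.io/zone": "us-central1-a"},
}

ranked = sorted(
    ((name, preference_score(labels, PREFERRED))
     for name, labels in nodes.items()
     if passes_required(labels, REQUIRED_TERMS)),
    key=lambda item: -item[1],
)
print(ranked)
```

The CPU-only node is filtered out by the required rule even though it sits in the preferred zone, while both GPU nodes remain candidates and the zone preference only changes their ranking.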
Cluster Autoscaler
Cluster Autoscaler (CA) watches for pods stuck in Pending because no node has enough free resources, and adds nodes to a pool so they can schedule. When a node is underutilized (the sum of its pods' CPU and memory requests stays below 50% of allocatable for 10 minutes by default), CA drains and removes it.
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production
  - --scale-down-delay-after-add=10m
  - --scale-down-unneeded-time=10m
  - --scale-down-utilization-threshold=0.5
  - --skip-nodes-with-local-storage=false
  - --expander=least-waste   # which node group to expand: least-waste, random, most-pods
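The scale-down rule can be sketched as follows. This is a simplified Python model, not CA's implementation: it treats utilization as the larger of the CPU and memory request ratios and applies the threshold and unneeded-time values from the flags above. The node figures are hypothetical, and the real autoscaler additionally checks that the evicted pods can be rescheduled elsewhere, among other conditions.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    name: str
    cpu_requested: float     # cores requested by pods on the node
    cpu_allocatable: float
    mem_requested: float     # GiB requested by pods on the node
    mem_allocatable: float
    unneeded_minutes: float  # how long the node has been under the threshold


def utilization(node: NodeUsage) -> float:
    """Request-based utilization: the larger of the CPU and memory ratios."""
    return max(node.cpu_requested / node.cpu_allocatable,
               node.mem_requested / node.mem_allocatable)


def scale_down_candidates(nodes, threshold=0.5, unneeded_time_min=10):
    return [n.name for n in nodes
            if utilization(n) < threshold and n.unneeded_minutes >= unneeded_time_min]


# Hypothetical pool of 4 CPU / 16 GiB nodes
pool = [
    NodeUsage("node-a", 1.0, 4.0, 4.0, 16.0, unneeded_minutes=12),   # idle long enough
    NodeUsage("node-b", 1.0, 4.0, 12.0, 16.0, unneeded_minutes=30),  # memory at 75%
    NodeUsage("node-c", 0.8, 4.0, 2.0, 16.0, unneeded_minutes=3),    # idle, but too recently
]

candidates = scale_down_candidates(pool)
print(candidates)
```

Because the calculation uses requests rather than live usage, a node full of over-provisioned pods looks busy to CA even when `kubectl top nodes` shows it idle.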
HPA scales the number of pods within the current node capacity. CA scales the number of nodes when pod demand exceeds node capacity. They work together: HPA creates pods → CA adds nodes to fit them.
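A rough Python sketch of that hand-off, under stated assumptions: the replica calculation uses the standard HPA ratio formula, while `pods_per_node` is a hypothetical simplification of capacity (CA actually simulates scheduling against each node group rather than dividing by a constant).

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """HPA ratio rule: scale replicas in proportion to metric / target."""
    return math.ceil(current_replicas * current_metric / target_metric)


def nodes_to_add(desired_pods: int, pods_per_node: int, current_nodes: int) -> int:
    """Simplified capacity check: how many extra nodes are needed to fit the pods."""
    needed = math.ceil(desired_pods / pods_per_node)
    return max(0, needed - current_nodes)


# 6 replicas averaging 90% CPU against a 60% target
desired = hpa_desired_replicas(current_replicas=6, current_metric=90, target_metric=60)

# Assume each node fits 4 of these pods and the pool currently has 2 nodes
extra_nodes = nodes_to_add(desired_pods=desired, pods_per_node=4, current_nodes=2)

print(desired, extra_nodes)
```

In this example HPA asks for 9 pods, 8 fit on the existing nodes, one stays Pending, and that Pending pod is what triggers CA to add a node.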
Spot & Preemptible Nodes
Spot instances (AWS) and Spot or preemptible VMs (GCP) are typically 60–90% cheaper than on-demand but can be reclaimed at short notice: about 2 minutes on AWS and about 30 seconds on GCP. Strategy: run fault-tolerant batch workloads on spot, stateless replicated services with >2 replicas on a spot + on-demand mix, and stateful services on on-demand only.
# Managed node pools auto-label spot nodes:
#   eks.amazonaws.com/capacityType=SPOT   (EKS managed node groups)
#   cloud.google.com/gke-spot=true        (GKE)
# Add taint so non-spot-aware workloads don't land on spot
kubectl taint node spot-node-1 spot=true:NoSchedule
# Batch job that tolerates spot interruption
spec:
  tolerations:
  - key: spot
    operator: Equal
    value: "true"
    effect: NoSchedule
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: eks.amazonaws.com/capacityType   # on GKE: cloud.google.com/gke-spot
            operator: In
            values: ["SPOT"]                       # on GKE: ["true"]
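The placement strategy described at the start of this section can be written as a small decision function. This is only an illustrative Python sketch: the function name and return values are hypothetical, and the fallback for stateless services with two or fewer replicas (on-demand) is an assumption, since the strategy above does not cover that case.

```python
def capacity_for(stateful: bool, fault_tolerant_batch: bool, replicas: int) -> str:
    """Map a workload profile to the capacity type it should run on."""
    if stateful:
        return "on-demand"           # stateful services: on-demand only
    if fault_tolerant_batch:
        return "spot"                # batch jobs tolerate interruption
    if replicas > 2:
        return "spot+on-demand"      # replicated stateless services: mixed
    return "on-demand"               # assumption: too few replicas to risk spot


print(capacity_for(stateful=False, fault_tolerant_batch=True, replicas=1))
print(capacity_for(stateful=False, fault_tolerant_batch=False, replicas=5))
print(capacity_for(stateful=True, fault_tolerant_batch=False, replicas=3))
```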
Node Problem Detector
Node Problem Detector (NPD) runs as a DaemonSet and surfaces kernel errors, disk failures, OOM events, and network issues as Node Conditions and Events. Without NPD, nodes fail silently — pods get evicted but you don't know why.
# Install via Helm
helm repo add deliveryhero https://charts.deliveryhero.io/
helm install node-problem-detector deliveryhero/node-problem-detector \
  -n kube-system
# After install — check node conditions added by NPD
kubectl describe node node-1 | grep -A5 "Conditions:"
# FrequentKubeletRestart False ...
# KernelDeadlock False ...
# ReadonlyFilesystem False ...
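Once NPD conditions exist, they can be checked programmatically. The sketch below is a minimal Python example that scans a node object, in the shape returned by `kubectl get node node-1 -o json`, for unhealthy conditions. The embedded sample data is invented for illustration. It relies on the convention that `Ready` should be `True` while problem conditions such as `KernelDeadlock` should be `False`.

```python
import json

# Hypothetical, trimmed-down node object for illustration
SAMPLE_NODE_JSON = """
{
  "metadata": {"name": "node-1"},
  "status": {
    "conditions": [
      {"type": "KernelDeadlock",     "status": "False"},
      {"type": "ReadonlyFilesystem", "status": "True"},
      {"type": "MemoryPressure",     "status": "False"},
      {"type": "Ready",              "status": "True"}
    ]
  }
}
"""


def problem_conditions(node: dict) -> list:
    """Return the condition types that indicate a problem on the node."""
    problems = []
    for cond in node["status"]["conditions"]:
        if cond["type"] == "Ready":
            # Ready is healthy only when True
            if cond["status"] != "True":
                problems.append(cond["type"])
        elif cond["status"] == "True":
            # Every other condition signals a problem when True
            problems.append(cond["type"])
    return problems


node = json.loads(SAMPLE_NODE_JSON)
print(node["metadata"]["name"], problem_conditions(node))
```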
kubectl Commands
# List all nodes with labels and status
kubectl get nodes --show-labels
# Describe a node — capacity, allocatable, conditions, pods
kubectl describe node node-1
# List pods on a specific node
kubectl get pods -A --field-selector=spec.nodeName=node-1
# Show current CPU and memory usage per node (requires metrics-server)
kubectl top nodes
# List all taints across all nodes
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.spec.taints // [] | map("\(.key)=\(.value // ""):\(.effect)") | join(", "))"'
# Cordon a node (stop new scheduling)
kubectl cordon node-1
# Uncordon (re-enable scheduling)
kubectl uncordon node-1