Node Pools & Node Management

Not all workloads belong on the same hardware. ML training jobs need GPUs. Latency-sensitive services need dedicated CPU. Batch jobs can tolerate spot interruptions. Node pools let you carve the cluster into segments with different machine types, taints, and autoscaling rules — and route workloads to the right segment via labels, taints, and affinity rules.

Node Pools

A node pool (called a node group on EKS and a node pool on GKE and AKS) is a set of nodes with identical configuration: machine type, OS image, disk size, labels, and taints. On managed clusters, the provider handles the pool lifecycle: adding, draining, and replacing nodes.

Pool	Machine type	Taint	Workloads
system	2 CPU / 8 GB	CriticalAddonsOnly	CoreDNS, metrics-server, autoscaler
general	4 CPU / 16 GB	—	Most application workloads
memory	8 CPU / 64 GB	workload=memory-optimized	Redis, in-memory caches, JVM heaps
gpu	8 CPU / 32 GB + GPU	nvidia.com/gpu=present	ML inference, GPU workloads
spot	4 CPU / 16 GB (spot)	spot=true	Batch jobs, CI runners, non-critical workers

Node Labels & Selectors

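Labels are the low-friction way to steer workloads. Managed clusters auto-label nodes with cloud-provider metadata. Add custom labels at pool creation or manually. Labels applied manually with kubectl are lost when the node is replaced, so set anything that must persist in the pool configuration.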

# Label a node
kubectl label node node-1 workload-type=memory-optimized

# Remove a label
kubectl label node node-1 workload-type-

# Show node labels
kubectl get nodes --show-labels

# Select nodes with a label in a pod spec
spec:
  nodeSelector:
    workload-type: memory-optimized
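
The matching rule behind nodeSelector is simple enough to sketch in a few lines of Python. This is an illustration of the rule, not scheduler code: the function name and the two example nodes are hypothetical. A node qualifies only when every key/value pair in the selector is present on the node.

```python
def matches_node_selector(node_labels: dict[str, str], node_selector: dict[str, str]) -> bool:
    """Return True when every selector key/value pair is present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical nodes, for illustration only.
nodes = {
    "node-1": {"workload-type": "memory-optimized", "kubernetes.io/os": "linux"},
    "node-2": {"workload-type": "general", "kubernetes.io/os": "linux"},
}
selector = {"workload-type": "memory-optimized"}

eligible = [name for name, labels in nodes.items() if matches_node_selector(labels, selector)]
print(eligible)  # ['node-1']
```

Extra labels on the node do not matter; a missing or different value for any selector key rules the node out.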

Taints & Tolerations

Labels are opt-in (pods choose nodes). Taints are opt-out — they repel pods unless the pod has a matching toleration. Use taints to dedicate nodes exclusively to specific workloads.

Taints repel pods; a toleration is the pass

GPU node (taint: gpu=present:NoSchedule)
✅ ML pod (has toleration)
❌ API pod (no toleration): repelled
❌ DB pod (no toleration): repelled

General node (no taint, so any pod can land here)
✅ API pod
✅ DB pod
⚠️ ML pod also lands here unless nodeSelector/affinity is added

Exclusive dedication: taint the node (repel others) and add nodeAffinity or nodeSelector to the intended workload (pull it to the right pool). A toleration alone does not guarantee placement.

taint a node and add a toleration to a pod

# Taint a node (NoSchedule — new pods without toleration won't land here)
kubectl taint nodes node-1 gpu=present:NoSchedule

# NoExecute — also evicts existing pods without toleration
kubectl taint nodes node-1 gpu=present:NoExecute

# Remove a taint
kubectl taint nodes node-1 gpu=present:NoSchedule-

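# Toleration in pod spec (matches the NoSchedule taint above; a NoExecute
# taint needs its own toleration, or a toleration with no effect set)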
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
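
How a toleration is matched against a taint can be sketched in Python. This is a simplified model, not the scheduler's code; the class and function names are hypothetical. It assumes the documented rules: the effect must match unless the toleration leaves it empty, the key must match unless it is empty with operator Exists, and Equal additionally compares values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str          # "NoSchedule", "PreferNoSchedule", or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""        # empty key with operator "Exists" matches every taint key
    operator: str = "Equal"
    value: str = ""
    effect: str = ""     # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True when this toleration matches this taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    # "Equal" needs a key and additionally compares values.
    return toleration.operator == "Equal" and bool(toleration.key) and toleration.value == taint.value


def blocks_scheduling(taints: list[Taint], tolerations: list[Toleration]) -> bool:
    """True if any NoSchedule/NoExecute taint is left untolerated."""
    return any(
        taint.effect in ("NoSchedule", "NoExecute")
        and not any(tolerates(t, taint) for t in tolerations)
        for taint in taints
    )


ml_tolerations = [Toleration(key="gpu", operator="Equal", value="present", effect="NoSchedule")]
no_schedule = Taint(key="gpu", value="present", effect="NoSchedule")
no_execute = Taint(key="gpu", value="present", effect="NoExecute")

print(blocks_scheduling([no_schedule], ml_tolerations))  # False: ML pod may schedule
print(blocks_scheduling([no_schedule], []))              # True: API pod is repelled
print(blocks_scheduling([no_execute], ml_tolerations))   # True: effect does not match
```

The last case is the one that tends to surprise people: a toleration written for NoSchedule does not cover a NoExecute taint with the same key and value.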

Node Affinity

Node affinity expresses scheduling preferences more precisely than nodeSelector. Use requiredDuringSchedulingIgnoredDuringExecution for hard rules and preferredDuringSchedulingIgnoredDuringExecution for soft preferences.

nodeAffinity — require GPU nodes, prefer zone A

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: Exists               # must have a GPU
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-central1-a"]     # prefer zone A, but not required
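
The difference between the required and preferred blocks can be sketched as a filter followed by a score. The Python below is a simplified model with hypothetical node names and helper functions. It supports only the In and Exists operators, and the real scheduler combines this score with other scoring plugins, so treat the ranking as illustrative.

```python
def expr_matches(labels: dict[str, str], expr: dict) -> bool:
    """Evaluate one matchExpressions entry (only In and Exists are modeled)."""
    key, op = expr["key"], expr["operator"]
    if op == "Exists":
        return key in labels
    if op == "In":
        return labels.get(key) in expr["values"]
    raise ValueError(f"operator not modeled in this sketch: {op}")


def is_eligible(labels: dict[str, str], required_terms: list[dict]) -> bool:
    """nodeSelectorTerms are ORed; expressions inside one term are ANDed."""
    return any(
        all(expr_matches(labels, e) for e in term["matchExpressions"])
        for term in required_terms
    )


def preference_score(labels: dict[str, str], preferred: list[dict]) -> int:
    """Sum the weights of every preferred term the node satisfies."""
    return sum(
        p["weight"]
        for p in preferred
        if all(expr_matches(labels, e) for e in p["preference"]["matchExpressions"])
    )


required = [{"matchExpressions": [
    {"key": "cloud.google.com/gke-accelerator", "operator": "Exists"}]}]
preferred = [{"weight": 80, "preference": {"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-central1-a"]}]}}]

# Hypothetical nodes, for illustration only.
nodes = {
    "gpu-a": {"cloud.google.com/gke-accelerator": "nvidia-l4", "topology.kubernetes.io/zone": "us-central1-a"},
    "gpu-b": {"cloud.google.com/gke-accelerator": "nvidia-l4", "topology.kubernetes.io/zone": "us-central1-b"},
    "cpu-a": {"topology.kubernetes.io/zone": "us-central1-a"},
}

ranked = sorted(
    ((name, preference_score(labels, preferred))
     for name, labels in nodes.items() if is_eligible(labels, required)),
    key=lambda item: -item[1],
)
print(ranked)  # [('gpu-a', 80), ('gpu-b', 0)]
```

The CPU-only node is filtered out by the hard rule even though it sits in the preferred zone; the GPU node in zone B stays eligible but ranks lower.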

Cluster Autoscaler

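Cluster Autoscaler (CA) watches for pods stuck in Pending because no node has enough free resources. When it finds them, it adds nodes to a pool that can fit them. When the CPU and memory requested on a node stay below 50% of its allocatable capacity for 10 minutes (the defaults), and its pods can be rescheduled elsewhere, CA drains and removes the node. Utilization here is based on pod requests, not live usage.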

cluster-autoscaler deployment (AWS EKS example)

containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production
  - --scale-down-delay-after-add=10m
  - --scale-down-unneeded-time=10m
  - --scale-down-utilization-threshold=0.5
  - --skip-nodes-with-local-storage=false
  - --expander=least-waste          # which node group to expand: least-waste, random, most-pods
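
The two scale-down flags above can be read as a simple eligibility check. The Python below is a rough sketch under stated assumptions: utilization is taken as the larger of the CPU and memory request ratios against allocatable capacity, and the function names are hypothetical. The real autoscaler also checks that the node's pods can move elsewhere, honors PodDisruptionBudgets, and more.

```python
def node_utilization(cpu_requested_m: int, cpu_allocatable_m: int,
                     mem_requested_mi: int, mem_allocatable_mi: int) -> float:
    """Assumed definition: the larger of the CPU and memory request ratios."""
    return max(cpu_requested_m / cpu_allocatable_m,
               mem_requested_mi / mem_allocatable_mi)


def is_scale_down_candidate(utilization: float, unneeded_minutes: float,
                            threshold: float = 0.5,
                            unneeded_time_minutes: float = 10) -> bool:
    """Mirrors --scale-down-utilization-threshold and --scale-down-unneeded-time."""
    return utilization < threshold and unneeded_minutes >= unneeded_time_minutes


# Hypothetical node: 4000m CPU and 16000Mi memory allocatable.
busy = node_utilization(1200, 4000, 9000, 16000)   # memory-bound: 0.5625
idle = node_utilization(1200, 4000, 4000, 16000)   # CPU-bound: 0.3

print(is_scale_down_candidate(busy, unneeded_minutes=12))  # False: above threshold
print(is_scale_down_candidate(idle, unneeded_minutes=12))  # True
print(is_scale_down_candidate(idle, unneeded_minutes=4))   # False: not idle long enough
```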

CA vs HPA — different dimensions

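HPA changes the number of pod replicas based on metrics such as CPU utilization; it knows nothing about node capacity. CA changes the number of nodes when pending pods do not fit on the existing ones. They work together: HPA creates pods, the ones that do not fit stay Pending, and CA adds nodes to fit them.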
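
A worked example makes the handoff concrete. The sketch below uses the documented HPA formula, desired = ceil(current × currentMetric / targetMetric), and then a deliberately simplified node count that looks only at CPU requests. The numbers and the nodes_needed helper are hypothetical; memory, DaemonSet overhead, and other pods are ignored.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """HPA formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def nodes_needed(replicas: int, cpu_request_m: int, node_allocatable_m: int) -> int:
    """Simplified bin-packing on CPU requests only (an assumption of this sketch)."""
    pods_per_node = node_allocatable_m // cpu_request_m
    return math.ceil(replicas / pods_per_node)


# Hypothetical workload: 6 replicas at 90% CPU against a 60% target,
# each pod requesting 1000m on nodes with 3500m allocatable.
desired = hpa_desired_replicas(current_replicas=6, current_metric=90, target_metric=60)
before = nodes_needed(6, cpu_request_m=1000, node_allocatable_m=3500)
after = nodes_needed(desired, cpu_request_m=1000, node_allocatable_m=3500)

print(desired)         # 9 replicas requested by HPA
print(after - before)  # 1 extra node for CA to add
```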

Spot & Preemptible Nodes

Spot instances (AWS) and Spot or preemptible VMs (GCP) are typically 60–90% cheaper than on-demand but can be reclaimed at short notice: about two minutes on AWS and about 30 seconds on GCP. Strategy: run fault-tolerant batch workloads on spot, stateless replicated services with more than 2 replicas on a spot and on-demand mix, and stateful services on on-demand only.
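
The placement strategy can be written down as a small decision function. This is only a sketch of the rule of thumb above; the function name and workload categories are hypothetical, and sending stateless services with two or fewer replicas to on-demand is an assumption the text does not spell out.

```python
def capacity_plan(kind: str, replicas: int = 1) -> str:
    """Map a workload to a capacity type following the rule of thumb above."""
    if kind == "batch":
        return "spot"
    if kind == "stateless" and replicas > 2:
        return "spot+on-demand"
    # Stateful services, plus small stateless ones (assumption), stay on on-demand.
    return "on-demand"


print(capacity_plan("batch"))                  # spot
print(capacity_plan("stateless", replicas=5))  # spot+on-demand
print(capacity_plan("stateless", replicas=2))  # on-demand
print(capacity_plan("stateful", replicas=3))   # on-demand
```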

taint spot nodes and tolerate in batch jobs

# Managed node pools auto-label spot nodes:
# eks.amazonaws.com/capacityType=SPOT (EKS managed node groups)
# cloud.google.com/gke-spot=true (GKE)

# Add taint so non-spot-aware workloads don't land on spot
kubectl taint node spot-node-1 spot=true:NoSchedule

# Batch job that tolerates spot interruption
spec:
  tolerations:
  - key: spot
    operator: Equal
    value: "true"
    effect: NoSchedule
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: eks.amazonaws.com/capacityType
            operator: In
            values: ["SPOT"]

Node Problem Detector

Node Problem Detector (NPD) runs as a DaemonSet and surfaces kernel errors, disk failures, OOM events, and network issues as Node Conditions and Events. Without NPD, nodes fail silently — pods get evicted but you don't know why.

# Install via Helm
helm repo add deliveryhero https://charts.deliveryhero.io/
helm install node-problem-detector deliveryhero/node-problem-detector \
  -n kube-system

# After install — check node conditions added by NPD
kubectl describe node node-1 | grep -A12 "Conditions:"
# FrequentKubeletRestart    False   ...
# KernelDeadlock            False   ...
# ReadonlyFilesystem        False   ...
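
To check conditions across many nodes, the same information is available in the node's JSON. The Python below is a minimal sketch with a hypothetical helper name: it treats Ready as healthy when True and every other condition, including those added by NPD, as a problem when True.

```python
def problem_conditions(node: dict) -> list[str]:
    """Return the condition types that indicate trouble on one node object."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        if cond["type"] == "Ready":
            # Ready is the one condition where "True" means healthy.
            if cond["status"] != "True":
                problems.append("Ready=" + cond["status"])
        elif cond["status"] == "True":
            problems.append(cond["type"])
    return problems


# Hypothetical node object, trimmed to the fields this sketch reads.
node = {
    "metadata": {"name": "node-1"},
    "status": {"conditions": [
        {"type": "KernelDeadlock", "status": "False"},
        {"type": "ReadonlyFilesystem", "status": "True"},
        {"type": "MemoryPressure", "status": "False"},
        {"type": "Ready", "status": "True"},
    ]},
}

print(problem_conditions(node))  # ['ReadonlyFilesystem']
```

In practice the input would come from parsing the output of kubectl get node node-1 -o json with json.loads.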

kubectl Commands

# List all nodes with labels and status
kubectl get nodes --show-labels

# Describe a node — capacity, allocatable, conditions, pods
kubectl describe node node-1

# List pods on a specific node
kubectl get pods -A --field-selector=spec.nodeName=node-1

# Show current CPU/memory usage per node (requires metrics-server)
kubectl top nodes

# List all taints across all nodes
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.spec.taints // [] | map("\(.key)=\(.value):\(.effect)") | join(", "))"'

# Cordon a node (stop new scheduling)
kubectl cordon node-1

# Uncordon (re-enable scheduling)
kubectl uncordon node-1