Advanced Topics

Scheduler, Affinity & Taints

● Advanced ⏱ 15 min read

The Kubernetes scheduler decides which node every unscheduled pod lands on. It does this through a two-phase pipeline — filter then score — running a chain of plugins. Understanding this pipeline explains why pods get stuck in Pending, why co-location works, and how to express complex placement requirements without fighting the scheduler.

Scheduler Pipeline

The scheduler watches for newly created pods that have no nodeName. For each unscheduled pod it runs:

Scheduler pipeline — filter then score

All nodes

PreFilter → Filter (NodeResourcesFit, NodeAffinity, TaintToleration, PodTopologySpread…)

↓ feasible nodes only

Feasible nodes

PreScore → Score (NodeAffinity weight, InterPodAffinity, ImageLocality, NodeResourcesFit with the LeastAllocated strategy…)

↓ highest score wins

Selected node

Reserve → Permit → PreBind → Bind (write nodeName to pod spec)

Pending forever? Most often, every node failed the Filter phase. Check node resources, taints and tolerations, node affinity, topology spread constraints, and pod affinity.

Filter eliminates nodes that can't run the pod. Score ranks remaining nodes. The highest-scoring node is selected and the pod's nodeName is set.
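The filter-then-score flow can be illustrated with a toy model. The sketch below is not the scheduler's implementation: the `Node` and `Pod` classes, the two filters, and the single scoring function are hypothetical simplifications chosen only to show how infeasible nodes are dropped first and the best remaining node is picked second.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    labels: dict = field(default_factory=dict)


@dataclass
class Pod:
    name: str
    cpu_request_millis: int
    required_labels: dict = field(default_factory=dict)


def fits_resources(pod, node):
    # Stand-in for NodeResourcesFit: only CPU is modelled here.
    return node.free_cpu_millis >= pod.cpu_request_millis


def matches_required_labels(pod, node):
    # Stand-in for a required node affinity rule with exact label matches.
    return all(node.labels.get(k) == v for k, v in pod.required_labels.items())


def free_cpu_after_placement(pod, node):
    # Toy score: more CPU left over after placement scores higher.
    return node.free_cpu_millis - pod.cpu_request_millis


FILTERS = [fits_resources, matches_required_labels]
SCORERS = [free_cpu_after_placement]


def schedule(pod, nodes):
    """Return the chosen node name, or None if the pod would stay Pending."""
    feasible = [n for n in nodes if all(f(pod, n) for f in FILTERS)]
    if not feasible:
        return None
    best = max(feasible, key=lambda n: sum(s(pod, n) for s in SCORERS))
    return best.name


nodes = [
    Node("node-a", 500, {"zone": "us-east-1a"}),
    Node("node-b", 4000, {"zone": "us-east-1b"}),
    Node("node-c", 8000, {"zone": "us-east-1c"}),
]
pod = Pod("web", 1000, {"zone": "us-east-1b"})
print(schedule(pod, nodes))  # node-b
```

Here node-a is filtered out for lack of CPU and node-c for its zone label, so node-b wins even though node-c would score higher. A pod that no node can hold returns `None`, which corresponds to a pod stuck in Pending.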

Filter & Score

Key filter plugins and what they check:

Plugin	Filters out nodes where…
`NodeResourcesFit`	Node doesn't have enough allocatable CPU/memory for the pod's requests.
`NodeAffinity`	Node labels don't match `requiredDuringScheduling` expressions.
`TaintToleration`	Node has a `NoSchedule` or `NoExecute` taint that the pod doesn't tolerate.
`PodTopologySpread`	Placing the pod here would exceed `maxSkew` for a `DoNotSchedule` constraint.
`InterPodAffinity`	Required pod affinity/anti-affinity rule cannot be satisfied.
`VolumeBinding`	The pod's PVCs can't be bound or provisioned for this node (e.g. the volume lives in a different zone).

Node Affinity

Node affinity extends nodeSelector with a richer expression language (nodeSelector itself remains supported). Hard rules (required) act as filters; soft rules (preferred) contribute weighted scores.

nodeAffinity — hard + soft rules

spec:
  affinity:
    nodeAffinity:
      # Hard: pod won't schedule unless satisfied
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]

      # Soft: prefer these instance types (weight 80) and SSD nodes (weight 20); neither is required
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80               # 1–100; higher = stronger preference
        preference:
          matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values: ["m6i.xlarge", "m6i.2xlarge"]
      - weight: 20
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
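To see how the two soft rules above combine, the following sketch sums the weights of every preferred term a node satisfies. It is a simplified model: each term is reduced to a single `In` expression, the helper name `preferred_affinity_score` is hypothetical, and the real scheduler additionally normalizes this sum and blends it with other score plugins.

```python
def preferred_affinity_score(node_labels, preferred_terms):
    """Sum the weights of all preferred terms that the node's labels satisfy."""
    total = 0
    for weight, key, allowed_values in preferred_terms:
        if node_labels.get(key) in allowed_values:
            total += weight
    return total


# Mirrors the two preferred terms in the manifest above.
PREFERRED = [
    (80, "node.kubernetes.io/instance-type", {"m6i.xlarge", "m6i.2xlarge"}),
    (20, "disktype", {"ssd"}),
]

both = {"node.kubernetes.io/instance-type": "m6i.xlarge", "disktype": "ssd"}
instance_only = {"node.kubernetes.io/instance-type": "m6i.2xlarge"}
ssd_only = {"node.kubernetes.io/instance-type": "c5.large", "disktype": "ssd"}

print(preferred_affinity_score(both, PREFERRED))           # 100
print(preferred_affinity_score(instance_only, PREFERRED))  # 80
print(preferred_affinity_score(ssd_only, PREFERRED))       # 20
```

A node matching neither term scores 0 on this plugin but remains feasible, which is the practical difference between a preferred and a required rule.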

💡

IgnoredDuringExecution

Both required and preferred affinity rules are enforced only at scheduling time. If a node's labels change after a pod is running, the pod is not evicted. A requiredDuringSchedulingRequiredDuringExecution variant that would enforce rules continuously has long been proposed, but it is not available in the Kubernetes API.

Pod Affinity & Anti-Affinity

Pod affinity places pods near other pods (same node or same zone). Pod anti-affinity spreads pods away from each other. Both use topologyKey to define what "near" means.

podAntiAffinity — spread replicas across zones

spec:
  affinity:
    podAntiAffinity:
      # Hard: two replicas of this app cannot be in the same zone
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp
        topologyKey: topology.kubernetes.io/zone

    podAffinity:
      # Soft: co-locate with the cache pod on the same node (low latency)
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: cache
          topologyKey: kubernetes.io/hostname

⚠️

Required pod anti-affinity can block scale-out

If you require one pod per zone and only have 3 zones, you can never scale beyond 3 replicas — new pods will be permanently Pending. Use preferred anti-affinity for the zone spread, and supplement with topology spread constraints for softer enforcement.
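The scale-out limit is easy to reproduce with a toy model. The sketch below assumes three zones and a required anti-affinity rule keyed on zone; the helper `feasible_zones` is hypothetical and ignores every other scheduling constraint.

```python
def feasible_zones(zones, pods_by_zone):
    """Zones that a required zone-level anti-affinity rule still allows."""
    return [z for z in zones if pods_by_zone.get(z, 0) == 0]


zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
placed = {}   # zone -> number of matching pods already running there
pending = 0

for replica in range(4):          # try to schedule 4 replicas
    candidates = feasible_zones(zones, placed)
    if candidates:
        placed[candidates[0]] = 1
    else:
        pending += 1              # no zone passes the filter

print(len(placed), pending)       # 3 1
```

The fourth replica finds no feasible zone, so it stays Pending until a new zone appears or the rule is relaxed.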

Taints & Tolerations (deep dive)

Three taint effects and their behaviour:

Effect	New pods without toleration	Running pods without toleration
`NoSchedule`	Not scheduled onto this node	Continue running (not evicted)
`PreferNoSchedule`	Scheduler avoids this node if possible	Continue running
`NoExecute`	Not scheduled	Evicted immediately (pods that tolerate the taint with `tolerationSeconds` set are evicted after that delay)

toleration with grace period — ride out transient node issues

spec:
  tolerations:
  # Kubernetes adds these automatically — tolerate node.kubernetes.io/not-ready
  # for 300s before evicting (configurable via --default-not-ready-toleration-seconds)
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300       # stay on the node for 5 min before eviction

  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300

  # Wildcard toleration — tolerate any taint (use sparingly)
  - operator: Exists             # no key = match any key
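The matching rules behind the table and the wildcard toleration can be sketched as a small function. This is an illustrative model, not the Kubernetes source: the `Taint` and `Toleration` classes and the helper names are assumptions, and `tolerationSeconds` is left out because it affects eviction timing rather than matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key (with Exists) matches every key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol, taint):
    """True if the toleration matches the taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    # Equal: values must match. (The API rejects an empty key with Equal.)
    return tol.value == taint.value


def blocking_taints(taints, tolerations):
    """Taints that would filter this node out for a new pod."""
    return [
        t for t in taints
        if t.effect in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]


node_taints = [
    Taint("dedicated", "NoSchedule", "gpu"),
    Taint("node.kubernetes.io/not-ready", "NoExecute"),
    Taint("spot", "PreferNoSchedule"),
]
gpu_tol = Toleration(key="dedicated", operator="Equal", value="gpu", effect="NoSchedule")
wildcard = Toleration(operator="Exists")

print(len(blocking_taints(node_taints, [])))          # 2
print(len(blocking_taints(node_taints, [gpu_tol])))   # 1
print(len(blocking_taints(node_taints, [wildcard])))  # 0
```

The `PreferNoSchedule` taint never appears in the blocking list because it only influences scoring, and the wildcard toleration clears everything, which is why it should be used sparingly.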

Topology Spread Constraints

Topology spread constraints are more flexible than pod anti-affinity for even distribution. They express "spread pods as evenly as possible across zones" without hard per-zone limits.

spec:
  topologySpreadConstraints:
  # Spread evenly across zones — max 1 pod difference between zones
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule    # hard constraint
    labelSelector:
      matchLabels:
        app: myapp
    minDomains: 3                       # if fewer than 3 eligible zones exist, the global minimum counts as 0; valid only with DoNotSchedule

  # Also spread across nodes within zones — softer constraint
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # soft — try but don't block
    labelSelector:
      matchLabels:
        app: myapp
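The maxSkew check is simple arithmetic: a domain is allowed if, after adding the new pod, its count exceeds the smallest domain's count by no more than maxSkew. The sketch below assumes the incoming pod matches its own labelSelector and ignores minDomains and node-level filters; `allowed_domains` is a hypothetical helper.

```python
def allowed_domains(pods_per_domain, max_skew):
    """Domains where adding one matching pod keeps skew within max_skew."""
    global_min = min(pods_per_domain.values())
    return sorted(
        domain
        for domain, count in pods_per_domain.items()
        if (count + 1) - global_min <= max_skew
    )


counts = {"us-east-1a": 2, "us-east-1b": 1, "us-east-1c": 1}
print(allowed_domains(counts, max_skew=1))
# ['us-east-1b', 'us-east-1c']
```

With counts of 2, 1, and 1, placing the pod in us-east-1a would give a skew of 3 - 1 = 2, so a `DoNotSchedule` constraint filters that zone out. Unlike required anti-affinity, the zones never fill up: once all zones hold the same number of pods, every zone is allowed again.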

Scheduler Profiles

Multiple scheduler profiles let different workload types use different scoring strategies — e.g., a latency-sensitive profile that prefers least-allocated nodes versus a batch profile that packs pods tightly to save cost.

KubeSchedulerConfiguration — two profiles

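apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
# LeastAllocated and MostAllocated are scoring strategies of the
# NodeResourcesFit plugin, so they are set through pluginConfig.
- schedulerName: default-scheduler        # used by all pods by default
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated             # spread load, keep headroom
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1

- schedulerName: batch-scheduler          # used by batch jobs
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated              # pack tightly, save money
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1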

use a custom scheduler profile

spec:
  schedulerName: batch-scheduler          # override default-scheduler
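The difference between the two profiles comes down to which direction the resource score runs. The sketch below is a single-resource approximation with hypothetical function names; the real NodeResourcesFit plugin computes a weighted average across the configured resources.

```python
def least_allocated(requested, capacity):
    """Score 0-100: more free capacity after placement scores higher."""
    return (capacity - requested) * 100 // capacity


def most_allocated(requested, capacity):
    """Score 0-100: fuller nodes score higher (bin packing)."""
    return requested * 100 // capacity


def pick(nodes, strategy):
    """Return the node name with the highest score under the strategy."""
    return max(nodes, key=lambda name: strategy(*nodes[name]))


# node -> (CPU millicores requested including the incoming pod, capacity)
nodes = {"node-a": (3000, 4000), "node-b": (1000, 4000)}

print(pick(nodes, least_allocated))  # node-b
print(pick(nodes, most_allocated))   # node-a
```

The same pair of nodes produces opposite choices: the latency-sensitive profile picks the emptier node-b, while the batch profile picks the fuller node-a so that idle nodes can be scaled away.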

kubectl Commands

# Why is a pod Pending? — check events for scheduler messages
kubectl describe pod mypod -n production | grep -A10 "Events:"

# Which node was the pod scheduled on?
kubectl get pod mypod -o jsonpath='{.spec.nodeName}'

# List node labels (used by affinity and nodeSelector)
kubectl get nodes --show-labels

# Validate a pod manifest server-side (runs admission and validation only; dry-run does not invoke the scheduler)
kubectl get pod mypod -o yaml | kubectl apply --dry-run=server -f -

# Count pods per node (custom-columns avoids column shifts in the wide output)
kubectl get pods -A -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c | sort -rn

# Check taint on all nodes
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \([.spec.taints[]? | "\(.key)=\(.value // ""):\(.effect)"] | join(", "))"'