Advanced Topics

Scheduler, Affinity & Taints


The Kubernetes scheduler decides which node every unscheduled pod lands on. It does this through a two-phase pipeline — filter then score — running a chain of plugins. Understanding this pipeline explains why pods get stuck in Pending, why co-location works, and how to express complex placement requirements without fighting the scheduler.

Scheduler Pipeline

The scheduler watches the API server for pods whose spec.nodeName is empty. For each unscheduled pod it runs:

Scheduler pipeline: filter, then score

All nodes
  ↓ PreFilter → Filter (NodeResourcesFit, NodeAffinity, TaintToleration, PodTopologySpread…)
Feasible nodes
  ↓ PreScore → Score (NodeAffinity weight, InterPodAffinity, ImageLocality, NodeResourcesFit with the LeastAllocated strategy…)
Selected node (highest score wins)
  ↓ Reserve → Permit → PreBind → Bind (write nodeName to pod spec)

Pending forever? All nodes failed the Filter phase. Check: node resources, taints/tolerations, node affinity, topology spread constraints, pod affinity.

Filter eliminates nodes that can't run the pod. Score ranks remaining nodes. The highest-scoring node is selected and the pod's nodeName is set.
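
The two phases can be sketched in a few lines of Python. This is a toy model and not the scheduler's actual code: the `Node` class, the `fits` and `least_allocated_score` helpers, the node names, and the CPU-only accounting are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    allocatable_cpu: int  # millicores the node can hand out to pods
    requested_cpu: int    # millicores already requested by pods on the node


def fits(node: Node, pod_cpu: int) -> bool:
    """Filter phase: keep the node only if the pod's request still fits."""
    return node.allocatable_cpu - node.requested_cpu >= pod_cpu


def least_allocated_score(node: Node, pod_cpu: int) -> float:
    """Score phase: 0-100, higher when more CPU stays free after placement."""
    free_after = node.allocatable_cpu - node.requested_cpu - pod_cpu
    return 100 * free_after / node.allocatable_cpu


def schedule(nodes: List[Node], pod_cpu: int) -> Optional[str]:
    feasible = [n for n in nodes if fits(n, pod_cpu)]
    if not feasible:
        return None  # every node was filtered out, so the pod stays Pending
    best = max(feasible, key=lambda n: least_allocated_score(n, pod_cpu))
    return best.name


# Hypothetical three-node cluster
nodes = [
    Node("node-a", allocatable_cpu=4000, requested_cpu=3800),
    Node("node-b", allocatable_cpu=4000, requested_cpu=1000),
    Node("node-c", allocatable_cpu=8000, requested_cpu=6000),
]

print(schedule(nodes, pod_cpu=500))   # node-a is filtered out; node-b scores highest
print(schedule(nodes, pod_cpu=5000))  # no node fits, so the result is None
```

The second call shows the Pending case: scoring never runs because the filter phase leaves no feasible node.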

Filter & Score

Key filter plugins and what they check:

| Plugin | Filters out nodes where… |
| --- | --- |
| NodeResourcesFit | Node doesn't have enough allocatable CPU/memory for the pod's requests. |
| NodeAffinity | Node labels don't match requiredDuringScheduling expressions. |
| TaintToleration | Node has a taint the pod doesn't tolerate with NoSchedule or NoExecute effect. |
| PodTopologySpread | Placing the pod here would exceed maxSkew for a DoNotSchedule constraint. |
| InterPodAffinity | Required pod affinity/anti-affinity rule cannot be satisfied. |
| VolumeBinding | PVC can't be bound on this node (e.g. wrong zone for regional PD). |

Node Affinity

Node affinity is a more expressive alternative to nodeSelector, which remains supported for simple label matches. Hard rules (required) act as filters; soft rules (preferred) contribute scoring weights.

nodeAffinity — hard + soft rules
spec:
  affinity:
    nodeAffinity:
      # Hard: pod won't schedule unless satisfied
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]

      # Soft: prefer these instance types and SSD-backed nodes, but don't require them
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80               # 1–100; higher = stronger preference
        preference:
          matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values: ["m6i.xlarge", "m6i.2xlarge"]
      - weight: 20
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
💡
IgnoredDuringExecution

Both required and preferred affinity rules are enforced only at scheduling time. If a node's labels change after a pod is running, the pod is not evicted. A requiredDuringSchedulingRequiredDuringExecution variant that would enforce rules continuously has been proposed but is not available in current Kubernetes releases.
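
To make the hard/soft split concrete, here is a minimal Python sketch of how the rules above could be evaluated against a node's labels. It is an approximation for illustration only: the function names are invented here, only the In, NotIn, Exists and DoesNotExist operators are modelled, and the real scheduler normalizes preferred scores before combining them with other plugins.

```python
def matches_expression(labels, expr):
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values  # also true when the label is absent
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not modelled in this sketch: {op}")


def matches_all(labels, expressions):
    # Expressions inside a single term are ANDed together
    return all(matches_expression(labels, e) for e in expressions)


def passes_required(labels, node_selector_terms):
    # nodeSelectorTerms are ORed: one fully matching term is enough
    return any(matches_all(labels, t["matchExpressions"]) for t in node_selector_terms)


def preferred_score(labels, preferences):
    # Sum the weights of every preference the node satisfies
    return sum(
        p["weight"]
        for p in preferences
        if matches_all(labels, p["preference"]["matchExpressions"])
    )


required_terms = [{"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In",
     "values": ["us-east-1a", "us-east-1b"]},
]}]
preferences = [
    {"weight": 80, "preference": {"matchExpressions": [
        {"key": "node.kubernetes.io/instance-type", "operator": "In",
         "values": ["m6i.xlarge", "m6i.2xlarge"]}]}},
    {"weight": 20, "preference": {"matchExpressions": [
        {"key": "disktype", "operator": "In", "values": ["ssd"]}]}},
]

# Two hypothetical nodes
good_node = {"topology.kubernetes.io/zone": "us-east-1a",
             "node.kubernetes.io/instance-type": "m6i.xlarge",
             "disktype": "ssd"}
wrong_zone = {"topology.kubernetes.io/zone": "us-east-1c", "disktype": "ssd"}

print(passes_required(good_node, required_terms), preferred_score(good_node, preferences))
print(passes_required(wrong_zone, required_terms), preferred_score(wrong_zone, preferences))
```

The second node would earn a preferred weight of 20, but that never matters: it fails the required rule and is removed in the filter phase.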

Pod Affinity & Anti-Affinity

Pod affinity places pods near other pods (same node or same zone). Pod anti-affinity spreads pods away from each other. Both use topologyKey to define what "near" means.

podAntiAffinity — spread replicas across zones
spec:
  affinity:
    podAntiAffinity:
      # Hard: two replicas of this app cannot be in the same zone
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp
        topologyKey: topology.kubernetes.io/zone

    podAffinity:
      # Soft: co-locate with the cache pod on the same node (low latency)
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: cache
          topologyKey: kubernetes.io/hostname
⚠️
Required pod anti-affinity can block scale-out

If you require one pod per zone and have only 3 zones, you can never scale beyond 3 replicas: any additional pods stay Pending until a matching replica is removed or a new zone is added. Prefer soft (preferred) anti-affinity for zone spread, or use topology spread constraints, which keep replicas evenly distributed while still allowing several per zone.
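
The scale-out limit is easy to reproduce in a small simulation. The sketch below assumes a hypothetical helper, `place_replicas`, that applies a required one-pod-per-zone rule and ignores every other scheduling concern.

```python
def place_replicas(zones, replicas):
    """Place replicas one by one under a required per-zone anti-affinity rule.

    Returns the chosen zone for each replica, or None when the replica
    cannot be placed (it would stay Pending).
    """
    occupied = set()
    placements = []
    for _ in range(replicas):
        free = [z for z in zones if z not in occupied]
        if free:
            occupied.add(free[0])
            placements.append(free[0])
        else:
            placements.append(None)  # every zone already hosts a matching pod
    return placements


zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
print(place_replicas(zones, 5))
```

With three zones, the fourth and fifth replicas have nowhere to go, which is exactly the Pending state described above.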

Taints & Tolerations (deep dive)

Three taint effects and their behaviour:

| Effect | New pods without toleration | Running pods without toleration |
| --- | --- | --- |
| NoSchedule | Not scheduled onto this node | Continue running (not evicted) |
| PreferNoSchedule | Scheduler avoids this node if possible | Continue running |
| NoExecute | Not scheduled | Evicted immediately (pods whose toleration sets tolerationSeconds are evicted after that delay) |

toleration with grace period: ride out transient node issues
spec:
  tolerations:
  # Kubernetes adds these two tolerations automatically, so pods ride out a
  # not-ready or unreachable node for 300s before eviction (defaults set by the kube-apiserver flags
  # --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds)
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300       # stay on the node for 5 min before eviction

  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300

  # Wildcard toleration — tolerate any taint (use sparingly)
  - operator: Exists             # no key = match any key
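
The matching rules between a toleration and a taint can be written out as a short Python sketch. This is a simplified reading of the documented rules, not the scheduler's code; the `tolerates` and `node_passes_filter` names and the `dedicated=gpu` taint are hypothetical, and tolerationSeconds is not modelled.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key")
    if not key:
        # An empty key is only meaningful with Exists and matches every key
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True  # the value is ignored
    return toleration.get("value", "") == taint.get("value", "")


def node_passes_filter(tolerations, taints):
    """Filter check: every NoSchedule/NoExecute taint must be tolerated."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Hypothetical taint on a dedicated GPU node
taints = [{"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}]

exact = [{"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}]
wildcard = [{"operator": "Exists"}]
wrong_effect = [{"key": "dedicated", "operator": "Exists", "effect": "NoExecute"}]

print(node_passes_filter([], taints))            # no toleration
print(node_passes_filter(exact, taints))         # key, value and effect match
print(node_passes_filter(wildcard, taints))      # wildcard tolerates everything
print(node_passes_filter(wrong_effect, taints))  # effect differs
```

PreferNoSchedule taints are deliberately left out of `node_passes_filter`: they influence scoring rather than filtering.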

Topology Spread Constraints

Topology spread constraints are more flexible than pod anti-affinity for even distribution. They express "spread pods as evenly as possible across zones" without hard per-zone limits.

spec:
  topologySpreadConstraints:
  # Spread evenly across zones — max 1 pod difference between zones
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule    # hard constraint
    labelSelector:
      matchLabels:
        app: myapp
    minDomains: 3                       # expect at least 3 zones; with fewer, the global minimum is treated as 0

  # Also spread across nodes within zones — softer constraint
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # soft — try but don't block
    labelSelector:
      matchLabels:
        app: myapp
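
The maxSkew rule is simple arithmetic: a zone is acceptable if, after adding the new pod, its matching-pod count exceeds the smallest count among all zones by no more than maxSkew. The Python sketch below illustrates that check for a DoNotSchedule constraint; `allowed_domains` is a name invented here, and minDomains and node eligibility are not modelled.

```python
def allowed_domains(counts, max_skew):
    """Return the topology domains where one more matching pod may be placed.

    counts maps each domain (for example a zone) to the number of matching
    pods it already runs.
    """
    global_min = min(counts.values())
    return sorted(
        domain
        for domain, count in counts.items()
        if (count + 1) - global_min <= max_skew
    )


# Zone us-east-1a is already one pod ahead, so it is excluded
print(allowed_domains({"us-east-1a": 2, "us-east-1b": 1, "us-east-1c": 1}, max_skew=1))

# Perfectly balanced: any zone may take the next pod
print(allowed_domains({"us-east-1a": 1, "us-east-1b": 1, "us-east-1c": 1}, max_skew=1))
```

Unlike a required one-pod-per-zone anti-affinity rule, this check always leaves at least one zone open, so scaling past the number of zones keeps working.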

Scheduler Profiles

Multiple scheduler profiles let different workload types use different scoring strategies — e.g., a latency-sensitive profile that prefers least-allocated nodes versus a batch profile that packs pods tightly to save cost.

KubeSchedulerConfiguration — two profiles
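apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler        # used by all pods by default
  pluginConfig:
  - name: NodeResourcesFit                # LeastAllocated/MostAllocated are scoring strategies of this plugin
    args:
      scoringStrategy:
        type: LeastAllocated              # favour emptier nodes (the default strategy)
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1

- schedulerName: batch-scheduler          # used by batch jobs
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated               # pack tightly to save money
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1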
use a custom scheduler profile
spec:
  schedulerName: batch-scheduler          # override default-scheduler
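
The effect of the two strategies can be illustrated with a rough Python model. It assumes a single resource (CPU in millicores) and hypothetical node data; the real plugin combines several weighted resources, so treat the numbers as indicative only.

```python
def least_allocated(requested, allocatable):
    """Higher score for nodes that would stay emptier (spreads load)."""
    return 100 * (allocatable - requested) / allocatable


def most_allocated(requested, allocatable):
    """Higher score for nodes that would end up fuller (bin packing)."""
    return 100 * requested / allocatable


# Assumed mapping from the profile names used above to a scoring strategy
STRATEGY_BY_PROFILE = {
    "default-scheduler": least_allocated,
    "batch-scheduler": most_allocated,
}


def pick_node(scheduler_name, nodes, pod_cpu):
    """nodes maps a node name to (requested_cpu, allocatable_cpu)."""
    score = STRATEGY_BY_PROFILE[scheduler_name]
    # Filter first: drop nodes where the pod would not fit
    feasible = {
        name: (req, alloc)
        for name, (req, alloc) in nodes.items()
        if req + pod_cpu <= alloc
    }
    # Score with the pod's request included, then take the best node.
    # This sketch assumes at least one node is feasible.
    return max(
        feasible,
        key=lambda name: score(feasible[name][0] + pod_cpu, feasible[name][1]),
    )


nodes = {"node-a": (3000, 4000), "node-b": (500, 4000)}  # hypothetical cluster

print(pick_node("default-scheduler", nodes, pod_cpu=500))  # picks the emptier node
print(pick_node("batch-scheduler", nodes, pod_cpu=500))    # picks the fuller node
```

The same pod lands on different nodes depending only on which profile its schedulerName selects.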

kubectl Commands

# Why is a pod Pending? — check events for scheduler messages
kubectl describe pod mypod -n production | grep -A10 "Events:"

# Which node was the pod scheduled on?
kubectl get pod mypod -o jsonpath='{.spec.nodeName}'

# List node labels (used by affinity and nodeSelector)
kubectl get nodes --show-labels

# Validate a pod manifest server-side (schema and admission checks only; this does not run the scheduler)
kubectl get pod mypod -o yaml | kubectl apply --dry-run=server -f -

# Count pods per node (unscheduled pods show up as <none>)
kubectl get pods -A -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c | sort -rn

# List the taints on every node
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \([.spec.taints[]? | "\(.key)=\(.value // ""):\(.effect)"] | join(", "))"'