Scheduler, Affinity & Taints
The Kubernetes scheduler decides which node every unscheduled pod lands on. It does this through a two-phase pipeline — filter then score — running a chain of plugins. Understanding this pipeline explains why pods get stuck in Pending, why co-location works, and how to express complex placement requirements without fighting the scheduler.
Scheduler Pipeline
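The scheduler watches the API server for pods whose spec.nodeName is empty. For each such pod it runs one scheduling cycle: filter plugins remove every node that cannot host the pod, score plugins rank the nodes that remain, and the pod is bound to the highest-scoring node. If no node survives filtering, the pod stays Pending.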
Filter & Score
Key filter plugins and what they check:
| Plugin | Filters out nodes where… |
|---|---|
| NodeResourcesFit | Node doesn't have enough allocatable CPU/memory for the pod's requests. |
| NodeAffinity | Node labels don't match requiredDuringScheduling expressions. |
| TaintToleration | Node has a taint the pod doesn't tolerate with NoSchedule or NoExecute effect. |
| PodTopologySpread | Placing the pod here would exceed maxSkew for a DoNotSchedule constraint. |
| InterPodAffinity | Required pod affinity/anti-affinity rule cannot be satisfied. |
| VolumeBinding | PVC can't be bound on this node (e.g. wrong zone for regional PD). |
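The filter-then-score flow can be sketched in a few lines of Python. This is a toy model, not the scheduler's implementation: the `Node` and `Pod` classes, the two filter functions, and the single scoring function are hypothetical stand-ins for NodeResourcesFit, TaintToleration, and a least-allocated score, and taints are reduced to bare keys.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int                      # allocatable CPU minus existing requests (millicores)
    free_mem_mi: int                     # allocatable memory minus existing requests (MiB)
    taints: set = field(default_factory=set)   # simplified: NoSchedule taint keys only


@dataclass
class Pod:
    cpu_m: int
    mem_mi: int
    tolerations: set = field(default_factory=set)  # simplified: tolerated taint keys


def fits_resources(pod, node):
    # Stand-in for NodeResourcesFit: requests must fit into what is left.
    return pod.cpu_m <= node.free_cpu_m and pod.mem_mi <= node.free_mem_mi


def tolerates_taints(pod, node):
    # Stand-in for TaintToleration: every taint on the node must be tolerated.
    return node.taints <= pod.tolerations


FILTERS = [fits_resources, tolerates_taints]


def score(pod, node):
    # Toy score: more CPU left after placement ranks higher.
    return node.free_cpu_m - pod.cpu_m


def schedule(pod, nodes):
    # Phase 1: a node is feasible only if every filter accepts it.
    feasible = [n for n in nodes if all(f(pod, n) for f in FILTERS)]
    if not feasible:
        return None                      # nothing feasible: the pod stays Pending
    # Phase 2: pick the highest-scoring feasible node.
    return max(feasible, key=lambda n: score(pod, n)).name


nodes = [
    Node("node-a", free_cpu_m=500, free_mem_mi=1024),
    Node("node-b", free_cpu_m=4000, free_mem_mi=8192, taints={"dedicated"}),
    Node("node-c", free_cpu_m=2000, free_mem_mi=4096),
]

print(schedule(Pod(cpu_m=1000, mem_mi=512), nodes))
print(schedule(Pod(cpu_m=1000, mem_mi=512, tolerations={"dedicated"}), nodes))
print(schedule(Pod(cpu_m=8000, mem_mi=512), nodes))
```

The first pod lands on node-c because node-a is too small and node-b is tainted. Adding the toleration makes node-b feasible, and it wins on score. The third pod requests more CPU than any node has free, so `schedule` returns `None`, the toy equivalent of Pending.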
Node Affinity
Node affinity replaces nodeSelector with a richer expression language. Hard rules (required) are filters; soft rules (preferred) are scoring weights.
spec:
  affinity:
    nodeAffinity:
      # Hard: pod won't schedule unless satisfied
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a", "us-east-1b"]
      # Soft: prefer these instance types and SSD nodes, but don't require them
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80   # 1-100; higher = stronger preference
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["m6i.xlarge", "m6i.2xlarge"]
        - weight: 20
          preference:
            matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd"]
The IgnoredDuringExecution suffix means that both required and preferred rules are evaluated only at scheduling time. If a node's labels change after a pod is running, the pod is not evicted. A requiredDuringSchedulingRequiredDuringExecution variant that would enforce rules continuously has been planned but is not available.
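To make the matching semantics concrete, here is a minimal Python sketch of how the rules in the manifest above combine: nodeSelectorTerms are ORed, the expressions inside one term are ANDed, and each matching preference adds its weight. The helper names and the three sample nodes are hypothetical, only four operators are modeled (Gt and Lt are left out), and the real scheduler normalizes the summed weights before combining them with other plugins.

```python
def expr_matches(labels, expr):
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return labels.get(key) not in values   # a missing label also matches
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def term_matches(labels, term):
    # Expressions inside a single term are ANDed.
    return all(expr_matches(labels, e) for e in term["matchExpressions"])


def passes_required(labels, node_selector_terms):
    # nodeSelectorTerms are ORed: one matching term is enough.
    return any(term_matches(labels, t) for t in node_selector_terms)


def preferred_score(labels, preferred):
    # Raw score: sum of the weights of every matching preference.
    return sum(p["weight"] for p in preferred if term_matches(labels, p["preference"]))


required = [{"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In",
     "values": ["us-east-1a", "us-east-1b"]}]}]

preferred = [
    {"weight": 80, "preference": {"matchExpressions": [
        {"key": "node.kubernetes.io/instance-type", "operator": "In",
         "values": ["m6i.xlarge", "m6i.2xlarge"]}]}},
    {"weight": 20, "preference": {"matchExpressions": [
        {"key": "disktype", "operator": "In", "values": ["ssd"]}]}},
]

# Hypothetical nodes and labels, for illustration only.
node_labels = {
    "node-1": {"topology.kubernetes.io/zone": "us-east-1a",
               "node.kubernetes.io/instance-type": "m6i.xlarge", "disktype": "ssd"},
    "node-2": {"topology.kubernetes.io/zone": "us-east-1b",
               "node.kubernetes.io/instance-type": "c5.large", "disktype": "ssd"},
    "node-3": {"topology.kubernetes.io/zone": "us-east-1c",
               "node.kubernetes.io/instance-type": "m6i.xlarge"},
}

for name, labels in node_labels.items():
    if passes_required(labels, required):
        print(name, "feasible, raw preference score", preferred_score(labels, preferred))
    else:
        print(name, "filtered out by required node affinity")
```

node-1 matches both preferences (raw score 100), node-2 matches only the disktype preference (20), and node-3 never reaches scoring because its zone fails the required rule.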
Pod Affinity & Anti-Affinity
Pod affinity places pods near other pods (same node or same zone). Pod anti-affinity spreads pods away from each other. Both use topologyKey to define what "near" means.
spec:
  affinity:
    podAntiAffinity:
      # Hard: two replicas of this app cannot be in the same zone
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: myapp
          topologyKey: topology.kubernetes.io/zone
    podAffinity:
      # Soft: co-locate with the cache pod on the same node (low latency)
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: cache
            topologyKey: kubernetes.io/hostname
A required zone-level anti-affinity rule allows at most one matching pod per zone. With only 3 zones you can never run more than 3 replicas: every additional pod stays Pending. If you need more replicas than zones, make the anti-affinity preferred, or use topology spread constraints, which bound the imbalance between zones instead of capping each zone at one pod.
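The replica ceiling can be demonstrated with a small Python simulation. It assumes a hypothetical six-node, three-zone cluster and models only the required anti-affinity rule (one pod of the app per zone). All names are illustrative.

```python
def place_replicas(replicas, node_zones):
    """Place replicas one by one under required zone-level self anti-affinity.

    node_zones maps node name -> zone. Returns (placed_nodes, pending_count).
    """
    used_zones = set()
    placed, pending = [], 0
    for _ in range(replicas):
        # A node is feasible only if no replica already runs in its zone.
        candidate = next(
            (node for node, zone in node_zones.items() if zone not in used_zones),
            None,
        )
        if candidate is None:
            pending += 1                 # no feasible node: this replica stays Pending
        else:
            placed.append(candidate)
            used_zones.add(node_zones[candidate])
    return placed, pending


# Hypothetical cluster: two nodes in each of three zones.
node_zones = {
    "node-1": "zone-a", "node-2": "zone-a",
    "node-3": "zone-b", "node-4": "zone-b",
    "node-5": "zone-c", "node-6": "zone-c",
}

placed, pending = place_replicas(5, node_zones)
print("placed on:", placed)
print("pending:", pending)
```

Five replicas are requested, three are placed (one per zone), and two remain Pending even though three nodes are still empty. Adding nodes does not help; only adding a zone or relaxing the rule does.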
Taints & Tolerations (deep dive)
Three taint effects and their behaviour:
| Effect | New pods without toleration | Running pods without toleration |
|---|---|---|
| NoSchedule | Not scheduled onto this node | Continue running (not evicted) |
| PreferNoSchedule | Scheduler avoids this node if possible | Continue running |
| NoExecute | Not scheduled | Evicted immediately; pods with a matching toleration stay, or are evicted after tolerationSeconds if it is set |
spec:
  tolerations:
    # The DefaultTolerationSeconds admission plugin adds these two tolerations
    # automatically. The 300s default is set on kube-apiserver with
    # --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds.
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300   # stay on the node for 5 min before eviction
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    # Wildcard toleration: tolerates any taint (use sparingly)
    - operator: Exists         # no key = match any key; no effect = match any effect
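The matching rules between a toleration and a taint are easy to get wrong, so here is a minimal Python sketch of them. It follows the documented semantics (an empty effect matches every effect, an empty key with Exists matches every key, Equal compares values), but the function names are hypothetical and tolerationSeconds is not modeled.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False                     # effect given but different
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False                     # key given but different
    if toleration.get("operator", "Equal") == "Exists":
        return True                      # Exists ignores the value
    return toleration.get("value", "") == taint.get("value", "")


def node_is_filtered(taints, tolerations):
    """Scheduling filter: only NoSchedule and NoExecute taints can block a pod."""
    blocking = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return any(
        not any(tolerates(tol, taint) for tol in tolerations)
        for taint in blocking
    )


gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
soft_taint = {"key": "maintenance", "value": "soon", "effect": "PreferNoSchedule"}

exact = {"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}
wrong_value = {"key": "dedicated", "operator": "Equal", "value": "cpu", "effect": "NoSchedule"}
wildcard = {"operator": "Exists"}

print(node_is_filtered([gpu_taint], []))             # no toleration
print(node_is_filtered([gpu_taint], [exact]))        # exact match
print(node_is_filtered([gpu_taint], [wrong_value]))  # value mismatch
print(node_is_filtered([gpu_taint], [wildcard]))     # wildcard toleration
print(node_is_filtered([soft_taint], []))            # PreferNoSchedule never filters
```

The node is filtered out in the first and third cases and accepted in the others. PreferNoSchedule only affects scoring, which is why the last call reports the node as feasible without any toleration.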
Topology Spread Constraints
Topology spread constraints are more flexible than pod anti-affinity for even distribution. They express "spread pods as evenly as possible across zones" without hard per-zone limits.
spec:
  topologySpreadConstraints:
    # Spread evenly across zones: at most 1 pod difference between zones
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule    # hard constraint
      labelSelector:
        matchLabels:
          app: myapp
      minDomains: 3   # with fewer than 3 eligible zones, the global minimum counts as 0
    # Also spread across nodes within zones, as a softer constraint
    - maxSkew: 2
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway   # soft: try, but don't block scheduling
      labelSelector:
        matchLabels:
          app: myapp
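The skew check behind a DoNotSchedule constraint can be worked through in Python. This sketch assumes the documented rule: a domain is allowed if its matching pod count, plus the incoming pod, minus the global minimum, does not exceed maxSkew, and the global minimum is treated as 0 while fewer than minDomains domains are eligible. The function name and the pod counts are hypothetical.

```python
def allowed_domains(counts, max_skew, min_domains=None):
    """Return the topology domains where one more matching pod may be placed.

    counts maps domain -> number of matching pods already running there.
    """
    global_min = min(counts.values())
    if min_domains is not None and len(counts) < min_domains:
        global_min = 0                   # too few domains: pretend an empty one exists
    return [
        domain for domain, count in counts.items()
        if (count + 1) - global_min <= max_skew
    ]


# Three zones, uneven counts: only the least-loaded zone is allowed.
print(allowed_domains({"zone-a": 2, "zone-b": 2, "zone-c": 1}, max_skew=1))

# Two zones, no minDomains: both are balanced, so both are allowed.
print(allowed_domains({"zone-a": 1, "zone-b": 1}, max_skew=1))

# Same two zones with minDomains=3: every placement would exceed the skew.
print(allowed_domains({"zone-a": 1, "zone-b": 1}, max_skew=1, min_domains=3))
```

The last call returns an empty list: with minDomains set to 3 and only two zones available, the next pod stays Pending until a third zone has eligible nodes. Unlike required anti-affinity, the constraint keeps admitting pods as long as the zones stay within maxSkew of each other.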
Scheduler Profiles
Multiple scheduler profiles let different workload types use different scoring strategies — e.g., a latency-sensitive profile that prefers least-allocated nodes versus a batch profile that packs pods tightly to save cost.
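apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler   # used by pods that don't set spec.schedulerName
    pluginConfig:
      # LeastAllocated and MostAllocated are scoring strategies of the
      # NodeResourcesFit plugin, not standalone plugins.
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated       # favor emptier nodes
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
  - schedulerName: batch-scheduler     # used by batch jobs
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated        # pack tightly to save money
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
NodeResourcesBalancedAllocation stays enabled by default in both profiles. A pod selects a profile by name in its spec:
spec:
  schedulerName: batch-scheduler       # override default-scheduler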
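The difference between spreading and packing comes down to how a node's utilization is turned into a score. The Python sketch below uses a simplified form of the two strategies with equal resource weights: least-allocated scores the free fraction of each resource, most-allocated scores the used fraction, both on a 0-100 scale. The node names and numbers are hypothetical, and the real plugin also accounts for per-resource weights and integer rounding.

```python
def least_allocated(requested, allocatable):
    """Higher score for emptier nodes (spreads load)."""
    per_resource = [
        (allocatable[r] - requested[r]) * 100 / allocatable[r] for r in allocatable
    ]
    return sum(per_resource) / len(per_resource)


def most_allocated(requested, allocatable):
    """Higher score for fuller nodes (bin-packs pods)."""
    per_resource = [requested[r] * 100 / allocatable[r] for r in allocatable]
    return sum(per_resource) / len(per_resource)


allocatable = {"cpu_m": 4000, "mem_mi": 8192}

# Requests on each node *after* adding the incoming pod (hypothetical values).
after_placement = {
    "node-x": {"cpu_m": 1000, "mem_mi": 2048},   # mostly empty
    "node-y": {"cpu_m": 3000, "mem_mi": 6144},   # mostly full
}

for name, requested in after_placement.items():
    print(
        name,
        "least-allocated:", least_allocated(requested, allocatable),
        "most-allocated:", most_allocated(requested, allocatable),
    )

best_spread = max(after_placement, key=lambda n: least_allocated(after_placement[n], allocatable))
best_pack = max(after_placement, key=lambda n: most_allocated(after_placement[n], allocatable))
print("spread profile picks:", best_spread)
print("packing profile picks:", best_pack)
```

The same two nodes rank in opposite order under the two strategies: the spreading profile picks the mostly empty node-x, while the packing profile picks node-y and leaves node-x free to be scaled down.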
kubectl Commands
# Why is a pod Pending? — check events for scheduler messages
kubectl describe pod mypod -n production | grep -A10 "Events:"
# Which node was the pod scheduled on?
kubectl get pod mypod -o jsonpath='{.spec.nodeName}'
# List node labels (used by affinity and nodeSelector)
kubectl get nodes --show-labels
# Validate a pod manifest against the API server (schema and admission only;
# a server-side dry run does not run the scheduler)
kubectl get pod mypod -o yaml | kubectl apply --dry-run=server -f -
# Count pods per node
kubectl get pods -A -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c | sort -rn
# Check taint on all nodes
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \([.spec.taints[]? | "\(.key)=\(.value // ""):\(.effect)"] | join(", "))"'