Kubernetes Architecture

● Beginner ⏱ 15 min read

A Kubernetes cluster consists of two kinds of machines: control plane nodes (one or more machines that manage the cluster) and worker nodes (machines that run your workloads). The components are loosely coupled, so each one can be restarted, replaced, or replicated independently of the others. This guide walks through each component, what it does, and how the components interact.

Diagram: a Kubernetes cluster. The control plane (kube-apiserver, etcd, kube-scheduler, kube-controller-manager) communicates with each worker node (kubelet, kube-proxy, container runtime, pods). The control plane manages cluster state; worker nodes run your workloads.

Cluster Overview

At the highest level, a Kubernetes cluster is a set of machines (nodes) that run containerised applications. Every cluster has at minimum:

A control plane — the brain of the cluster. Manages state, makes scheduling decisions, and exposes the Kubernetes API.
At least one worker node — runs the pods (groups of containers) that make up your applications.

In production, the control plane runs on dedicated nodes (often 3 for high availability) and worker nodes are separate. In development clusters (like minikube), everything runs on a single node.

Control Plane

The control plane is responsible for maintaining the desired state of the cluster — what applications are running, which container images they use, and how many replicas of each. It consists of four main components.

kube-apiserver

The API server is the front door to Kubernetes. Every operation in the cluster — whether triggered by kubectl, a CI/CD pipeline, or an internal controller — goes through the API server. It:

Validates and processes REST API requests
Is the only component that reads from and writes to etcd
Authenticates every request and then authorises it (for example via RBAC)
Listens on port 6443 (HTTPS)

The API server is designed to scale horizontally — you can run multiple instances behind a load balancer for high availability.

etcd

etcd is a consistent, distributed key-value store used as Kubernetes' backing store for all cluster data. It holds the complete state of the cluster: what pods exist, what services are configured, what secrets are stored, what nodes have joined.

⚠️

Back up etcd

If you lose etcd without a backup, you lose your cluster's entire state. In production, etcd should run as a 3 or 5-node cluster for fault tolerance, and you should take regular snapshots: etcdctl snapshot save snapshot.db.
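A snapshot routine along these lines is one way to make the backup repeatable. It is a minimal sketch, not a definitive procedure: the function name backup_etcd and the ETCDCTL override are hypothetical, and the endpoint and certificate paths assume a kubeadm-style layout that may not match your cluster.

```shell
# Binary used to talk to etcd; set ETCDCTL=echo to print the command instead.
ETCDCTL=${ETCDCTL:-etcdctl}

# Save a date-stamped snapshot into the directory given as $1.
# Endpoint and certificate paths are assumptions (kubeadm defaults).
backup_etcd() {
  local dir=$1
  local file="$dir/etcd-$(date +%Y%m%d).db"
  "$ETCDCTL" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save "$file"
}
```

Calling backup_etcd /backup on a control plane node would write a file such as /backup/etcd-<date>.db. Copy the result off the node afterwards, since a snapshot stored only on the machine it protects is lost along with that machine.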

kube-scheduler

The scheduler watches for newly created pods that have no node assigned, and selects a node for them to run on. It evaluates multiple factors:

Resource requirements (CPU and memory requests; limits are enforced on the node and do not affect placement)
Hardware/software/policy constraints (node selectors, affinity/anti-affinity rules)
Data locality and inter-workload interference
Deadlines and taints/tolerations

The scheduler does not run pods — it just decides where they should run, writing the decision back to the API server.
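The scheduler's decision can be pictured as two phases: filter out nodes that cannot run the pod, then score the remaining nodes and pick the best. The sketch below is a toy model of that idea only. The function pick_node, the node names, and the single "most free CPU" scoring rule are all made up for illustration; the real scheduler weighs many more factors.

```shell
# Toy scheduler: pick a node for a pod with a given CPU request (millicores).
# Arguments after the request are name:free_cpu pairs (illustrative data).
pick_node() {
  local request=$1; shift
  local best="" best_free=-1 entry name free
  for entry in "$@"; do
    name=${entry%%:*}
    free=${entry##*:}
    # Filter: skip nodes that cannot fit the request.
    [ "$free" -ge "$request" ] || continue
    # Score: prefer the node with the most free CPU.
    if [ "$free" -gt "$best_free" ]; then
      best=$name
      best_free=$free
    fi
  done
  if [ -n "$best" ]; then
    echo "$best"
  else
    echo "unschedulable"
  fi
}

pick_node 500 node-a:200 node-b:1500 node-c:800    # prints node-b
pick_node 2000 node-a:200 node-b:1500 node-c:800   # prints unschedulable
```

In the second call no node passes the filter, which corresponds to a pod that stays Pending until capacity becomes available.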

kube-controller-manager

The controller manager runs a collection of control loops (controllers) that watch cluster state and make changes to move the actual state toward the desired state. Key controllers include:

Controller	Responsibility
Node controller	Notices and responds when nodes go down
Job controller	Watches Job objects and creates pods to run one-off tasks
EndpointSlice controller	Populates EndpointSlice objects (linking Services to Pods)
ServiceAccount controller	Creates default ServiceAccounts for new namespaces
ReplicaSet controller	Maintains the correct number of pod replicas
Deployment controller	Manages Deployments, creating/updating ReplicaSets

Worker Nodes

Worker nodes are the machines that actually run your workloads. Every worker node runs three core components.

kubelet

The kubelet is an agent that runs on every node. It receives pod specifications (PodSpecs) from the API server and ensures the containers described in them are running and healthy. Specifically:

Talks to the container runtime via the Container Runtime Interface (CRI)
Reports node and pod status back to the API server
Runs liveness, readiness, and startup probes
Manages pod lifecycle (creation, restart, deletion)

The kubelet does not manage containers that were not created by Kubernetes — it only manages pods.

kube-proxy

kube-proxy runs on each node and maintains network rules that allow network communication to pods from sessions inside or outside the cluster. It implements part of the Kubernetes Service concept — when you create a Service, kube-proxy creates iptables (or IPVS) rules that route traffic to the correct pod endpoints.

Container Runtime

The container runtime is the software responsible for pulling container images and running them. Kubernetes supports any runtime that implements the CRI (Container Runtime Interface). Common choices:

Runtime	Notes
containerd	Default for most managed K8s offerings (EKS, GKE, AKS). Lightweight, OCI-compliant.
CRI-O	Purpose-built for Kubernetes, used by OpenShift. Minimal footprint.
Docker Engine (via cri-dockerd)	The built-in dockershim was deprecated in K8s 1.20 and removed in 1.24. Docker Engine now requires the external cri-dockerd adapter.

Add-ons

Add-ons extend the functionality of a Kubernetes cluster. They use cluster resources (DaemonSets, Deployments, etc.) to implement cluster features. Essential add-ons include:

CoreDNS — provides DNS for the cluster. Every Service gets a DNS name. Required for service discovery.
CNI plugin (Calico, Flannel, Cilium) — provides pod networking, implementing the Kubernetes network model.
Metrics Server — provides resource metrics (CPU/memory) for Horizontal Pod Autoscaler and kubectl top.
Dashboard — optional web UI for cluster management.

API Request Flow

Understanding how a kubectl apply command flows through the cluster helps demystify Kubernetes. When you run kubectl apply -f deployment.yaml:

kubectl reads your kubeconfig and sends an HTTPS request, carrying your credentials, to the kube-apiserver.
The API server authenticates the request, authorises it (for example via RBAC), validates the manifest, and persists the object to etcd.
The Deployment controller (in kube-controller-manager) notices the new Deployment and creates a ReplicaSet.
The ReplicaSet controller notices it needs N pods and creates Pod objects through the API server, which stores them in etcd.
The kube-scheduler notices unscheduled pods and assigns each one to a node.
The kubelet on the chosen node notices the pod assignment and instructs the container runtime (for example containerd) to pull the image and start the container.
The runtime starts the container; the kubelet reports the running status back to the API server.

The entire process is event-driven and eventually consistent. Each component watches the API server for the changes it cares about and reacts to them. Components never call each other directly; all coordination goes through the API server.
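The objects created along this flow can be inspected after the fact. The helper below is a minimal sketch: the function name trace_deployment and the KUBECTL override are hypothetical, and it assumes the Deployment's pods carry the label app=<name>, which depends on your manifest.

```shell
# Command used to reach the cluster; set KUBECTL=echo for a dry run.
KUBECTL=${KUBECTL:-kubectl}

# List what each stage of the flow produced for the Deployment named $1.
# Assumes pods are labelled app=<name> (an assumption about the manifest).
trace_deployment() {
  local name=$1
  "$KUBECTL" get deployment "$name"               # object persisted by the API server
  "$KUBECTL" get replicaset -l "app=$name"        # created by the Deployment controller
  "$KUBECTL" get pods -l "app=$name" -o wide      # NODE column shows the scheduler's choice
  "$KUBECTL" get events --sort-by=.lastTimestamp  # events emitted along the way
}
```

After kubectl apply -f deployment.yaml, running trace_deployment with the Deployment's name shows the Deployment, its ReplicaSet, and the pods with their assigned nodes, followed by the recent events in the current namespace.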

Diagram: API request flow for kubectl apply -f deployment.yaml. kubectl sends an HTTPS REST request to the kube-apiserver, which validates the object and writes it to etcd; a watch event prompts the controller-manager to create ReplicaSet and Pod objects; the scheduler writes nodeName to the pod spec; a watch event prompts the kubelet to pull the image and start the container. No component calls another directly.

The Reconcile Loop

Every controller in Kubernetes follows the same pattern: observe, diff, act. This is the reconcile loop (also called the control loop).

Observe — the controller reads the current state of the cluster from the API server using an efficient watch mechanism (not polling).
Diff — it compares the current state against the desired state as declared in the relevant objects (Deployment, ReplicaSet, etc.).
Act — if there is a gap, it takes the smallest action necessary to move toward the desired state and writes the result back to the API server.

Diagram: the reconcile loop. Observe (watch: the controller reads current state from the API server, event-driven), then Diff (compare: current state vs desired state in the object spec), then Act (reconcile: apply the smallest action to close the gap), then back to Observe. Every controller in Kubernetes runs this loop continuously; it is crash-safe because each step is idempotent.

This loop runs continuously. Because controllers watch for events rather than polling on a timer, Kubernetes typically starts reacting to drift shortly after the change is recorded, rather than at the next polling interval. The reconcile loop also makes Kubernetes eventually consistent: the system will keep trying until actual state matches desired state, even across retries and restarts.
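The observe, diff, act pattern can be shown with a toy replica counter. This is only a sketch of the pattern, not how any real controller is implemented; the function reconcile and the variables desired and actual are illustrative names, and the "cluster" is just a shell variable.

```shell
# Toy reconcile step: print the one action that moves actual toward desired.
reconcile() {
  local desired=$1 actual=$2
  if [ "$actual" -lt "$desired" ]; then
    echo "create $(( desired - actual ))"
  elif [ "$actual" -gt "$desired" ]; then
    echo "delete $(( actual - desired ))"
  else
    echo "noop"
  fi
}

# Simulated cluster: 3 replicas desired, only 1 running.
desired=3
actual=1
while :; do
  action=$(reconcile "$desired" "$actual")   # observe current state and diff it
  echo "actual=$actual desired=$desired -> $action"
  case $action in                            # act on the difference
    create*) actual=$(( actual + ${action#create } )) ;;
    delete*) actual=$(( actual - ${action#delete } )) ;;
    noop)    break ;;
  esac
done
```

The loop prints a "create 2" step followed by a "noop" step. Calling reconcile again once the counts match keeps returning noop, which is the idempotency property the loop relies on.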

💡

Why idempotency matters

Because a controller can run its reconcile loop at any time — including after a crash and restart — every action must be idempotent: applying it twice produces the same result as applying it once. This is why Kubernetes uses desired-state declarations rather than one-time commands.

Node Lifecycle

Nodes are the worker machines that run your workloads. Understanding how they join, report health, and are removed is essential for cluster operations.

Joining a cluster

When a node starts, its kubelet automatically registers itself with the API server (via a POST to /api/v1/nodes). In kubeadm-managed clusters, the join command provides the bootstrap token and API server address:

kubeadm join <api-server>:6443 \
  --token <bootstrap-token> \
  --discovery-token-ca-cert-hash sha256:<hash>

Node conditions

The kubelet reports node health to the API server through Node conditions:

Condition	Meaning when True
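`Ready`	Node is healthy and accepting pods. `False` = the node is not healthy. `Unknown` = the node controller has not heard from the node within its grace period (roughly 40 to 50s by default, depending on the Kubernetes version).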
`MemoryPressure`	Node is running low on memory.
`DiskPressure`	Disk capacity or inodes are low.
`PIDPressure`	Too many processes on the node.
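The conditions in the table above can be read straight from the Node object. A minimal sketch follows; the function name node_conditions and the KUBECTL override are hypothetical, while the jsonpath expression uses kubectl's standard range syntax over .status.conditions.

```shell
# Command used to reach the cluster; set KUBECTL=echo for a dry run.
KUBECTL=${KUBECTL:-kubectl}

# Print each condition of the node named $1 as TYPE=STATUS, one per line.
node_conditions() {
  "$KUBECTL" get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```

For example, node_conditions node-1 on a healthy node would typically report Ready=True and the pressure conditions as False.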

Removing a node

Before decommissioning a node, drain it to gracefully evict all pods:

# Prevent new pods from being scheduled on the node
kubectl cordon node-1

# Evict existing pods (respects PodDisruptionBudgets)
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# Remove the node object once the machine is shut down
kubectl delete node node-1

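kubectl drain does not evict DaemonSet-managed pods, because the DaemonSet controller would immediately recreate them on the same node. Without --ignore-daemonsets the drain stops with an error when it finds such pods, so the flag is almost always required.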

High Availability

A single-node control plane is a single point of failure. In production, you run multiple control plane nodes.

etcd quorum

etcd uses the Raft consensus algorithm, which requires a quorum (majority) to accept writes. Tolerable failures for common cluster sizes:

etcd nodes	Quorum required	Failures tolerated
1	1	0 — no fault tolerance
3	2	1
5	3	2

Three control plane nodes is the standard minimum for production. Five nodes are used when you want to tolerate two simultaneous control plane failures.
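The numbers in the table follow from the majority rule: quorum is floor(n/2) + 1, and the cluster tolerates n minus quorum failures. A short sketch of that arithmetic, with illustrative function names that are not part of any etcd tooling:

```shell
# Majority needed for an n-member etcd cluster: floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while a majority is still reachable.
tolerated_failures() {
  echo $(( $1 - $(quorum "$1") ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated=$(tolerated_failures "$n")"
done
```

The output shows that 4 members tolerate only one failure, the same as 3, which is why etcd clusters are normally sized with an odd number of members.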

Stacked vs external etcd

In a stacked topology, etcd runs on the same nodes as the other control plane components — simpler to operate. In an external etcd topology, etcd runs on dedicated nodes — stronger isolation but more hardware. Most managed Kubernetes offerings (EKS, GKE, AKS) abstract this completely.

⚠️

Back up etcd regularly

Even with 3 or 5 etcd nodes, take regular snapshots. A bug, operator error, or data corruption event can affect all etcd nodes simultaneously. Store snapshots outside the cluster: etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db