Kubernetes Architecture

A Kubernetes cluster consists of two types of machines: control plane nodes (one or more machines that manage the cluster) and worker nodes (machines that run your workloads). The components are loosely coupled, so each can be replaced, and in most cases replicated, independently. This guide walks through each component, what it does, and how they interact.

Figure: Kubernetes cluster components. The control plane manages cluster state; worker nodes run your workloads.

Control plane
  kube-apiserver: the front door. All requests are authenticated, validated, and stored here. Port 6443 (HTTPS).
  etcd: distributed key-value store. Holds all cluster state. Back it up.
  kube-scheduler: watches unscheduled pods and picks a node based on resources and constraints.
  kube-controller-manager: runs reconcile loops (Node, ReplicaSet, Deployment, Job, and EndpointSlice controllers).

Worker node
  kubelet: agent on every node. Runs pods through the container runtime (via the CRI) and reports status to the API server.
  kube-proxy: maintains iptables/IPVS rules for Service routing on this node.
  container runtime: pulls images and runs containers. Usually containerd or CRI-O.
  Pods: each pod runs one or more app containers.

Cluster Overview

At the highest level, a Kubernetes cluster is a set of machines (nodes) that run containerised applications. Every cluster has at minimum a control plane and one or more worker nodes.

In production, the control plane runs on dedicated nodes (often 3 for high availability) and worker nodes are separate. In development clusters (like minikube), everything runs on a single node.

Control Plane

The control plane is responsible for maintaining the desired state of the cluster — what applications are running, which container images they use, and how many replicas of each. It consists of four main components.

kube-apiserver

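The API server is the front door to Kubernetes. Every operation in the cluster, whether triggered by kubectl, a CI/CD pipeline, or an internal controller, goes through the API server. It authenticates and authorises each request, validates the submitted object, and persists the result to etcd.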

The API server is designed to scale horizontally — you can run multiple instances behind a load balancer for high availability.
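The stages a request passes through can be sketched in a few lines of Python. This is a toy model, not the real API server: `ToyRequestPipeline`, `Request`, the user names, and the dictionary standing in for etcd are all hypothetical, and real clusters add admission control and much richer validation.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    user: str       # hypothetical caller identity
    verb: str       # e.g. "create"
    resource: str   # e.g. "deployments"
    obj: dict       # the manifest being submitted


@dataclass
class ToyRequestPipeline:
    # Stand-ins for real authentication, RBAC rules, and etcd.
    known_users: set = field(default_factory=set)
    allowed: set = field(default_factory=set)   # (user, verb, resource) triples
    store: dict = field(default_factory=dict)   # plays the role of etcd

    def handle(self, req: Request) -> int:
        """Return an HTTP-like status code for the request."""
        # 1. Authenticate: who is calling?
        if req.user not in self.known_users:
            return 401
        # 2. Authorise: may this caller perform this verb on this resource?
        if (req.user, req.verb, req.resource) not in self.allowed:
            return 403
        # 3. Validate: here, only check that the object has a name.
        name = req.obj.get("metadata", {}).get("name")
        if not name:
            return 422
        # 4. Persist: only now is anything written to the store.
        self.store[f"/{req.resource}/{name}"] = req.obj
        return 201


api = ToyRequestPipeline(
    known_users={"alice"},
    allowed={("alice", "create", "deployments")},
)
status = api.handle(
    Request("alice", "create", "deployments", {"metadata": {"name": "web"}})
)
print(status, sorted(api.store))
```

The point of the ordering is that nothing reaches the store unless every earlier stage has passed, which is why the API server can act as the single gatekeeper for cluster state.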

etcd

etcd is a consistent, distributed key-value store used as Kubernetes' backing store for all cluster data. It holds the complete state of the cluster: what pods exist, what services are configured, what secrets are stored, what nodes have joined.

⚠️
Back up etcd

If you lose etcd without a backup, you lose your cluster's entire state. In production, etcd should run as a 3- or 5-node cluster for fault tolerance, and you should take regular snapshots, for example with etcdctl snapshot save snapshot.db.

kube-scheduler

The scheduler watches for newly created pods that have no node assigned, and selects a node for them to run on. It evaluates multiple factors, including the resources the pod requests and any scheduling constraints placed on it.

The scheduler does not run pods — it just decides where they should run, writing the decision back to the API server.
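A rough way to picture the decision is a filter step followed by a score step. The sketch below assumes a deliberately simple model: the `Node` and `Pod` classes, the label-matching rule, and the "most CPU left over" heuristic are illustrative choices, not the real scheduler's plugins or weights.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu_m: int    # free CPU in millicores
    free_mem_mi: int   # free memory in MiB
    labels: dict


@dataclass
class Pod:
    name: str
    cpu_m: int           # requested CPU in millicores
    mem_mi: int          # requested memory in MiB
    node_selector: dict  # labels the chosen node must carry


def fits(pod: Pod, node: Node) -> bool:
    """Filter step: enough free resources and every selector label matches."""
    has_room = node.free_cpu_m >= pod.cpu_m and node.free_mem_mi >= pod.mem_mi
    labels_ok = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    return has_room and labels_ok


def score(pod: Pod, node: Node) -> int:
    """Score step (toy heuristic): prefer the node with the most CPU left over."""
    return node.free_cpu_m - pod.cpu_m


def pick_node(pod: Pod, nodes: list) -> Optional[str]:
    """Return the chosen node name, or None if no node fits (pod stays Pending)."""
    candidates = [n for n in nodes if fits(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name


nodes = [
    Node("node-a", 500, 1024, {"disk": "ssd"}),
    Node("node-b", 2000, 4096, {"disk": "hdd"}),
    Node("node-c", 1000, 2048, {"disk": "ssd"}),
]
print(pick_node(Pod("web", 250, 512, {"disk": "ssd"}), nodes))
```

Note that `pick_node` only returns a name. In the same way, the real scheduler records its choice on the pod and leaves starting the containers to the kubelet.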

kube-controller-manager

The controller manager runs a collection of control loops (controllers) that watch cluster state and make changes to move the actual state toward the desired state. Key controllers include:

Controller | Responsibility
Node controller | Notices and responds when nodes go down
Job controller | Watches Job objects and creates pods to run one-off tasks
EndpointSlice controller | Populates EndpointSlice objects (linking Services to Pods)
ServiceAccount controller | Creates default ServiceAccounts for new namespaces
ReplicaSet controller | Maintains the correct number of pod replicas
Deployment controller | Manages Deployments, creating/updating ReplicaSets

Worker Nodes

Worker nodes are the machines that actually run your workloads. Every worker node runs three core components.

kubelet

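The kubelet is an agent that runs on every worker node. It receives pod specifications (PodSpecs) from the API server and ensures the containers described in them are running and healthy. Specifically, it instructs the container runtime (via the CRI) to pull images and start containers, and it reports pod and node status back to the API server.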

The kubelet does not manage containers that were not created by Kubernetes — it only manages pods.

kube-proxy

kube-proxy runs on each node and maintains network rules that allow network communication to pods from sessions inside or outside the cluster. It implements part of the Kubernetes Service concept — when you create a Service, kube-proxy creates iptables (or IPVS) rules that route traffic to the correct pod endpoints.
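Conceptually, the rules kube-proxy installs map a Service address to one of the pod endpoints behind it. The sketch below models that lookup in plain Python; the IP addresses and the `rules` table are made up for illustration, real rules live in the kernel (iptables or IPVS) rather than in a dictionary, and the sketch assumes every listed Service has at least one endpoint.

```python
import random

# Hypothetical Service virtual IP and port mapped to the pod endpoints behind it.
rules = {
    ("10.96.0.10", 80): ["10.244.1.5:8080", "10.244.2.7:8080"],
}


def route(dest_ip: str, dest_port: int, rng: random.Random) -> str:
    """Rewrite a Service destination to one of its backing pod endpoints."""
    endpoints = rules.get((dest_ip, dest_port))
    if endpoints is None:
        # Not a Service address: leave the destination untouched.
        return f"{dest_ip}:{dest_port}"
    # Pick one backend at random for this connection.
    return rng.choice(endpoints)


rng = random.Random()
print(route("10.96.0.10", 80, rng))   # one of the two pod endpoints
print(route("192.0.2.1", 443, rng))   # unchanged, not a Service address
```

When pods behind a Service come and go, only the endpoint list changes; clients keep using the same Service address.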

Container Runtime

The container runtime is the software responsible for pulling container images and running them. Kubernetes supports any runtime that implements the CRI (Container Runtime Interface). Common choices:

Runtime | Notes
containerd | Default for most managed K8s offerings (EKS, GKE, AKS). Lightweight, OCI-compliant.
CRI-O | Purpose-built for Kubernetes, used by OpenShift. Minimal footprint.
Docker Engine (via cri-dockerd) | The built-in Docker integration (dockershim) was deprecated in K8s 1.20 and removed in 1.24. Docker Engine now requires the cri-dockerd shim.

Add-ons

Add-ons extend the functionality of a Kubernetes cluster. They use cluster resources (DaemonSets, Deployments, etc.) to implement cluster features. Two that almost every cluster relies on are cluster DNS (typically CoreDNS) and a CNI network plugin.

API Request Flow

Understanding how a kubectl apply command flows through the cluster helps demystify Kubernetes. When you run kubectl apply -f deployment.yaml:

  1. kubectl reads your kubeconfig and sends an HTTPS request, carrying your credentials, to the kube-apiserver.
  2. The API server authenticates the request, authorises it via RBAC, validates the manifest, and persists the object to etcd.
  3. The Deployment controller (in kube-controller-manager) notices the new Deployment and creates a ReplicaSet.
  4. The ReplicaSet controller notices it needs N pods and creates Pod objects through the API server, which stores them in etcd.
  5. The kube-scheduler notices unscheduled pods and assigns each one to a node.
  6. The kubelet on the chosen node notices the pod assignment and instructs the container runtime (containerd in this example) to pull the image and start the container.
  7. containerd starts the container; the kubelet reports Running status back to the API server.

The entire process is event-driven and eventually consistent. Each component watches the API server for the changes it cares about and reacts to them. No component calls another directly; all coordination happens through objects stored via the API server.
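The chain of reactions can be simulated with a small in-memory model. Everything here is an assumption made for illustration: `ToyApiServer`, `wire_components`, the object shapes, and the single node name are hypothetical, and callbacks run synchronously, whereas real components watch over the network and react asynchronously.

```python
from collections import defaultdict


class ToyApiServer:
    """In-memory stand-in for kube-apiserver plus etcd."""

    def __init__(self):
        self.objects = {}                   # (kind, name) -> object dict
        self.watchers = defaultdict(list)   # kind -> list of callbacks

    def watch(self, kind, callback):
        self.watchers[kind].append(callback)

    def write(self, kind, name, obj):
        """Persist an object, then notify everyone watching that kind."""
        self.objects[(kind, name)] = obj
        for callback in list(self.watchers[kind]):
            callback(name, obj)


def wire_components(api, node_name="node-1"):
    """Register toy versions of the components; each talks only to the API."""

    def deployment_controller(name, dep):
        api.write("ReplicaSet", f"{name}-rs", {"replicas": dep["replicas"]})

    def replicaset_controller(name, rs):
        for i in range(rs["replicas"]):
            api.write("Pod", f"{name}-{i}", {"nodeName": None, "phase": "Pending"})

    def scheduler(name, pod):
        # Only act on pods that have no node yet.
        if pod["nodeName"] is None:
            api.write("Pod", name, {**pod, "nodeName": node_name})

    def kubelet(name, pod):
        # Only act on pods assigned to this node that are not running yet.
        if pod["nodeName"] == node_name and pod["phase"] == "Pending":
            api.write("Pod", name, {**pod, "phase": "Running"})

    api.watch("Deployment", deployment_controller)
    api.watch("ReplicaSet", replicaset_controller)
    api.watch("Pod", scheduler)
    api.watch("Pod", kubelet)


api = ToyApiServer()
wire_components(api)
api.write("Deployment", "web", {"replicas": 2})   # the "kubectl apply" step
pods = {name: obj for (kind, name), obj in api.objects.items() if kind == "Pod"}
print(pods)
```

A single write of the Deployment is enough to end up with running pods, even though none of the four functions ever calls another one; each only reads events and writes objects.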

Figure: API request flow for kubectl apply -f deployment.yaml. kubectl (your terminal) sends the request over HTTPS REST to kube-apiserver, which validates the object and writes it to etcd. A watch event prompts controller-manager to create the ReplicaSet and Pod objects; the scheduler assigns a node by writing nodeName to the pod spec; another watch event prompts the kubelet to pull the image and start the container. No component calls another directly.

The Reconcile Loop

Every controller in Kubernetes follows the same pattern: observe, diff, act. This is the reconcile loop (also called the control loop).

  1. Observe — the controller reads the current state of the cluster from the API server using an efficient watch mechanism (not polling).
  2. Diff — it compares the current state against the desired state as declared in the relevant objects (Deployment, ReplicaSet, etc.).
  3. Act — if there is a gap, it takes the smallest action necessary to move toward the desired state and writes the result back to the API server.

Figure: the reconcile loop (observe, diff, act). Every controller in Kubernetes runs this loop continuously; it is crash-safe because each step is idempotent.

This loop runs continuously. Because controllers react to watch events rather than waiting for a polling timer, Kubernetes usually responds to drift quickly, often within seconds. The reconcile loop also makes Kubernetes eventually consistent: the system keeps trying until actual state matches desired state, even across retries and restarts.
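The observe, diff, act pattern can be written down as a single function. The following is a minimal sketch for a toy ReplicaSet, assuming pods are just names in a list; `reconcile` and `next_free_index` are hypothetical helpers, not Kubernetes code.

```python
def next_free_index(pods: list) -> int:
    """Smallest non-negative index not already used in a pod name."""
    used = {int(p.rsplit("-", 1)[1]) for p in pods}
    i = 0
    while i in used:
        i += 1
    return i


def reconcile(desired_replicas: int, running_pods: list) -> list:
    """One pass of observe, diff, act. Returns the actions that were taken."""
    actions = []
    # Observe: running_pods is the current state handed to us.
    # Diff: compare it with the declared replica count.
    gap = desired_replicas - len(running_pods)
    # Act: create or delete just enough pods to close the gap.
    for _ in range(max(gap, 0)):
        name = f"pod-{next_free_index(running_pods)}"
        running_pods.append(name)
        actions.append(("create", name))
    for _ in range(max(-gap, 0)):
        actions.append(("delete", running_pods.pop()))
    return actions


pods = []
print(reconcile(3, pods))   # creates three pods
print(reconcile(3, pods))   # nothing to do, state already matches
pods.remove("pod-1")        # simulate a pod crashing
print(reconcile(3, pods))   # recreates only the missing pod
```

Running the function again when nothing has changed produces no actions, which is what allows a controller to re-run it after every event or restart without side effects.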

💡
Why idempotency matters

Because a controller can run its reconcile loop at any time — including after a crash and restart — every action must be idempotent: applying it twice produces the same result as applying it once. This is why Kubernetes uses desired-state declarations rather than one-time commands.
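The difference between a one-time command and a desired-state declaration shows up as soon as a step is repeated. This small sketch uses hypothetical helper names and a plain dictionary as the "cluster state"; it assumes the retry simply re-runs the same action.

```python
def scale_imperative(state: dict, delta: int) -> None:
    """One-time command: 'add delta replicas'. Not safe to repeat."""
    state["replicas"] += delta


def scale_declarative(state: dict, desired: int) -> None:
    """Desired-state declaration: 'there should be this many replicas'."""
    state["replicas"] = desired


def run_with_retry(action, state: dict, arg: int, attempts: int = 2) -> int:
    """Simulate a controller that crashes and re-runs the same step."""
    for _ in range(attempts):
        action(state, arg)
    return state["replicas"]


# Starting from 1 replica and aiming for 3:
print(run_with_retry(scale_imperative, {"replicas": 1}, 2))    # overshoots
print(run_with_retry(scale_declarative, {"replicas": 1}, 3))   # stays at 3
```

The imperative version ends up with 5 replicas after one retry, while the declarative version lands on 3 however many times it runs.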

Node Lifecycle

Nodes are the worker machines that run your workloads. Understanding how they join, report health, and are removed is essential for cluster operations.

Joining a cluster

When a node starts, its kubelet automatically registers itself with the API server (via a POST to /api/v1/nodes). In kubeadm-managed clusters, the join command provides the bootstrap token and API server address:

kubeadm join <api-server>:6443 \
  --token <bootstrap-token> \
  --discovery-token-ca-cert-hash sha256:<hash>

Node conditions

The kubelet reports node health to the API server through Node conditions:

Condition | Meaning when True
Ready | Node is healthy and accepting pods. False means the kubelet reports a problem; Unknown means the node controller has lost contact with the node (after about 40s by default, depending on the Kubernetes version).
MemoryPressure | Node is running low on memory.
DiskPressure | Disk capacity or inodes are low.
PIDPressure | Too many processes on the node.
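Conditions appear in a Node object as a list under status.conditions, each with a type and a string status of "True", "False", or "Unknown". The helper below is a hypothetical sketch of how a script might interpret them; the sample node dictionary is invented, and only the fields used here are shown.

```python
def node_problems(node: dict) -> list:
    """List human-readable problems found in a Node's status.conditions."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond["type"], cond["status"]
        if ctype == "Ready":
            # Ready is the only condition that is healthy when True.
            if status != "True":
                problems.append(f"Ready is {status}")
        elif status == "True":
            # Pressure conditions are healthy when False.
            problems.append(f"{ctype} is True")
    return problems


sample_node = {
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {"type": "DiskPressure", "status": "True"},
            {"type": "PIDPressure", "status": "False"},
            {"type": "Ready", "status": "Unknown"},
        ]
    }
}
print(node_problems(sample_node))
```

The asymmetry is the thing to remember: a healthy node has Ready set to True and every pressure condition set to False.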

Removing a node

Before decommissioning a node, drain it to gracefully evict all pods:

# Prevent new pods from being scheduled on the node
kubectl cordon node-1

# Evict existing pods (respects PodDisruptionBudgets)
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# Remove the node object once the machine is shut down
kubectl delete node node-1

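kubectl drain never evicts DaemonSet-managed pods, because the DaemonSet controller would immediately recreate them on the same node. Without --ignore-daemonsets, drain refuses to proceed when such pods are present, so the flag is almost always required.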

High Availability

A single-node control plane is a single point of failure. In production, you run multiple control plane nodes.

etcd quorum

etcd uses the Raft consensus algorithm, which requires a quorum (majority) to accept writes. Tolerable failures for common cluster sizes:

etcd nodes | Quorum required | Failures tolerated
1 | 1 | 0 (no fault tolerance)
3 | 2 | 1
5 | 3 | 2

Three control plane nodes is the standard minimum for production. Five nodes are used when you want to tolerate two simultaneous control plane failures.
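The numbers in the table follow from the majority rule: a quorum is more than half of the members. A short calculation, with hypothetical function names, reproduces them and shows why even-sized clusters are not used.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Members that can fail while a majority is still available."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), failures_tolerated(n))
```

Going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure, so the extra member adds cost without adding fault tolerance. That is why odd sizes such as 3 and 5 are the usual choices.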

Stacked vs external etcd

In a stacked topology, etcd runs on the same nodes as the other control plane components — simpler to operate. In an external etcd topology, etcd runs on dedicated nodes — stronger isolation but more hardware. Most managed Kubernetes offerings (EKS, GKE, AKS) abstract this completely.

⚠️
Back up etcd regularly

Even with 3 or 5 etcd nodes, take regular snapshots. A bug, operator error, or data corruption event can affect all etcd nodes simultaneously. Store snapshots outside the cluster: etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db