Kubernetes Internals Deep Dive
Most Kubernetes troubleshooting stays at the surface: check pod events, read logs, increase resource limits. But some problems require understanding what happens inside the control plane — why a controller doesn't react, why a Service isn't routing, why etcd is slow. This guide traces a kubectl apply all the way through the system to a running container.
API Server Request Lifecycle
Every kubectl command, operator reconcile, and controller watch goes through the API server. Understanding its pipeline explains admission webhook ordering, why some mutations don't appear immediately, and how API versioning works.
etcd Watch Mechanics
etcd is not a traditional database. It is a distributed key-value store built on the Raft consensus protocol. The API server is the only component that talks to etcd — all other components go through the API server.
The watch mechanism is how the entire Kubernetes control loop works:
- Clients (controllers, kubelet, kubectl) open a long-lived HTTP/2 stream to the API server: GET /api/v1/pods?watch=true&resourceVersion=87432.
- The API server maintains a watch cache: an in-memory copy of etcd state, populated via a single watch per resource type.
- When etcd fires a watch event (object created/updated/deleted), the API server fans it out to all connected watchers.
- The resourceVersion in the watch request lets the client resume from its last known state, so no events are missed during reconnects. If that version is too old to be served, the API server answers 410 Gone and the client must re-list before watching again.
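The resume behaviour can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: WatchCache, Record, EventsSince and ErrTooOld are hypothetical names invented for illustration, not the API server's actual types, and the bounded history stands in for whatever window the real watch cache retains.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a simplified watch event; the real API carries a full object.
type Event struct {
	ResourceVersion uint64
	Type            string // "ADDED", "MODIFIED" or "DELETED"
	Key             string // namespace/name
}

// ErrTooOld models the "410 Gone" answer a client gets when the requested
// resourceVersion is older than the history the cache still holds.
var ErrTooOld = errors.New("resourceVersion too old: relist required")

// WatchCache is a hypothetical bounded in-memory event history.
type WatchCache struct {
	capacity    int
	events      []Event // ordered by ascending ResourceVersion
	lastDropped uint64  // highest ResourceVersion evicted so far (0 = none)
}

// Record appends an event and evicts the oldest one when the history is full.
func (c *WatchCache) Record(e Event) {
	c.events = append(c.events, e)
	if len(c.events) > c.capacity {
		c.lastDropped = c.events[0].ResourceVersion
		c.events = c.events[1:]
	}
}

// EventsSince returns every retained event newer than rv, or ErrTooOld when
// an event the client has not seen yet was already evicted.
func (c *WatchCache) EventsSince(rv uint64) ([]Event, error) {
	if rv < c.lastDropped {
		return nil, ErrTooOld
	}
	var out []Event
	for _, e := range c.events {
		if e.ResourceVersion > rv {
			out = append(out, e)
		}
	}
	return out, nil
}

func main() {
	cache := &WatchCache{capacity: 3}
	for rv := uint64(87430); rv <= 87435; rv++ {
		cache.Record(Event{ResourceVersion: rv, Type: "MODIFIED", Key: "default/web"})
	}

	// A client that last saw 87432 can resume without a gap.
	events, err := cache.EventsSince(87432)
	fmt.Println(len(events), err)

	// A client that is far behind has to list again before watching.
	_, err = cache.EventsSince(87000)
	fmt.Println(err)
}
```

The design point is that the server decides whether a gap-free resume is possible. The client only remembers the last resourceVersion it processed.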
Informers & Work Queues
An informer is the standard Go library pattern for watching Kubernetes resources efficiently. It combines a List (initial state) + Watch (stream of changes) with a local in-memory cache (the Store/Indexer). Controllers do not query the API server on every reconcile: reads come from the informer's cache, and only writes go to the API server.
// Imports assumed: corev1 "k8s.io/api/core/v1" and "k8s.io/client-go/tools/cache".
// The informer watches the API server and maintains a local cache.
podInformer := factory.Core().V1().Pods()
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key) // enqueue the namespace/name key, NOT the object
		}
	},
	UpdateFunc: func(oldObj, newObj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
			queue.Add(key) // a key that is already waiting is not added twice
		}
	},
	DeleteFunc: func(obj interface{}) {
		// Also handles the DeletedFinalStateUnknown tombstone
		if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	},
})

// Worker: process one key from the queue; returns false once the queue is shut down
func (c *Controller) processNextItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)
	// Read from the CACHE, not the API server
	obj, exists, err := c.podIndexer.GetByKey(key.(string))
	if err != nil {
		c.queue.AddRateLimited(key) // retry later with backoff
		return true
	}
	if !exists {
		// handle deletion
		c.queue.Forget(key)
		return true
	}
	// Reconcile based on cache state
	c.reconcile(obj.(*corev1.Pod))
	c.queue.Forget(key)
	return true
}
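The work queue is rate-limited and deduplicating: a key that is already waiting in the queue is not added again, so if a pod is updated 100 times in 1 second, the controller runs far fewer than 100 reconciles, often just one. This is why controllers stay efficient even during event storms.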
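The deduplication itself is simple to sketch. The following is a minimal, single-goroutine illustration, assuming a hypothetical DedupQueue type; the real client-go work queue is additionally thread-safe and rate-limited, and tracks keys that are currently being processed.

```go
package main

import "fmt"

// DedupQueue is a hypothetical sketch of the deduplicating part of a work
// queue. It is not safe for concurrent use.
type DedupQueue struct {
	order   []string            // keys waiting, in FIFO order
	pending map[string]struct{} // set of keys currently in order
}

func NewDedupQueue() *DedupQueue {
	return &DedupQueue{pending: make(map[string]struct{})}
}

// Add enqueues key unless it is already waiting to be processed.
func (q *DedupQueue) Add(key string) {
	if _, waiting := q.pending[key]; waiting {
		return
	}
	q.pending[key] = struct{}{}
	q.order = append(q.order, key)
}

// Get removes and returns the oldest key; ok is false when the queue is empty.
func (q *DedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len reports how many distinct keys are waiting.
func (q *DedupQueue) Len() int { return len(q.order) }

func main() {
	q := NewDedupQueue()
	for i := 0; i < 100; i++ {
		q.Add("default/web-0") // 100 updates to the same pod
	}
	q.Add("default/web-1")
	fmt.Println(q.Len()) // prints 2
}
```

Because only the key is queued, the worker always reconciles against the latest cached state, no matter how many events were collapsed into that one entry.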
Controller Manager
The kube-controller-manager is a single binary running ~40 controllers in goroutines. Key controllers and what they watch:
| Controller | Watches | Acts on |
|---|---|---|
| Deployment | Deployments, ReplicaSets, Pods | Creates/updates/deletes ReplicaSets to match desired replicas |
| ReplicaSet | ReplicaSets, Pods | Creates/deletes Pods to match spec.replicas |
| StatefulSet | StatefulSets, Pods, PVCs | Creates Pods in order; provisions PVCs per pod |
| Node lifecycle | Nodes, Pods | Taints unreachable nodes; evicts pods after tolerationSeconds |
| Endpoint | Services, Pods | Updates Endpoints object when pod readiness changes |
| GarbageCollection | All objects with ownerReferences | Deletes orphaned objects when owner is deleted |
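Each controller in the table follows the same level-triggered pattern: compare desired state with observed state and act on the difference. A minimal sketch of that comparison for a ReplicaSet-style controller, assuming a hypothetical Action type and reconcileReplicas function (the real controller also handles pod ownership, expectations and deletion ordering):

```go
package main

import "fmt"

// Action is what a hypothetical ReplicaSet-style reconcile decides to do.
type Action struct {
	Create int // number of pods to create
	Delete int // number of pods to delete
}

// reconcileReplicas compares the desired replica count (spec.replicas) with
// the number of pods currently observed and returns the change needed to
// converge. It depends only on current state, not on which event woke it up.
func reconcileReplicas(desired, observed int) Action {
	switch {
	case observed < desired:
		return Action{Create: desired - observed}
	case observed > desired:
		return Action{Delete: observed - desired}
	default:
		return Action{}
	}
}

func main() {
	fmt.Printf("%+v\n", reconcileReplicas(3, 1)) // scale up
	fmt.Printf("%+v\n", reconcileReplicas(3, 5)) // scale down
	fmt.Printf("%+v\n", reconcileReplicas(3, 3)) // already converged
}
```

Since the decision depends only on current state, running the reconcile twice, or after a missed event, yields the same result.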
Kubelet CRI Loop
The kubelet runs on every node. It watches the API server for pods scheduled to its node (spec.nodeName == thisNode) and drives the container runtime via the Container Runtime Interface (CRI).
# 1. Scheduler binds the pod: spec.nodeName is set through the API server and persisted in etcd
# 2. Kubelet's pod informer fires — pod added to syncLoop
# 3. Kubelet admission checks (node-level)
# - resource fits on this node?
# - node selectors match?
# 4. Create pod sandbox (pause container, holds the network namespace)
# CRI call: RuntimeService.RunPodSandbox()
# CNI plugin called to configure pod network (IP, routes)
# 5. Pull container images (if not cached)
# CRI call: ImageService.PullImage()
# 6. Start containers in order: init containers → app containers
# CRI call: RuntimeService.CreateContainer() + StartContainer()
# 7. Run post-start lifecycle hooks (if defined)
# 8. Probe loop begins: liveness, readiness, startup probes
# Readiness probe passes → kubelet patches pod status.conditions
# EndpointSlice controller sees Ready → marks the pod ready in the Service's EndpointSlices; kube-proxy then programs the new backend
kube-proxy & iptables
kube-proxy watches Services and EndpointSlices and programs the kernel's netfilter (iptables or ipvs) rules on every node. When a pod connects to a Service ClusterIP, the kernel intercepts the packet before it leaves the node and rewrites the destination to one of the backend pod IPs — no userspace proxy involved.
# A Service with ClusterIP 10.96.10.50 pointing to 3 pods
# kube-proxy creates these iptables rules:
# PREROUTING/OUTPUT chain → KUBE-SERVICES
# KUBE-SERVICES: match ClusterIP → jump to KUBE-SVC-XXXX
# KUBE-SVC-XXXX: randomly select one of 3 endpoints (statistic module)
# 33% chance → KUBE-SEP-POD1 (DNAT to 10.244.1.5:8080)
# 50% of remaining → KUBE-SEP-POD2 (DNAT to 10.244.1.6:8080)
# 100% of remaining → KUBE-SEP-POD3 (DNAT to 10.244.2.3:8080)
# Inspect live rules
iptables-save | grep "10.96.10.50"
# Or use ipvs mode (lower overhead at scale)
ipvsadm -Ln | grep -A5 "10.96.10.50"
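The odd-looking percentages in the KUBE-SVC chain follow from the rules being evaluated in sequence: each rule's probability is conditional on all earlier rules not matching. A short sketch of the arithmetic, using hypothetical helper names ruleProbabilities and effectiveShares:

```go
package main

import "fmt"

// ruleProbabilities returns the match probability for each of n sequential
// rules: rule i (0-based) matches with probability 1/(n-i), given that no
// earlier rule matched.
func ruleProbabilities(n int) []float64 {
	probs := make([]float64, n)
	for i := range probs {
		probs[i] = 1.0 / float64(n-i)
	}
	return probs
}

// effectiveShares converts the conditional rule probabilities into the
// overall share of new connections that each endpoint receives.
func effectiveShares(probs []float64) []float64 {
	shares := make([]float64, len(probs))
	remaining := 1.0 // probability mass not yet claimed by an earlier rule
	for i, p := range probs {
		shares[i] = remaining * p
		remaining -= shares[i]
	}
	return shares
}

func main() {
	probs := ruleProbabilities(3)
	fmt.Printf("%.3f\n", probs)                  // [0.333 0.500 1.000]
	fmt.Printf("%.3f\n", effectiveShares(probs)) // [0.333 0.333 0.333]
}
```

So 33%, then 50% of the remainder, then 100% of the remainder gives every endpoint an equal one-third share.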
Garbage Collection
Kubernetes uses owner references to form a dependency DAG. The GC controller watches all objects and deletes orphans, that is, objects whose owner no longer exists. Three deletion modes (propagation policies) control what happens to dependents:
| Mode | Behaviour | Default for |
|---|---|---|
| Foreground | Owner gets a deletionTimestamp; GC deletes dependents first; owner deleted last. | Explicit --cascade=foreground |
| Background | Owner deleted immediately; GC deletes dependents asynchronously in background. | Most resources |
| Orphan | Owner deleted; dependents' ownerReferences cleared but they keep running. | Never by default |
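Background cascading deletion can be pictured as a walk over the owner-reference graph. The sketch below is a simplified model, not the GC controller's implementation: Object and collectDependents are hypothetical names, UIDs are plain strings, and an object is treated as garbage once all of its owners are gone.

```go
package main

import (
	"fmt"
	"sort"
)

// Object is a simplified stand-in for any API object; Owners holds the UIDs
// listed in its metadata.ownerReferences.
type Object struct {
	UID    string
	Owners []string
}

// collectDependents returns the sorted UIDs of all objects that become
// garbage once the object with UID deleted is removed: objects whose owners
// are all gone, followed transitively down the graph.
func collectDependents(objects []Object, deleted string) []string {
	gone := map[string]bool{deleted: true}
	for changed := true; changed; {
		changed = false
		for _, o := range objects {
			if gone[o.UID] || len(o.Owners) == 0 {
				continue // already collected, or has no owner at all
			}
			allGone := true
			for _, owner := range o.Owners {
				if !gone[owner] {
					allGone = false
					break
				}
			}
			if allGone {
				gone[o.UID] = true
				changed = true
			}
		}
	}
	delete(gone, deleted) // report dependents only, not the owner itself
	out := make([]string, 0, len(gone))
	for uid := range gone {
		out = append(out, uid)
	}
	sort.Strings(out)
	return out
}

func main() {
	objects := []Object{
		{UID: "deploy-web"},
		{UID: "rs-web-1", Owners: []string{"deploy-web"}},
		{UID: "pod-a", Owners: []string{"rs-web-1"}},
		{UID: "pod-b", Owners: []string{"rs-web-1"}},
		{UID: "pod-standalone"},
	}
	fmt.Println(collectDependents(objects, "deploy-web")) // [pod-a pod-b rs-web-1]
}
```

In Orphan mode the same graph is used differently: the ownerReferences entries are cleared instead, so the dependents never qualify for collection.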
Why This Matters Operationally
Understanding internals directly improves incident diagnosis:
- Controller not reacting? Check the controller's work queue (controller-manager logs with --v=4). The informer cache may be stale due to a list-watch error.
- Service not routing? kube-proxy may not have picked up a new Endpoint. kubectl get endpointslices -n ns -l kubernetes.io/service-name=svc-name -o yaml shows the ground truth; iptables-save | grep clusterip shows what the kernel actually has.
- Pod stuck in ContainerCreating? Kubelet is usually waiting for CNI (network setup) or an image pull. The events in kubectl describe pod show which step is stalled or failing.
- etcd latency spike? High p99 etcd write latency causes API server request timeouts and cascades into controller backlog. Monitor etcd_disk_backend_commit_duration_seconds.