Advanced Topics

Kubernetes Internals Deep Dive

● Advanced ⏱ 25 min read

Most Kubernetes troubleshooting stays at the surface: check pod events, read logs, increase resource limits. But some problems require understanding what happens inside the control plane — why a controller doesn't react, why a Service isn't routing, why etcd is slow. This guide traces a kubectl apply all the way through the system to a running container.

API Server Request Lifecycle

Every kubectl command, operator reconcile, and controller watch goes through the API server. Understanding its pipeline explains admission webhook ordering, why some mutations don't appear immediately, and how API versioning works.

kubectl apply — from HTTP request to etcd write

TLS + Authentication — verify client cert / bearer token / OIDC JWT. Identity established.

Authorization — RBAC check: can this user/SA perform this verb on this resource in this namespace?

Mutating admission webhooks — called one at a time, in sequence, so each webhook sees the changes made by the previous one. May inject sidecars, add labels, set defaults.

Object schema validation — OpenAPI v3 structural schema + CRD validation. Unknown fields pruned.

Validating admission webhooks + CEL policies — final gate. Webhooks are called in parallel and any single rejection fails the request. No mutations permitted.

etcd write — object persisted with resourceVersion. Watch events fired to all watchers.

API server pipeline: authn → authz → mutating admission → schema validation → validating admission → etcd write → watch events fired. Every stage before the etcd write can reject the request.
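The ordering above can be sketched as a chain of stages where the first error stops the request. This is a toy model, not API server code: the Request, Stage, Admit, and DefaultStages names and the rules inside each stage are assumptions made up for illustration, and schema validation is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a simplified stand-in for an API request (hypothetical type).
type Request struct {
	User   string
	Verb   string
	Labels map[string]string
}

// Stage is one step of the pipeline. Returning an error rejects the request.
type Stage struct {
	Name string
	Run  func(*Request) error
}

// Admit runs the stages in order and stops at the first rejection.
// It returns the name of the rejecting stage, or "" if every stage passed.
func Admit(req *Request, stages []Stage) (string, error) {
	for _, s := range stages {
		if err := s.Run(req); err != nil {
			return s.Name, err
		}
	}
	return "", nil
}

// DefaultStages builds a toy pipeline in the same order as the list above.
// The rules are invented: only "admin" may write, and a "team" label is required.
func DefaultStages() []Stage {
	return []Stage{
		{Name: "authn", Run: func(r *Request) error {
			if r.User == "" {
				return errors.New("401 Unauthorized")
			}
			return nil
		}},
		{Name: "authz", Run: func(r *Request) error {
			if r.User != "admin" && r.Verb != "get" {
				return errors.New("403 Forbidden")
			}
			return nil
		}},
		{Name: "mutating", Run: func(r *Request) error {
			if r.Labels == nil {
				r.Labels = map[string]string{}
			}
			r.Labels["injected"] = "true" // the mutation is visible to later stages
			return nil
		}},
		{Name: "validating", Run: func(r *Request) error {
			if r.Labels["team"] == "" {
				return errors.New("denied: label team is required")
			}
			return nil
		}},
	}
}

func main() {
	req := &Request{User: "admin", Verb: "create", Labels: map[string]string{"team": "infra"}}
	stage, err := Admit(req, DefaultStages())
	fmt.Println("rejected by:", stage, "error:", err, "labels:", req.Labels)
}
```

Because the validating stage runs after the mutating stage, it sees the injected label. That is the practical reason policy checks belong in validating webhooks rather than mutating ones.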

etcd Watch Mechanics

etcd is not a traditional database. It is a distributed key-value store built on the Raft consensus protocol. The API server is the only component that talks to etcd — all other components go through the API server.

The watch mechanism is how the entire Kubernetes control loop works:

Clients (controllers, kubelet, kubectl) open a long-lived HTTP/2 stream to the API server: GET /api/v1/pods?watch=true&resourceVersion=87432.
The API server maintains a watch cache: an in-memory copy of etcd state, populated via a single watch per resource type.
When etcd fires a watch event (object created/updated/deleted), the API server fans it out to all connected watchers simultaneously.
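The resourceVersion in the watch request lets the client resume from its last known state, so no events are missed across a reconnect as long as that version is still in the server's history. If it has already been compacted away, the server answers 410 Gone and the client must re-list and start a new watch.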
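The resume behaviour can be illustrated with a small bounded event history. This is a minimal sketch under stated assumptions: EventLog, Append, Since, and ErrGone are hypothetical names, resourceVersion is modelled as a plain integer (real clients must treat it as opaque), and the real watch cache is considerably more involved.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a simplified watch event; RV stands in for resourceVersion.
type Event struct {
	RV   int
	Type string // "ADDED", "MODIFIED" or "DELETED"
	Key  string
}

// ErrGone models the HTTP 410 Gone returned when the requested version is too old.
var ErrGone = errors.New("410 Gone: resourceVersion too old, re-list required")

// EventLog keeps only the most recent `capacity` events (hypothetical type).
type EventLog struct {
	events      []Event // oldest first
	capacity    int
	compactedRV int // RV of the newest event that has been dropped
}

// Append records an event and drops the oldest one when the history is full.
func (l *EventLog) Append(e Event) {
	l.events = append(l.events, e)
	if len(l.events) > l.capacity {
		l.compactedRV = l.events[0].RV
		l.events = l.events[1:]
	}
}

// Since replays every event newer than rv, or fails if part of that
// range has already been dropped from the history.
func (l *EventLog) Since(rv int) ([]Event, error) {
	if rv < l.compactedRV {
		return nil, ErrGone
	}
	var out []Event
	for _, e := range l.events {
		if e.RV > rv {
			out = append(out, e)
		}
	}
	return out, nil
}

func main() {
	history := &EventLog{capacity: 3}
	for rv := 1; rv <= 5; rv++ {
		history.Append(Event{RV: rv, Type: "MODIFIED", Key: "default/web"})
	}
	evs, err := history.Since(3) // recent enough: the missed events are replayed
	fmt.Println(len(evs), err)
	_, err = history.Since(1) // too old: the client has to re-list
	fmt.Println(err)
}
```

Informers handle the second case automatically by falling back to a fresh List followed by a new Watch.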

Informers & Work Queues

An informer is the standard Go library pattern for watching Kubernetes resources efficiently. It combines a List (initial state) + Watch (stream of changes) with a local in-memory cache (the Store/Indexer). Controllers read from the informer's cache instead of issuing GET or LIST calls on every reconcile; only their writes go to the API server directly.

informer + work queue — controller pattern

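// Assumed imports (not shown in the original snippet):
//   corev1 "k8s.io/api/core/v1"
//   "k8s.io/client-go/tools/cache"

// Informer watches the API server and maintains a local cache
podInformer := factory.Core().V1().Pods()
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc: func(obj interface{}) {
        if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
            queue.Add(key)         // enqueue the namespace/name key, NOT the object
        }
    },
    UpdateFunc: func(old, new interface{}) {
        if key, err := cache.MetaNamespaceKeyFunc(new); err == nil {
            queue.Add(key)         // deduplicated: a key already waiting is not added twice
        }
    },
    DeleteFunc: func(obj interface{}) {
        // Also unwraps DeletedFinalStateUnknown tombstones
        if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
            queue.Add(key)
        }
    },
})

// Worker: process one key from the queue; returns false once the queue shuts down
func (c *Controller) processNextItem() bool {
    key, quit := c.queue.Get()
    if quit {
        return false
    }
    defer c.queue.Done(key)

    // Read from the CACHE, not the API server
    obj, exists, err := c.podIndexer.GetByKey(key.(string))
    if err != nil {
        c.queue.AddRateLimited(key) // retry later with backoff
        return true
    }
    if !exists { /* handle deletion */ return true }

    // Reconcile based on cache state
    c.reconcile(obj.(*corev1.Pod))
    c.queue.Forget(key)             // reset the backoff counter for this key
    return true
}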

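The work queue deduplicates: a key that is already waiting is not added again, and a key that changes while its reconcile is running is re-queued once when that reconcile finishes. If a pod is updated 100 times in 1 second, the controller runs far fewer than 100 reconciles, often just one or two. Retries go through the rate limiter (AddRateLimited), which applies per-key backoff. This is why controllers stay efficient even during event storms.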
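The deduplication rules can be modelled in a few lines. The sketch below is a single-goroutine toy that mirrors the dirty/processing bookkeeping used by client-go's workqueue; it leaves out locking, shutdown, and rate limiting, and the Queue type here is an illustration, not the library's implementation.

```go
package main

import "fmt"

// Queue is a toy model of a deduplicating work queue (not safe for concurrent use).
type Queue struct {
	order      []string        // keys waiting to be handed to a worker
	dirty      map[string]bool // keys that still need a reconcile
	processing map[string]bool // keys currently held by a worker
}

func NewQueue() *Queue {
	return &Queue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add enqueues key unless it is already waiting.
func (q *Queue) Add(key string) {
	if q.dirty[key] {
		return // already pending: the duplicate is dropped
	}
	q.dirty[key] = true
	if q.processing[key] {
		return // a worker holds it: Done will re-queue it
	}
	q.order = append(q.order, key)
}

// Get pops the next key and marks it as in progress.
func (q *Queue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	q.processing[key] = true
	delete(q.dirty, key)
	return key, true
}

// Done marks key finished; if it was re-added while in progress, it is queued again.
func (q *Queue) Done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.order = append(q.order, key)
	}
}

// Len reports how many keys are waiting.
func (q *Queue) Len() int { return len(q.order) }

func main() {
	q := NewQueue()
	for i := 0; i < 100; i++ {
		q.Add("default/web-0") // 100 updates to the same pod
	}
	fmt.Println("pending after 100 adds:", q.Len())

	key, _ := q.Get()
	q.Add(key) // another update arrives while the reconcile is running
	q.Done(key)
	fmt.Println("pending after Done:", q.Len())
}
```

Enqueuing keys rather than objects is what makes this work: the worker always reconciles against the latest cached state, so collapsing intermediate updates loses nothing.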

Controller Manager

The kube-controller-manager is a single binary running ~40 controllers in goroutines. Key controllers and what they watch:

Controller	Watches	Action
Deployment	Deployments, ReplicaSets, Pods	Creates/updates/deletes ReplicaSets to match desired replicas
ReplicaSet	ReplicaSets, Pods	Creates/deletes Pods to match spec.replicas
StatefulSet	StatefulSets, Pods, PVCs	Creates Pods in order; provisions PVCs per pod
Node lifecycle	Nodes, Pods	Taints unreachable nodes; evicts pods after tolerationSeconds
Endpoint	Services, Pods	Updates Endpoints object when pod readiness changes
GarbageCollection	All objects with ownerReferences	Deletes dependents whose owners no longer exist
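Each of these controllers follows the same level-triggered shape: compare desired state with observed state and act on the difference. A minimal sketch for the ReplicaSet row, assuming a hypothetical ReconcileReplicas helper and Action type. The real controller ranks pods before deleting (for example, preferring not-ready pods), which this version does not attempt.

```go
package main

import "fmt"

// Action describes what one reconcile pass would do (hypothetical type).
type Action struct {
	Create int      // number of pods to create
	Delete []string // names of pods to delete
}

// ReconcileReplicas compares the desired replica count with the pods
// currently observed in the cache. It assumes desired >= 0 and naively
// deletes from the end of the slice when there are too many pods.
func ReconcileReplicas(desired int, pods []string) Action {
	diff := desired - len(pods)
	switch {
	case diff > 0:
		return Action{Create: diff}
	case diff < 0:
		return Action{Delete: pods[desired:]}
	}
	return Action{} // already converged: nothing to do
}

func main() {
	fmt.Printf("%+v\n", ReconcileReplicas(3, []string{"web-a"}))
	fmt.Printf("%+v\n", ReconcileReplicas(1, []string{"web-a", "web-b", "web-c"}))
}
```

Because the function depends only on current state and not on which event triggered it, running it twice or skipping intermediate events gives the same result.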

Kubelet CRI Loop

The kubelet runs on every node. It watches the API server for pods scheduled to its node (spec.nodeName == thisNode) and drives the container runtime via the Container Runtime Interface (CRI).

what happens when a pod is scheduled to a node

# 1. Scheduler binds the pod via the API server, which sets spec.nodeName and persists it to etcd
# 2. Kubelet's pod informer fires — pod added to syncLoop

# 3. Kubelet admission checks (node-level)
#    - resource fits on this node?
#    - node selectors match?

# 4. Create pod sandbox (pause container, network namespace)
#    CRI call: RuntimeService.RunPodSandbox()
#    CNI plugin called to configure pod network (IP, routes)

# 5. Pull container images (if not cached), per container, just before it is created
#    CRI call: ImageService.PullImage()

# 6. Start containers in order: init containers → app containers
#    CRI call: RuntimeService.CreateContainer() + StartContainer()

# 7. Run post-start lifecycle hooks (if defined)

# 8. Probe loop begins: liveness, readiness, startup probes
#    Readiness probe passes → kubelet patches pod status.conditions
#    EndpointSlice controller sees Ready → adds the pod to the Service's EndpointSlices
#    kube-proxy sees the EndpointSlice change → programs the node's routing rules

kube-proxy & iptables

kube-proxy watches Services and EndpointSlices and programs the kernel's netfilter (iptables or ipvs) rules on every node. When a pod connects to a Service ClusterIP, the kernel intercepts the packet before it leaves the node and rewrites the destination to one of the backend pod IPs — no userspace proxy involved.

trace Service routing in iptables

# A Service with ClusterIP 10.96.10.50 pointing to 3 pods
# kube-proxy creates these iptables rules:

# PREROUTING/OUTPUT chain → KUBE-SERVICES
# KUBE-SERVICES: match ClusterIP → jump to KUBE-SVC-XXXX
# KUBE-SVC-XXXX: randomly select one of 3 endpoints (statistic module)
#   33% chance → KUBE-SEP-POD1 (DNAT to 10.244.1.5:8080)
#   50% of remaining → KUBE-SEP-POD2 (DNAT to 10.244.1.6:8080)
#   100% of remaining → KUBE-SEP-POD3 (DNAT to 10.244.2.3:8080)

# Inspect live rules
iptables-save | grep "10.96.10.50"

# Or use ipvs mode (lower overhead at scale)
ipvsadm -Ln | grep -A5 "10.96.10.50"
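The 33% / 50% / 100% figures look uneven but give every endpoint the same share, because each rule is evaluated only for traffic the earlier rules did not match. The arithmetic can be checked with a short sketch; RuleProbabilities and EffectiveShares are hypothetical helper names, not part of kube-proxy.

```go
package main

import "fmt"

// RuleProbabilities returns the match probability written into each
// KUBE-SVC rule for n endpoints: rule i (0-based) matches with 1/(n-i).
func RuleProbabilities(n int) []float64 {
	probs := make([]float64, n)
	for i := range probs {
		probs[i] = 1 / float64(n-i)
	}
	return probs
}

// EffectiveShares converts the chained per-rule probabilities into each
// endpoint's overall share of traffic.
func EffectiveShares(rule []float64) []float64 {
	shares := make([]float64, len(rule))
	remaining := 1.0 // fraction of traffic not yet matched by an earlier rule
	for i, p := range rule {
		shares[i] = remaining * p
		remaining -= shares[i]
	}
	return shares
}

func main() {
	rule := RuleProbabilities(3)
	fmt.Printf("rule probabilities: %.3f\n", rule)
	fmt.Printf("effective shares:   %.3f\n", EffectiveShares(rule))
}
```

For three endpoints the rules carry 1/3, 1/2 and 1, and the effective shares come out at one third each. The same reasoning explains why the chain grows linearly with the number of endpoints, which is the overhead IPVS mode avoids.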

Garbage Collection

Kubernetes uses owner references to form a dependency DAG. The GC controller watches all objects and deletes orphans, that is, objects whose owners no longer exist. Three deletion propagation modes control what happens to dependents:

Mode	Behaviour	When used
Foreground	Owner gets a deletionTimestamp and the foregroundDeletion finalizer; GC deletes dependents first; owner deleted last.	Explicit --cascade=foreground
Background	Owner deleted immediately; GC deletes dependents asynchronously in background.	Default for most resources
Orphan	Owner deleted; dependents' ownerReferences cleared but they keep running.	Explicit --cascade=orphan; never by default
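Background cascading deletion amounts to a walk over the owner graph. The sketch below is a simplified model with hypothetical Object and Cascade names: it identifies objects by name only (the real controller matches on UID) and ignores finalizers and blockOwnerDeletion.

```go
package main

import (
	"fmt"
	"sort"
)

// Object is a simplified API object: a name plus the names of its owners.
type Object struct {
	Name   string
	Owners []string
}

// Cascade returns, in sorted order, the objects that would eventually be
// collected after root is deleted: every dependent whose owners are all gone.
func Cascade(objs []Object, root string) []string {
	deleted := map[string]bool{root: true}
	for changed := true; changed; {
		changed = false
		for _, o := range objs {
			if deleted[o.Name] || len(o.Owners) == 0 {
				continue // already collected, or not owned by anything
			}
			allGone := true
			for _, owner := range o.Owners {
				if !deleted[owner] {
					allGone = false // one surviving owner keeps the object alive
					break
				}
			}
			if allGone {
				deleted[o.Name] = true
				changed = true
			}
		}
	}
	delete(deleted, root)
	out := make([]string, 0, len(deleted))
	for name := range deleted {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

func main() {
	objs := []Object{
		{Name: "deploy/web"},
		{Name: "rs/web-1", Owners: []string{"deploy/web"}},
		{Name: "pod/web-1-a", Owners: []string{"rs/web-1"}},
		{Name: "pod/standalone"},
	}
	fmt.Println(Cascade(objs, "deploy/web"))
}
```

Deleting the Deployment here collects the ReplicaSet and its pod but leaves the standalone pod alone. Orphan mode corresponds to clearing the Owners field before the walk, so nothing qualifies.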

Why This Matters Operationally

Understanding internals directly improves incident diagnosis:

Controller not reacting? — Check the controller's work queue (controller-manager logs with --v=4). The informer cache may be stale due to a list-watch error.
Service not routing? — kube-proxy may not have picked up a new endpoint. EndpointSlices have generated names, so select them by label: kubectl get endpointslices -n ns -l kubernetes.io/service-name=svc-name -o yaml shows the ground truth; iptables-save | grep <cluster-ip> shows what the kernel actually has.
Pod stuck in ContainerCreating? — Kubelet is usually waiting on sandbox/CNI network setup, a volume mount, or an image pull. The events in kubectl describe pod show which step is failing (for example FailedCreatePodSandBox or FailedMount).
etcd latency spike? — High p99 etcd write latency causes API server request timeouts and cascades into controller backlog. Monitor etcd_disk_backend_commit_duration_seconds.