Advanced Topics

Kubernetes Internals Deep Dive


Most Kubernetes troubleshooting stays at the surface: check pod events, read logs, increase resource limits. But some problems require understanding what happens inside the control plane — why a controller doesn't react, why a Service isn't routing, why etcd is slow. This guide traces a kubectl apply all the way through the system to a running container.

API Server Request Lifecycle

Every kubectl command, operator reconcile, and controller watch goes through the API server. Understanding its pipeline explains admission webhook ordering, why some mutations don't appear immediately, and how API versioning works.

kubectl apply — from HTTP request to etcd write
1. TLS + Authentication: verify client cert / bearer token / OIDC JWT. Identity established.
2. Authorization: RBAC check. Can this user/SA perform this verb on this resource in this namespace?
3. Mutating admission webhooks: called one at a time, in sequence, so each webhook sees the changes made by the previous one. May inject sidecars, add labels, set defaults.
4. Object schema validation: OpenAPI v3 structural schema + CRD validation. Unknown fields pruned.
5. Validating admission webhooks + CEL policies: final gate, called in parallel. Reject or allow. No mutations permitted.
6. etcd write: object persisted with a new resourceVersion. Watch events fired to all watchers.
API server pipeline: authn → authz → mutating admission → schema validation → validating admission → etcd write → watch events fired. Each stage can reject the request.
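
The pipeline above can be modelled as an ordered chain of stages, each of which may reject the request. The following is a minimal sketch of that idea only, not the kube-apiserver implementation: `Request`, `Stage`, `Process`, `DefaultStages`, and the toy authorization rule are all hypothetical names and behaviour invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a toy stand-in for an API request; the fields are illustrative only.
type Request struct {
	User   string
	Verb   string
	Labels map[string]string
}

// Stage is one step of the pipeline. It may mutate the request or reject it.
type Stage struct {
	Name string
	Run  func(*Request) error
}

// Process runs the stages in order and stops at the first rejection.
// It returns the names of the stages that completed successfully.
func Process(req *Request, stages []Stage) ([]string, error) {
	var passed []string
	for _, s := range stages {
		if err := s.Run(req); err != nil {
			return passed, fmt.Errorf("%s: %w", s.Name, err)
		}
		passed = append(passed, s.Name)
	}
	return passed, nil
}

// DefaultStages is a hypothetical pipeline mirroring the six steps above.
func DefaultStages() []Stage {
	allow := func(*Request) error { return nil }
	return []Stage{
		{Name: "authn", Run: func(r *Request) error {
			if r.User == "" {
				return errors.New("no identity presented")
			}
			return nil
		}},
		{Name: "authz", Run: func(r *Request) error {
			// Toy rule (assumption): only "admin" may write; anyone may read.
			if r.Verb != "get" && r.User != "admin" {
				return errors.New("forbidden")
			}
			return nil
		}},
		{Name: "mutating-admission", Run: func(r *Request) error {
			if r.Labels == nil {
				r.Labels = map[string]string{}
			}
			r.Labels["injected"] = "true" // e.g. a defaulting or sidecar webhook
			return nil
		}},
		{Name: "schema-validation", Run: allow},
		{Name: "validating-admission", Run: func(r *Request) error {
			// Sees the object after all mutations; may only accept or reject.
			if r.Labels["injected"] != "true" {
				return errors.New("required label missing")
			}
			return nil
		}},
		{Name: "etcd-write", Run: allow},
	}
}

func main() {
	passed, err := Process(&Request{User: "admin", Verb: "create"}, DefaultStages())
	fmt.Println(passed, err)

	passed, err = Process(&Request{User: "dev", Verb: "create"}, DefaultStages())
	fmt.Println(passed, err)
}
```

Because the validating stage runs after the mutating stage, it checks the object as it will be stored, which is why policy checks belong in validating webhooks rather than mutating ones.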

etcd Watch Mechanics

etcd is not a traditional database. It is a distributed key-value store built on the Raft consensus protocol. The API server is the only component that talks to etcd — all other components go through the API server.

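The watch mechanism is what drives every Kubernetes control loop. A client lists a resource once, notes the resourceVersion of that snapshot, then opens a long-lived watch from that version and receives each subsequent create, update, and delete as an ordered event instead of polling.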
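
To make the revision-based watch idea concrete, here is a small in-memory model. It is a sketch under heavy assumptions: `Store`, `Event`, and `WatchFrom` are hypothetical names, the store is single-process, and events are replayed as a slice rather than streamed, so it illustrates only the "resume from a revision" contract, not etcd's API.

```go
package main

import "fmt"

// Event is a toy watch event; real etcd events carry more metadata.
type Event struct {
	Revision int64
	Type     string // "PUT" or "DELETE"
	Key      string
	Value    string
}

// Store is a hypothetical in-memory stand-in for a revisioned key-value store.
type Store struct {
	rev  int64
	data map[string]string
	log  []Event // every change, in revision order
}

func NewStore() *Store { return &Store{data: map[string]string{}} }

// Put writes a key, bumps the global revision, and records an event.
func (s *Store) Put(key, value string) int64 {
	s.rev++
	s.data[key] = value
	s.log = append(s.log, Event{Revision: s.rev, Type: "PUT", Key: key, Value: value})
	return s.rev
}

// Delete removes a key, bumps the global revision, and records an event.
func (s *Store) Delete(key string) int64 {
	s.rev++
	delete(s.data, key)
	s.log = append(s.log, Event{Revision: s.rev, Type: "DELETE", Key: key})
	return s.rev
}

// WatchFrom returns every event newer than afterRev, in order.
// A real watch keeps the stream open; this toy version only replays history.
func (s *Store) WatchFrom(afterRev int64) []Event {
	var out []Event
	for _, e := range s.log {
		if e.Revision > afterRev {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	s := NewStore()
	s.Put("/pods/default/web", "Pending")
	// A client lists at this point and remembers the revision it saw.
	listRev := s.Put("/pods/default/web", "Running")

	s.Put("/pods/default/db", "Pending")
	s.Delete("/pods/default/web")

	// Resuming from listRev yields only the changes the client has not seen.
	for _, e := range s.WatchFrom(listRev) {
		fmt.Printf("rev=%d %s %s %q\n", e.Revision, e.Type, e.Key, e.Value)
	}
}
```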

Informers & Work Queues

An informer is the standard Go library pattern for watching Kubernetes resources efficiently. It combines a List (initial state) + Watch (stream of changes) with a local in-memory cache (the Store/Indexer). Controllers normally serve their reads from the informer's cache instead of querying the API server on every reconcile; only writes (creates, updates, status patches) go to the API server.

informer + work queue — controller pattern
// Informer watches the API server and maintains a local cache
podInformer := factory.Core().V1().Pods()
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc: func(obj interface{}) {
        if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
            queue.Add(key) // enqueue the namespace/name key, NOT the object
        }
    },
    UpdateFunc: func(oldObj, newObj interface{}) {
        if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
            queue.Add(key) // a key already waiting in the queue is not added twice
        }
    },
    DeleteFunc: func(obj interface{}) {
        // Also unwraps DeletedFinalStateUnknown tombstones
        if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
            queue.Add(key)
        }
    },
})

// Worker: process one key from the queue; returns false once the queue is shut down
func (c *Controller) processNextItem() bool {
    key, quit := c.queue.Get()
    if quit {
        return false
    }
    defer c.queue.Done(key) // always mark the key as finished

    // Read from the CACHE, not the API server
    obj, exists, err := c.podIndexer.GetByKey(key.(string))
    if err != nil {
        c.queue.AddRateLimited(key) // retry later with backoff
        return true
    }
    if !exists {
        c.queue.Forget(key) // pod is gone: handle deletion here
        return true
    }

    // Reconcile based on cache state
    c.reconcile(obj.(*corev1.Pod))
    c.queue.Forget(key) // reset the rate limiter's failure count for this key
    return true
}

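The work queue is deduplicating and supports rate-limited retries. A key that is already waiting in the queue is not added a second time, and a key that changes while a worker is processing it is re-queued once when that worker finishes. If a pod is updated 100 times in 1 second, the controller therefore runs only a handful of reconciles (often one or two) instead of 100. This is why controllers stay efficient even during event storms.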
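
The deduplication behaviour can be reproduced with two sets and a slice. The sketch below is a simplified, single-goroutine model written for this guide: the `Queue` type and its methods are hypothetical and omit the locking, shutdown, and rate limiting that a production work queue needs.

```go
package main

import "fmt"

// Queue is a toy, single-goroutine model of a deduplicating work queue.
type Queue struct {
	order      []string        // keys waiting for a worker, in FIFO order
	dirty      map[string]bool // keys that still need processing
	processing map[string]bool // keys currently held by a worker
}

func NewQueue() *Queue {
	return &Queue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add marks key as needing work. Repeated adds of a pending key collapse into one.
func (q *Queue) Add(key string) {
	if q.dirty[key] {
		return // already marked: nothing new to record
	}
	q.dirty[key] = true
	if q.processing[key] {
		return // a worker holds this key; Done will re-queue it
	}
	q.order = append(q.order, key)
}

// Get hands the next key to a worker; ok is false when nothing is waiting.
func (q *Queue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	q.processing[key] = true
	delete(q.dirty, key)
	return key, true
}

// Done marks key as finished and re-queues it if it changed while being processed.
func (q *Queue) Done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.order = append(q.order, key)
	}
}

// Len reports how many keys are waiting for a worker.
func (q *Queue) Len() int { return len(q.order) }

func main() {
	q := NewQueue()
	for i := 0; i < 100; i++ {
		q.Add("default/web") // 100 updates to the same pod
	}
	fmt.Println("waiting after 100 adds:", q.Len())

	key, _ := q.Get()
	q.Add(key) // the pod changes twice while the worker is busy
	q.Add(key)
	fmt.Println("waiting while processing:", q.Len())

	q.Done(key)
	fmt.Println("waiting after Done:", q.Len())
}
```

Storing keys instead of objects is what makes this collapse safe: the worker always reads the latest state from the cache, so skipped intermediate versions do not matter.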

Controller Manager

The kube-controller-manager is a single binary running ~40 controllers in goroutines. Key controllers and what they watch:

| Controller | Watches | Acts on |
|---|---|---|
| Deployment | Deployments, ReplicaSets, Pods | Creates/updates/deletes ReplicaSets to match desired replicas |
| ReplicaSet | ReplicaSets, Pods | Creates/deletes Pods to match spec.replicas |
| StatefulSet | StatefulSets, Pods, PVCs | Creates Pods in order; provisions PVCs per pod |
| Node lifecycle | Nodes, Pods | Taints unreachable nodes; evicts pods after tolerationSeconds |
| Endpoint | Services, Pods | Updates Endpoints object when pod readiness changes |
| Garbage collection | All objects with ownerReferences | Deletes orphaned objects when owner is deleted |
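
Each controller in the table follows the same shape: compare desired state with observed state and emit the difference. As a rough illustration for the ReplicaSet row, the sketch below computes that difference. `Action` and `ReconcileReplicas` are hypothetical names, and choosing deletion victims by name order is a simplification made here; the real controller ranks pods by readiness, age, and other criteria.

```go
package main

import (
	"fmt"
	"sort"
)

// Action describes what a single reconcile pass decided to do.
type Action struct {
	Create int      // number of pods to create
	Delete []string // names of pods to delete
}

// ReconcileReplicas compares the desired replica count with the pods observed
// in the cache and returns the change needed to converge. desired must be >= 0.
func ReconcileReplicas(desired int, pods []string) Action {
	diff := desired - len(pods)
	switch {
	case diff > 0:
		return Action{Create: diff}
	case diff < 0:
		// Simplification: keep the first `desired` pods by name, delete the rest.
		sorted := append([]string(nil), pods...)
		sort.Strings(sorted)
		return Action{Delete: sorted[desired:]}
	default:
		return Action{} // already converged: nothing to do
	}
}

func main() {
	fmt.Printf("%+v\n", ReconcileReplicas(3, []string{"web-a"}))
	fmt.Printf("%+v\n", ReconcileReplicas(1, []string{"web-c", "web-a", "web-b"}))
	fmt.Printf("%+v\n", ReconcileReplicas(2, []string{"web-a", "web-b"}))
}
```

The function is level-triggered: it depends only on the current counts, not on which event caused the reconcile, so running it repeatedly is harmless.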

Kubelet CRI Loop

The kubelet runs on every node. It watches the API server for pods scheduled to its node (spec.nodeName == thisNode) and drives the container runtime via the Container Runtime Interface (CRI).

what happens when a pod is scheduled to a node
# 1. Scheduler binds the pod: it sets spec.nodeName through the API server,
#    which persists the change to etcd
# 2. Kubelet's pod watch fires; the pod is handed to syncLoop

# 3. Kubelet admission checks (node-level)
#    - resource fits on this node?
#    - node selectors match?

# 4. Create pod sandbox (pause container; holds the network namespace)
#    CRI call: RuntimeService.RunPodSandbox()
#    CNI plugin called to configure pod network (IP, routes)

# 5. Pull container images (if not cached), just before each container is created
#    CRI call: ImageService.PullImage()

# 6. Start containers in order: init containers → app containers
#    CRI call: RuntimeService.CreateContainer() + StartContainer()

# 7. Run post-start lifecycle hooks (if defined)

# 8. Probe loop begins: startup probe first (if defined), then liveness and readiness
#    Readiness probe passes → kubelet patches pod status.conditions
#    EndpointSlice controller sees Ready → adds the pod to the Service's EndpointSlices
#    kube-proxy sees the EndpointSlice change → programs the node's routing rules
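
The last step is the link between pod readiness and Service traffic: only pods that match the Service selector and report Ready are published as endpoints. A minimal sketch of that filter follows; `Pod`, `matches`, and `ReadyEndpoints` are hypothetical names, and real EndpointSlices also track ports, topology, and terminating state, which this omits.

```go
package main

import "fmt"

// Pod is a toy view of the fields that matter for endpoint selection.
type Pod struct {
	Name   string
	IP     string
	Labels map[string]string
	Ready  bool // result of the readiness probe, as reported in pod status
}

// matches reports whether labels contain every key/value pair in selector.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// ReadyEndpoints returns the IPs of pods that match the selector and are Ready.
func ReadyEndpoints(selector map[string]string, pods []Pod) []string {
	var ips []string
	for _, p := range pods {
		if p.Ready && matches(selector, p.Labels) {
			ips = append(ips, p.IP)
		}
	}
	return ips
}

func main() {
	pods := []Pod{
		{Name: "web-1", IP: "10.244.1.5", Labels: map[string]string{"app": "web"}, Ready: true},
		{Name: "web-2", IP: "10.244.1.6", Labels: map[string]string{"app": "web"}, Ready: false},
		{Name: "db-1", IP: "10.244.2.3", Labels: map[string]string{"app": "db"}, Ready: true},
	}
	// Only web-1 is both selected and Ready, so only its IP receives traffic.
	fmt.Println(ReadyEndpoints(map[string]string{"app": "web"}, pods))
}
```

This is why a Service can appear to "not route" even though its pods are Running: a failing readiness probe keeps the pod out of the endpoint list.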

kube-proxy & iptables

kube-proxy watches Services and EndpointSlices and programs the kernel's netfilter (iptables or ipvs) rules on every node. When a pod connects to a Service ClusterIP, the kernel intercepts the packet before it leaves the node and rewrites the destination to one of the backend pod IPs — no userspace proxy involved.

trace Service routing in iptables
# A Service with ClusterIP 10.96.10.50 pointing to 3 pods
# kube-proxy creates these iptables rules:

# PREROUTING/OUTPUT chain → KUBE-SERVICES
# KUBE-SERVICES: match ClusterIP → jump to KUBE-SVC-XXXX
# KUBE-SVC-XXXX: randomly select one of 3 endpoints (statistic module)
#   33% chance → KUBE-SEP-POD1 (DNAT to 10.244.1.5:8080)
#   50% of remaining → KUBE-SEP-POD2 (DNAT to 10.244.1.6:8080)
#   100% of remaining → KUBE-SEP-POD3 (DNAT to 10.244.2.3:8080)

# Inspect live rules
iptables-save | grep "10.96.10.50"

# If kube-proxy runs in ipvs mode (lower overhead at scale), inspect the virtual server instead
ipvsadm -Ln | grep -A5 "10.96.10.50"
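
The 33% / 50% / 100% figures look uneven, but because each rule is only evaluated when the earlier ones did not match, every endpoint ends up with an equal share. The arithmetic can be checked with a short sketch; `RuleProbabilities` and `EffectiveShares` are hypothetical helper names introduced here, not part of kube-proxy.

```go
package main

import "fmt"

// RuleProbabilities returns the match probability written on each rule in the
// chain: rule i (0-based) of n matches with probability 1/(n-i).
func RuleProbabilities(n int) []float64 {
	probs := make([]float64, n)
	for i := range probs {
		probs[i] = 1.0 / float64(n-i)
	}
	return probs
}

// EffectiveShares returns the overall fraction of new connections each
// endpoint receives, given that rules are evaluated top to bottom and a rule
// only sees the traffic that earlier rules did not match.
func EffectiveShares(ruleProbs []float64) []float64 {
	shares := make([]float64, len(ruleProbs))
	remaining := 1.0
	for i, p := range ruleProbs {
		shares[i] = remaining * p
		remaining -= shares[i]
	}
	return shares
}

func main() {
	probs := RuleProbabilities(3)
	shares := EffectiveShares(probs)
	for i := range probs {
		fmt.Printf("rule %d: probability %.3f, effective share %.3f\n", i+1, probs[i], shares[i])
	}
}
```

For three endpoints this gives 1/3, then 2/3 × 1/2 = 1/3, then 2/3 × 1/2 × 1 = 1/3.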

Garbage Collection

Kubernetes uses owner references to form a dependency DAG. The GC controller watches all objects and deletes dependents whose owner no longer exists. When an owner is deleted, one of three propagation modes applies:

| Mode | Behaviour | When used |
|---|---|---|
| Foreground | Owner gets a deletionTimestamp; GC deletes dependents first; owner deleted last. | Explicit --cascade=foreground |
| Background | Owner deleted immediately; GC deletes dependents asynchronously in background. | Default for most resources |
| Orphan | Owner deleted; dependents' ownerReferences cleared but they keep running. | Never by default; explicit --cascade=orphan |
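
Foreground deletion amounts to a post-order walk of the ownership graph: every dependent is removed before its owner. The sketch below shows that ordering on a Deployment → ReplicaSet → Pod chain. It is an illustration under assumptions: `Object` and `ForegroundOrder` are hypothetical names, each object has at most one owner, the graph is acyclic, and finalizers and blockOwnerDeletion are ignored.

```go
package main

import "fmt"

// Object is a toy resource with at most one owner (real objects may have several).
type Object struct {
	Name  string
	Owner string // empty for a root object
}

// ForegroundOrder returns the order in which a foreground cascading delete of
// root removes objects: dependents first, the owner last.
func ForegroundOrder(root string, objs []Object) []string {
	// Index dependents by owner so the graph can be walked top-down.
	children := map[string][]string{}
	for _, o := range objs {
		if o.Owner != "" {
			children[o.Owner] = append(children[o.Owner], o.Name)
		}
	}

	var order []string
	var visit func(name string)
	visit = func(name string) {
		for _, c := range children[name] {
			visit(c) // delete everything owned by this object first
		}
		order = append(order, name)
	}
	visit(root)
	return order
}

func main() {
	objs := []Object{
		{Name: "deploy/web"},
		{Name: "rs/web-abc", Owner: "deploy/web"},
		{Name: "pod/web-abc-1", Owner: "rs/web-abc"},
		{Name: "pod/web-abc-2", Owner: "rs/web-abc"},
	}
	fmt.Println(ForegroundOrder("deploy/web", objs))
}
```

Background mode visits the same set of objects but removes the owner first and leaves the dependents to be cleaned up afterwards.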

Why This Matters Operationally

Understanding internals directly improves incident diagnosis: