The DevOps engineer's handbook · v2026.04

Learn Kubernetes.
From Pods to production.

Fifty-two guides, built for engineers who have to ship and operate real clusters. Deep on concepts, honest on trade-offs, and cross-referenced with the official kubernetes.io docs on every page.

8 Foundations live
52 Guides planned
12m Avg. read time
v1.30 K8s target
I

Foundations

The absolute baseline — what Kubernetes is, how it works, and how to talk to it. Start here if you're new, or skim as a refresher.

8 guides
GUIDE · 01 Beginner

What Is Kubernetes?

The elevator pitch, the history lesson, and the honest answer to "do I actually need this?". Starts with the problem Kubernetes was built to solve.

10 min · 6 sections
You'll learn
  • The problem containers don't solve alone
  • Declarative vs. imperative orchestration
  • When K8s is the wrong answer
GUIDE · 02 Beginner

Kubernetes Architecture

Control plane vs. data plane, the five components that actually matter, and how an API request becomes a running container.

15 min · 8 sections
You'll learn
  • API server, scheduler, controller manager, etcd, kubelet
  • The reconcile loop, explained plainly
  • How nodes join and leave
GUIDE · 03 Beginner

Pods — The Atomic Unit

Why Pods, not containers, are the scheduling unit. Shared namespaces, sidecar patterns, init containers, and the lifecycle you have to know cold.

12 min · 9 sections
You'll learn
  • Shared network & IPC namespaces
  • Init containers & sidecars (with real examples)
  • Pod phases and when you actually care
GUIDE · 04 Beginner

Namespaces & Resource Isolation

How namespaces actually isolate resources (and where they don't). Conventions for multi-tenant clusters and why "default" is a trap.

10 min · 6 sections
You'll learn
  • What's scoped vs. cluster-wide
  • When to split by team, env, or app
  • ResourceQuota basics
GUIDE · 05 Beginner

Labels, Selectors & Annotations

The metadata that makes Kubernetes actually work. How selectors glue Services to Pods, and why annotations are for machines — not you.

10 min · 6 sections
You'll learn
  • Equality- vs. set-based selectors
  • Recommended label conventions (app.kubernetes.io/*)
  • When to reach for an annotation
GUIDE · 06 Beginner

kubectl — Control Plane CLI

The 20 commands you'll run 80% of the time, layered kubeconfig contexts, output formats that save you, and the aliases worth committing to muscle memory.

12 min · 7 sections
You'll learn
  • Kubeconfig: contexts, clusters, users
  • JSONPath, go-template, and -o wide
  • Debug commands you wish you knew sooner
GUIDE · 07 Beginner

Running a Local Cluster

kind, minikube, and k3d compared. Which one to reach for by use case, port-forward patterns, and realistic resource sizing.

15 min · 7 sections
You'll learn
  • Choosing kind vs. minikube vs. k3d
  • Multi-node clusters on one laptop
  • Loading local images without a registry
GUIDE · 08 Beginner

YAML Manifests & Declarative Config

The five fields every resource has, how to pick an apiVersion you can trust, the difference between apply and create, and a GitOps-ready layout.

12 min · 7 sections
You'll learn
  • apiVersion, kind, metadata, spec, status
  • kubectl apply semantics & last-applied
  • Kustomize vs. Helm vs. raw manifests
II

Workloads

Managing stateless and stateful applications — Deployments, StatefulSets, Jobs, and autoscaling.

7 guides
GUIDE · 09 Intermediate

Deployments & ReplicaSets

Rolling updates, rollback mechanics, and when to reach past the Deployment abstraction.

15 min

You'll learn

  • How ReplicaSets underpin every Deployment
  • Configure rolling update speed with maxSurge & maxUnavailable
  • Roll back to any previous revision in one command
  • Scale declaratively and with HPA
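How maxSurge and maxUnavailable bound a rollout comes down to a little arithmetic. The sketch below assumes the documented rounding rules (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down). The function names are illustrative and not part of any Kubernetes client library.

```python
import math


def resolve(value, replicas, round_up):
    """Turn an int or a percentage string such as "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return (max total pods, min available pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    print(rollout_bounds(10))       # the 25% / 25% defaults
    print(rollout_bounds(4, 1, 0))  # surge by one pod, never dip below 4
```

With 10 replicas and the defaults, the rollout may run up to 13 pods at once and keeps at least 8 available.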
GUIDE · 10 Intermediate

StatefulSets for Stateful Apps

Stable network identity, ordered updates, and the traps that bite when you try to "just use a Deployment".

15 min

You'll learn

  • Why databases need StatefulSets, not Deployments
  • Per-pod PVCs via volumeClaimTemplates
  • Headless Services and stable DNS per pod
  • Ordered scale-up/down and partition-based rollouts
GUIDE · 11 Intermediate

DaemonSets & Node-level Workloads

Run one pod per node automatically — for log collectors, metrics exporters, and network plugins.

12 min

You'll learn

  • When DaemonSets beat Deployments for infrastructure agents
  • Target specific nodes with nodeSelector and affinity
  • Tolerate tainted control-plane and not-ready nodes
  • Host network, hostPath, and why they require care
GUIDE · 12 Intermediate

Jobs & CronJobs

Run batch tasks to completion with automatic retries — and schedule recurring work with CronJobs.

12 min

You'll learn

  • How Jobs differ from Deployments — completions, not replicas
  • Parallel batch processing with parallelism + completions
  • backoffLimit, activeDeadlineSeconds, and cleanup
  • CronJob schedule syntax and concurrency policies
GUIDE · 13 Intermediate

Resource Requests & Limits

Teach the scheduler what your pods need — and protect your nodes from runaway containers.

12 min

You'll learn

  • How requests drive scheduling and limits cap runtime usage
  • CPU throttling vs memory OOMKill — the asymmetry that matters
  • QoS classes and eviction order under memory pressure
  • LimitRange defaults and ResourceQuota namespace caps
GUIDE · 14 Intermediate

Horizontal Pod Autoscaler

How HPA actually scales on CPU, memory, and custom metrics — and why yours isn't scaling.

15 min

You'll learn

  • The HPA control loop and scaling formula
  • Why CPU requests are required for HPA to work
  • Scale-up/down behavior and stabilisation windows
  • Custom metrics via prometheus-adapter
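The scaling formula mentioned above is short enough to write out. This is a minimal sketch of the ratio rule from the upstream HPA documentation, assuming the default tolerance of 0.1 and leaving out min/max clamping, stabilisation windows, and handling of unready pods. `desired_replicas` is an illustrative name.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count the HPA ratio rule asks for, before min/max clamping."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    print(desired_replicas(4, 200, 100))  # metric at twice the target
    print(desired_replicas(4, 50, 100))   # metric at half the target
    print(desired_replicas(4, 105, 100))  # inside the tolerance band
```

For CPU utilisation targets, the metric is usage as a percentage of the container's CPU request, so without a request there is nothing to compute the ratio from.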
GUIDE · 15 Intermediate

Liveness, Readiness & Startup Probes

Tell Kubernetes what healthy means — and avoid the restart cascades that probe misconfiguration causes.

12 min

You'll learn

  • Why liveness, readiness, and startup solve different problems
  • httpGet, tcpSocket, exec, and gRPC mechanisms
  • How readiness probes gate rolling updates
  • The five most dangerous probe misconfiguration patterns
III

Networking

How Kubernetes routes traffic inside and outside the cluster — Services, Ingress, DNS, Network Policies, CNI.

6 guides
GUIDE · 16 Intermediate

Services — ClusterIP, NodePort, LoadBalancer

The four Service types, when to pick each, and what kube-proxy is actually doing under the hood.

15 min

You'll learn

  • How kube-proxy rewrites traffic via iptables DNAT rules
  • ClusterIP for internal, NodePort for dev, LoadBalancer for cloud
  • ExternalName DNS aliases and headless Services for StatefulSets
  • Inspect endpoints to debug traffic routing
GUIDE · 17 Intermediate

Ingress & Ingress Controllers

One load balancer, many services. Host and path routing, TLS termination, annotations, and when Gateway API wins.

15 min

You'll learn

  • Why Ingress beats a LoadBalancer per service
  • How the Ingress Controller reads rules and routes traffic
  • Host-based and path-based routing with pathType
  • TLS termination, cert-manager, and annotation pitfalls
GUIDE · 18 Intermediate

Kubernetes DNS & Service Discovery

How CoreDNS resolves Service names, FQDN format, headless services, DNS policies, and debugging resolution failures.

12 min

You'll learn

  • How CoreDNS resolves short names using search domains
  • Service FQDN: <svc>.<ns>.svc.cluster.local
  • Headless services return per-pod A records instead of a VIP
  • ndots:5 and why it causes extra lookups for external names
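The ndots behaviour is easier to see as a list of lookups. The sketch below is a simplified model of a stub resolver inside a pod, assuming the usual three cluster search domains and `cluster.local` as the cluster domain. Real pods may also inherit search domains from the node, which this ignores. `lookup_order` is an illustrative name.

```python
def lookup_order(name, namespace, ndots=5, cluster_domain="cluster.local"):
    """Names a stub resolver would try, in order, for a pod in `namespace`."""
    if name.endswith("."):
        return [name.rstrip(".")]  # fully qualified: the search list is skipped
    search = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    expanded = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # enough dots: try the name as given first
    return expanded + [name]      # too few dots: walk the search list first


if __name__ == "__main__":
    for candidate in lookup_order("api.example.com", "default"):
        print(candidate)
```

An external name with fewer than five dots goes through three failed cluster lookups before the real one. A trailing dot skips the search list entirely.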
GUIDE · 19 Advanced

Network Policies

Default-deny, namespace isolation, ingress and egress rules — and which CNIs actually enforce them.

15 min

You'll learn

  • Why pods are fully open by default and how to close them down
  • podSelector, namespaceSelector, and ipBlock filtering
  • AND vs OR in from/to lists — the indentation trap
  • Default-deny pattern and DNS egress gotcha
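The AND vs OR distinction can be modelled in a few lines. The sketch below assumes matchLabels-only selectors and ignores ipBlock, ports, and matchExpressions. The function names and label values are hypothetical. Selectors inside one list element must all match, and the list elements are alternatives.

```python
def labels_match(selector, labels):
    """True when every key/value in a matchLabels-style selector is present."""
    return all(labels.get(key) == value for key, value in selector.items())


def peer_allows(peer, pod_labels, ns_labels, same_namespace):
    """One list element: every selector inside it must match (AND)."""
    if "namespaceSelector" in peer:
        if not labels_match(peer["namespaceSelector"], ns_labels):
            return False
    elif not same_namespace:
        return False  # a bare podSelector only selects pods in the policy's namespace
    return labels_match(peer.get("podSelector", {}), pod_labels)


def from_allows(peers, pod_labels, ns_labels, same_namespace=False):
    """The list as a whole: any one element matching is enough (OR)."""
    return any(peer_allows(p, pod_labels, ns_labels, same_namespace) for p in peers)


if __name__ == "__main__":
    # One element holding both selectors: namespace AND pod must match.
    and_peers = [{"namespaceSelector": {"team": "a"}, "podSelector": {"role": "client"}}]
    # Two elements (one extra dash in the YAML): namespace OR pod may match.
    or_peers = [{"namespaceSelector": {"team": "a"}}, {"podSelector": {"role": "client"}}]
    source_pod, source_ns = {"role": "batch"}, {"team": "a"}
    print(from_allows(and_peers, source_pod, source_ns))
    print(from_allows(or_peers, source_pod, source_ns))
```

The same two selectors admit a batch pod from team a's namespace in the OR form and reject it in the AND form.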
GUIDE · 20 Advanced

Gateway API

GatewayClass, Gateway, HTTPRoute — the Ingress successor with native traffic splitting, gRPC, and multi-tenant separation.

15 min

You'll learn

  • Why Ingress annotation sprawl led to Gateway API
  • Three-layer ownership: cluster admin, infra team, app team
  • Traffic splitting with weights for canary deployments
  • HTTPRoute, GRPCRoute, TCPRoute — protocols beyond HTTP
GUIDE · 21 Advanced

CNI Plugins — Calico, Flannel, Cilium

How pods get IPs, overlay vs underlay networking, and what actually differs between Flannel, Calico, and eBPF-based Cilium.

20 min

You'll learn

  • The Kubernetes flat networking model and why CNI implements it
  • Per-node IPAM subnets and how pod IPs are allocated
  • VXLAN overlay vs BGP underlay — tradeoffs
  • Flannel (simple), Calico (policy), Cilium (eBPF + observability)
IV

Storage

Persistent data in a containerized world — Volumes, PVs, StorageClasses, ConfigMaps & Secrets.

5 guides
GUIDE · 22 Intermediate

Volumes & Volume Mounts

emptyDir, hostPath, ConfigMap and Secret volumes — how containers share data and survive restarts without PersistentVolumes.

12 min

You'll learn

  • spec.volumes declarations and container volumeMounts wiring
  • emptyDir for sidecar log-shipping and shared scratch space
  • ConfigMap/Secret volumes update live; env vars don't
  • subPath mounts and why they break live-reload
GUIDE · 23 Intermediate

PersistentVolumes & PersistentVolumeClaims

The PV/PVC binding lifecycle, access modes, reclaim policies, static vs dynamic provisioning, and why your pod is stuck Pending.

15 min

You'll learn

  • PV/PVC binding: storageClass and accessMode must match, and PV capacity must cover the request
  • Reclaim policies — Retain keeps the disk; Delete removes it
  • RWO allows multiple pods same-node; use RWOP for true exclusivity
  • Released PVs need claimRef cleared before they rebind
GUIDE · 24 Intermediate

StorageClasses & Dynamic Provisioning

How provisioners auto-create cloud disks on PVC submit — reclaimPolicy, volumeBindingMode, WaitForFirstConsumer, and volume expansion.

12 min

You'll learn

  • Dynamic provisioning: StorageClass calls cloud API, creates PV automatically
  • WaitForFirstConsumer prevents EBS/PD cross-AZ scheduling failures
  • Default reclaimPolicy is Delete — dangerous for production databases
  • allowVolumeExpansion: grow a PVC without downtime
GUIDE · 25 Intermediate

ConfigMaps & Secrets

Env vars vs volume mounts, live update behavior, why Secrets aren't encrypted by default, and external secrets operators.

12 min

You'll learn

  • Volume mounts update live; env vars require pod restart
  • subPath mounts don't propagate updates — avoid for live config
  • Secrets are base64, not encrypted — enable etcd encryption at rest
  • External Secrets Operator syncs from Vault, AWS SM, GCP SM
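The "base64, not encrypted" point takes only the standard library to demonstrate. In this sketch the payload and key name are hypothetical. Anyone who can read the Secret object can reverse the encoding the same way.

```python
import base64


def decode_secret_data(data):
    """Decode the `data` map of a Secret manifest back to plain strings."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


if __name__ == "__main__":
    # Hypothetical payload, encoded the way a Secret's `data` field carries it.
    manifest_data = {"password": base64.b64encode(b"s3cr3t").decode("ascii")}
    print(manifest_data)
    print(decode_secret_data(manifest_data))
```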
GUIDE · 26 Advanced

Storage Patterns for Stateful Apps

StatefulSet PVC templates, RWO limitations, volume snapshots, Velero backup, and when managed databases beat self-hosted.

20 min

You'll learn

  • volumeClaimTemplates give each StatefulSet pod its own PVC
  • RWO disk attachment delays when a node fails (6–10 min)
  • Volume snapshots: point-in-time PVC copies via CSI
  • Velero for full cluster backup and namespace restore
V

Security

Hardening workloads and controlling access — RBAC, Pod Security, Secrets at scale, supply chain.

6 guides
GUIDE · 27 Intermediate

RBAC — Roles, Bindings & ServiceAccounts

Roles, ClusterRoles, bindings, ServiceAccount identity, least-privilege patterns, and debugging forbidden errors without losing a morning.

15 min

You'll learn

  • Role (namespace) vs ClusterRole (cluster-wide) and when each applies
  • ClusterRole + RoleBinding = namespace-scoped from a shared template
  • list on Secrets returns every value; grant get scoped to resourceNames instead
  • kubectl auth can-i to test permissions without guessing
GUIDE · 28 Advanced

Pod Security Standards & Admission

Restricted, baseline, privileged — how PSA enforces them via namespace labels without PSP's RBAC complexity.

15 min

You'll learn

  • Three PSS levels: privileged, baseline, restricted — what each blocks
  • Enforce, audit, warn modes — run all three during migration
  • securityContext fields required to pass restricted level
  • Migrating away from removed PodSecurityPolicy
GUIDE · 29 Advanced

Managing Secrets at Scale

Encryption at rest, External Secrets Operator, Sealed Secrets, CSI driver — because base64 is not a security strategy.

15 min

You'll learn

  • Why K8s Secrets are not secret by default
  • External Secrets Operator — sync from Vault, AWS SM, GCP SM
  • Sealed Secrets — encrypt for git-safe storage
  • Rotation: file mounts update live, env vars do not
GUIDE · 30 Advanced

Container Image Security

Scan, sign, verify — Trivy, Cosign, distroless images, and admission policies that block unsigned images before they run.

12 min

You'll learn

  • Trivy scanning in CI — fail on CRITICAL before push
  • Cosign keyless signing tied to GitHub Actions OIDC
  • Kyverno policy to verify signatures at admission time
  • Distroless and scratch images — reduce CVE surface to near zero
GUIDE · 31 Advanced

mTLS & Service Mesh Security

Mutual TLS, Istio PeerAuthentication, sidecar proxy internals, and zero-trust patterns for pod-to-pod traffic.

20 min

You'll learn

  • Default K8s network is unencrypted and unauthenticated
  • mTLS: both sides present certs — identity + encryption in one
  • Istio STRICT mode — reject all plain-text pod-to-pod connections
  • AuthorizationPolicy: SPIFFE identity-based layer-7 access control
GUIDE · 32 Advanced

Kubernetes Supply Chain Security

SLSA levels, SBOMs, Sigstore provenance, and Kyverno admission policies that verify the entire build chain before a pod runs.

20 min

You'll learn

  • SLSA build levels and what each requires and gives you (v1.0 defines L1–L3; the older v0.1 spec used 1–4)
  • SBOMs: answer "which images have log4j?" in seconds
  • Sigstore keyless signing tied to GitHub Actions OIDC identity
  • Kyverno SLSA provenance attestation verification at admission
VI

Observability

Logging, metrics, tracing — and building dashboards engineers will actually use.

5 guides
GUIDE · 33 Intermediate

Logging Architecture & Aggregation

How K8s logs work, Fluent Bit DaemonSet, Grafana Loki stack, structured logging, and retention strategies that won't bankrupt you.

12 min

You'll learn

  • Container stdout → node log file → aggregator pipeline
  • Fluent Bit DaemonSet config — tail, K8s metadata filter, Loki output
  • LogQL basics — stream, filter, JSON parse, rate queries
  • Filter health-check noise before shipping to cut volume 20–40%
GUIDE · 34 Intermediate

Metrics Server & Prometheus Integration

Metrics Server vs Prometheus — what each is for, kube-state-metrics, ServiceMonitor CRDs, PromQL golden-signal queries, and PrometheusRule alerting.

15 min

You'll learn

  • Metrics Server: real-time snapshot for HPA and kubectl top
  • Prometheus: time-series DB — install via kube-prometheus-stack Helm
  • kube-state-metrics: deployment replicas, pod phase, node conditions
  • ServiceMonitor CRD — scrape your app's /metrics without editing configs
GUIDE · 35 Advanced

Distributed Tracing with OpenTelemetry

OTel auto-instrumentation, Collector pipelines, tail-based sampling, Jaeger vs Tempo backends, and log–trace correlation in Grafana.

15 min

You'll learn

  • Trace = tree of spans; W3C traceparent propagates context between services
  • OTel Operator injects SDK — no code changes for Java/Python/Node
  • Tail sampling: always record errors and slow traces, sample the rest
  • Click a log line → jump to the exact trace span in Grafana
GUIDE · 36 Intermediate

Grafana Dashboards for Kubernetes

Four golden signals panels, dashboard variables, deploy annotations, dashboard-as-code via ConfigMap provisioning, and on-call dashboard design.

15 min

You'll learn

  • Four golden signals: rate, errors, latency, saturation as first row
  • Variables for namespace/deployment — one dashboard per team
  • Deploy annotations: spot if the latency spike correlates with a rollout
  • ConfigMap provisioning — dashboards survive pod restarts
GUIDE · 37 Intermediate

Events, Alerts & Incident Response

Alertmanager routing, inhibition, SLO burn-rate alerting, runbook structure, and the 5-step triage sequence that cuts MTTR in half.

12 min

You'll learn

  • kubectl get events: the first place to look, but events expire after 1h by default
  • Alertmanager: group, route by team label, inhibit child alerts
  • SLO burn-rate: page when the error budget burns about 14× faster than sustainable, not on arbitrary thresholds
  • 5-step triage: what changed → where broken → pod state → resources → connectivity
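The burn-rate figure above follows from simple arithmetic. This sketch assumes a 30-day SLO period and the commonly used 14.4 threshold, which corresponds to spending 2% of the period's error budget in one hour. The function names are illustrative.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than sustainable the error budget is burning."""
    budget = 1.0 - slo  # e.g. a 99.9% SLO leaves a 0.1% error budget
    return error_ratio / budget


def budget_consumed(rate, window_hours, period_hours=30 * 24):
    """Fraction of the whole period's budget spent if `rate` holds for the window."""
    return rate * window_hours / period_hours


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.0144, slo=0.999)
    print(round(rate, 2))
    print(round(budget_consumed(rate, window_hours=1), 4))
```

A burn rate of 1 spends the budget exactly over the full period. Anything well above 1 over a short window is the signal worth paging on.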
VII

Production Operations

Day-2 operations — upgrades, node management, GitOps, multi-cluster, DR.

7 guides
GUIDE · 38 Advanced

Rolling Cluster Upgrades

Version skew rules, control plane first, kubeadm runbook, node drain sequence, and rollback planning for managed and self-managed clusters.

15 min

You'll learn

  • Kubelet can be 3 minor versions behind apiserver — workers upgrade last
  • Upgrade one minor version at a time — no skipping 1.28 → 1.30
  • kubeadm upgrade plan → apply → drain → node upgrade → uncordon
  • Pre-upgrade: pluto for deprecated API detection, etcd snapshot
GUIDE · 39 Advanced

Node Pools & Node Management

Node pool design, taints for workload isolation, node affinity, Cluster Autoscaler, spot node strategies, and Node Problem Detector.

15 min

You'll learn

  • Taint GPU nodes: repel non-ML pods without touching their specs
  • Toleration alone doesn't guarantee placement — add nodeAffinity too
  • Cluster Autoscaler: HPA creates pods → CA adds nodes to fit them
  • Spot nodes: 60–90% cheaper; taint them so stateful workloads don't land
GUIDE · 40 Advanced

Resource Quotas & LimitRanges

Namespace CPU/memory caps, per-container defaults via LimitRange, QoS classes, PriorityClasses, and multi-tenant quota strategy.

12 min

You'll learn

  • ResourceQuota: namespace total cap — set 2× expected peak per team
  • LimitRange: inject default requests/limits so no pod runs unbounded
  • Guaranteed QoS = requests == limits — last evicted under memory pressure
  • PriorityClass: production preempts batch when cluster is full
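The QoS rule in the list above can be written as a small classifier. This sketch assumes each container is a plain dict with optional `requests` and `limits` maps, looks only at cpu and memory, and compares quantities as strings, so "500m" and "0.5" would not count as equal here. `qos_class` is an illustrative name.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Classify a pod from its containers' requests/limits dictionaries."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in RESOURCES:
            limit = limits.get(resource)
            # An unset request defaults to the limit when a limit is present.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
    print(qos_class([{"requests": {"cpu": "100m"}}]))
    print(qos_class([{}]))
```

A LimitRange that injects equal default requests and limits moves otherwise unannotated pods from BestEffort to Guaranteed under this rule.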
GUIDE · 41 Advanced

PodDisruptionBudgets & Zero-Downtime

PDB minAvailable, rolling update maxUnavailable, preStop hooks, topology spread constraints, and readiness gates to eliminate deployment downtime.

12 min

You'll learn

  • PDB blocks drain if it would drop below minAvailable — drain waits, not fails
  • minAvailable: 100% blocks all drains — never set this
  • maxUnavailable: 0 + maxSurge: 1 = true zero-downtime rolling update
  • preStop sleep: let load balancer deregister before SIGTERM hits
GUIDE · 42 Advanced

Multi-Cluster Patterns

Hub-and-spoke GitOps, active-active multi-region, ArgoCD ApplicationSet fleet deploys, cross-cluster service discovery, and kubeconfig management.

20 min

You'll learn

  • Hub-and-spoke: one ArgoCD cluster deploys to all member clusters
  • ApplicationSet: one template → one ArgoCD App per cluster automatically
  • svc.cluster.local is cluster-scoped — Submariner or Istio for cross-cluster
  • kubectx + merged kubeconfig for fast context switching across a fleet
GUIDE · 43 Advanced

GitOps with Flux & ArgoCD

Reconcile loops, Flux CRDs vs ArgoCD Applications, repo layout, SOPS secrets, progressive delivery with Argo Rollouts, and drift detection.

20 min

You'll learn

  • Git is the source of truth: an in-cluster agent pulls, and CI never pushes to the cluster
  • Flux: source-controller + kustomize-controller + helm-controller
  • ArgoCD: Application CRD, selfHeal reverts manual kubectl changes
  • SOPS + age: encrypt secrets for git — cluster decrypts at apply time
GUIDE · 44 Advanced

Backup & Disaster Recovery

etcd snapshots for control plane, Velero scheduled backups for workloads, restore procedures, RTO/RPO targets, and why GitOps simplifies DR.

15 min

You'll learn

  • etcd snapshot + restore — full control plane state at a point in time
  • Velero Schedule CRD — nightly backups with 30-day TTL to S3
  • etcd restore is destructive — all state rolls back to snapshot point
  • GitOps DR: new cluster + Flux bootstrap = workloads back in 20 min
VIII

Advanced Topics

Operators, extensibility, ecosystem deep-dives — the cluster becomes a platform.

8 guides
GUIDE · 45 Advanced

Custom Resource Definitions (CRDs)

Register new API types, structural schema validation, status subresource, printer columns, and multi-version CRDs with conversion webhooks.

15 min

You'll learn

  • kubectl get mydatabase works exactly like kubectl get pods — same API
  • Structural schema: type every field — enables pruning and defaulting
  • Status subresource: status is written through /status by controllers, so kubectl apply on the main resource can't overwrite it
  • Printer columns: kubectl get shows Phase, Engine, Replicas, not just NAME/AGE
GUIDE · 46 Advanced

Building Kubernetes Operators

The reconcile loop pattern, kubebuilder scaffolding, owner references, finalizers, status conditions, and when a Helm chart is enough.

25 min

You'll learn

  • Reconcile() must be idempotent — called any number of times safely
  • Owner references: child resources GC'd automatically when CR is deleted
  • Finalizers: block deletion until external cleanup (RDS, DNS) is done
  • Check operatorhub.io first — cert-manager, postgres-operator already exist
GUIDE · 47 Advanced

Admission Webhooks

Validating vs mutating webhooks, AdmissionReview wire format, TLS with cert-manager, failurePolicy lockout risks, and CEL ValidatingAdmissionPolicy.

20 min

You'll learn

  • Mutating runs before validating — shape is final by validation time
  • failurePolicy: Fail + webhook down = cluster lockout — exclude kube-system
  • CEL ValidatingAdmissionPolicy: no webhook server, no TLS, no process
  • cert-manager injects caBundle automatically via annotation
GUIDE · 48 Advanced

Scheduler, Affinity & Taints

Filter and score pipeline, node and pod affinity, taints NoSchedule vs NoExecute, topology spread constraints, and custom scheduler profiles.

15 min

You'll learn

  • Filter removes infeasible nodes; Score ranks the rest — Pending = all filtered
  • Required affinity is a filter; preferred affinity is a score weight
  • NoSchedule repels new pods; NoExecute also evicts running pods
  • Topology spread: maxSkew: 1 across zones without hard per-zone limits
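The filter-then-score pipeline can be shown with a toy model. The sketch below is a deliberately small stand-in for the real scheduler framework: one filter (CPU fit plus nodeSelector) and one score (most free CPU left after placement). The dict shapes and field names are hypothetical.

```python
def schedule(pod, nodes):
    """Pick a node name for `pod`, or None if every node is filtered out."""
    selector = pod.get("node_selector", {})
    # Filter: keep nodes that have room and satisfy the pod's nodeSelector.
    feasible = [
        node for node in nodes
        if node["free_cpu"] >= pod["cpu"]
        and all(node["labels"].get(key) == value for key, value in selector.items())
    ]
    if not feasible:
        return None  # nothing survived filtering: the pod would stay Pending
    # Score: prefer the node left with the most free CPU after placement.
    best = max(feasible, key=lambda node: node["free_cpu"] - pod["cpu"])
    return best["name"]


if __name__ == "__main__":
    nodes = [
        {"name": "node-a", "free_cpu": 2, "labels": {"zone": "x"}},
        {"name": "node-b", "free_cpu": 4, "labels": {"zone": "y"}},
    ]
    print(schedule({"cpu": 1}, nodes))
    print(schedule({"cpu": 1, "node_selector": {"zone": "x"}}, nodes))
    print(schedule({"cpu": 8}, nodes))
```

A required constraint such as the nodeSelector here removes nodes outright, while a preference would only change the score used to rank the survivors.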
GUIDE · 49 Advanced

Service Mesh with Istio

Istiod control plane, Envoy data plane, VirtualService routing, DestinationRule subsets, weighted canary splits, circuit breaking, and fault injection.

20 min

You'll learn

  • xDS API: Istiod pushes routing config to Envoy sidecars in real time
  • VirtualService: route by header, path, weight — independent of replica count
  • OutlierDetection: eject backends that return 5xx — automatic circuit breaking
  • Fault injection: add 500ms delay to 10% of requests without touching code
GUIDE · 50 Advanced

GPU Workloads on Kubernetes

NVIDIA device plugin, GPU resource requests, time-slicing for shared access, MIG hardware partitioning, DCGM metrics, and spotting idle GPU waste.

20 min

You'll learn

  • GPUs are extended resources: set them in limits; a request, if given, must equal the limit
  • Time-slicing: 4 pods share 1 GPU — no memory isolation, shared VRAM
  • MIG: A100 split into up to 7 isolated instances with dedicated memory
  • Alert when GPU util < 10% for 30 min — expensive idle allocation
GUIDE · 51 Advanced

Knative & Serverless on Kubernetes

Knative Serving for scale-to-zero HTTP workloads, traffic splitting across revisions, Knative Eventing broker/trigger routing, and KEDA for Kafka/SQS-driven autoscaling.

20 min

You'll learn

  • Scale-to-zero: activator buffers requests during cold start (1–3s)
  • Knative revision = immutable snapshot; route splits traffic across revisions
  • KEDA ScaledObject: 0 → 30 pods based on Kafka lag, no code changes
  • Serverless fits bursty webhooks; wrong fit for WebSocket or stateful services
GUIDE · 52 Advanced

Kubernetes Internals Deep Dive

API server admission pipeline, etcd watch mechanics, informer cache and work queue pattern, controller-manager loops, kubelet CRI flow, and iptables Service routing.

25 min

You'll learn

  • 6-stage API server pipeline: authn → authz → mutating admission → schema validation → validating admission → etcd
  • Informer = List+Watch + local cache, so controllers read from the cache instead of polling the API
  • Work queue deduplicates 100 updates into 1 reconcile — no API storms
  • kube-proxy: DNAT in kernel netfilter — no userspace proxy involved
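The work-queue deduplication mentioned above can be sketched with a queue plus two sets. This is a simplified, single-threaded model loosely following the dirty/processing bookkeeping that client-go's workqueue is known for. It leaves out rate limiting, locking, and shutdown. `DedupQueue` is an illustrative name.

```python
from collections import deque


class DedupQueue:
    """Queue of object keys that coalesces repeated adds of the same key."""

    def __init__(self):
        self._queue = deque()
        self._dirty = set()       # keys that need (re)processing
        self._processing = set()  # keys currently being reconciled

    def add(self, key):
        if key in self._dirty:
            return  # already pending: coalesce with the earlier add
        self._dirty.add(key)
        if key not in self._processing:
            self._queue.append(key)

    def get(self):
        key = self._queue.popleft()
        self._dirty.discard(key)
        self._processing.add(key)
        return key

    def done(self, key):
        self._processing.discard(key)
        if key in self._dirty:
            self._queue.append(key)  # changed while in flight: requeue once

    def __len__(self):
        return len(self._queue)


if __name__ == "__main__":
    queue = DedupQueue()
    for _ in range(100):
        queue.add("default/web")  # 100 update events for the same object
    print(len(queue))
```

Because the queue holds keys rather than events, the reconcile that eventually runs reads the latest state from the informer cache, however many updates arrived in between.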