
Events, Alerts & Incident Response


Alerts that page without context, route to the wrong team, or fire constantly due to noise are worse than no alerts. This guide covers Kubernetes Events, Alertmanager routing, SLO-based alerting that pages on user impact rather than arbitrary thresholds, and the triage patterns that shorten mean time to resolution.

Kubernetes Events

Kubernetes Events are API objects written by cluster components (controllers, the scheduler, the kubelet) when something notable happens: a pod is scheduled, a container restarts, a probe fails, a PVC is bound. They are the first place to look when diagnosing cluster-level issues.

# List all events in a namespace, sorted oldest to newest (most recent at the bottom)
kubectl get events -n production --sort-by='.lastTimestamp'

# Watch events in real time
kubectl get events -n production -w

# Events for a specific object
kubectl describe pod mypod -n production | grep -A20 "Events:"

# Events recorded against the Deployment object itself (not its ReplicaSets or Pods)
kubectl get events -n production \
  --field-selector involvedObject.kind=Deployment,involvedObject.name=myapp

# Events cluster-wide — all namespaces
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
⚠️
Events expire after 1 hour by default

The API server keeps Events for 1 hour (--event-ttl) by default. For incident post-mortems you need longer retention. Use a tool such as kubernetes-event-exporter to ship events to a persistent backend, or collect them into a log store such as Loki.
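
Once events are exported as JSON, a small script can condense them for a post-mortem. The following is a minimal sketch, assuming input shaped like the output of kubectl get events -o json; the function name summarize_warnings and the SAMPLE data are hypothetical and only illustrate the idea.

```python
from collections import Counter


def summarize_warnings(event_list: dict) -> list[tuple[str, str, int]]:
    """Count Warning events per (object, reason), most frequent first."""
    counts: Counter = Counter()
    for ev in event_list.get("items", []):
        if ev.get("type") != "Warning":
            continue  # skip Normal events such as Scheduled or Pulled
        obj = ev.get("involvedObject", {})
        target = f'{obj.get("kind", "?")}/{obj.get("name", "?")}'
        # count may be missing or null on some events; treat that as one occurrence
        counts[(target, ev.get("reason", "?"))] += ev.get("count") or 1
    return [(target, reason, n) for (target, reason), n in counts.most_common()]


# Hypothetical sample in the shape of `kubectl get events -o json` output.
SAMPLE = {
    "items": [
        {"type": "Normal", "reason": "Scheduled", "count": 1,
         "involvedObject": {"kind": "Pod", "name": "mypod"}},
        {"type": "Warning", "reason": "Unhealthy", "count": 2,
         "involvedObject": {"kind": "Pod", "name": "mypod"}},
        {"type": "Warning", "reason": "BackOff", "count": 5,
         "involvedObject": {"kind": "Pod", "name": "mypod"}},
    ]
}

for target, reason, n in summarize_warnings(SAMPLE):
    print(f"{n:>4}  {target}  {reason}")
```

In practice the dictionary would come from json.load on the exported file or on piped kubectl output rather than from an inline sample.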

Alertmanager Architecture

Prometheus fires alerts to Alertmanager. Alertmanager handles grouping, routing, deduplication, and silencing before sending notifications. It is not an alert definition tool — alert rules live in Prometheus (or PrometheusRule CRs).

Alert pipeline: Prometheus → Alertmanager → receivers
PROMETHEUS
  evaluates rules
  sends firing alerts to the Alertmanager HTTP API
ALERTMANAGER
  1. apply routing tree
  2. group related alerts
  3. inhibit / silence
  4. deduplicate
  5. send notification
RECEIVERS
  PagerDuty / OpsGenie
  Slack channel
  Email / webhook
Alertmanager routes, groups, silences, and deduplicates alerts before sending to receivers. One Alertmanager deployment can serve several Prometheus instances.

Alert Routing

alertmanager.yaml — route by team label
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  receiver: default-slack
  group_by: [alertname, namespace, severity]
  group_wait: 30s         # wait before sending first notification
  group_interval: 5m      # wait before notifying about new alerts added to an already-notified group
  repeat_interval: 4h     # re-notify if still firing

  routes:
  # Critical alerts → PagerDuty (page on-call)
  - matchers:
    - severity = critical
    receiver: pagerduty
    continue: false          # stop matching after first match

  # Backend team's alerts → their Slack channel
  - matchers:
    - team = backend
    receiver: slack-backend
    continue: true           # keep evaluating later sibling routes (none follow here, so this has no effect)

receivers:
- name: default-slack
  slack_configs:
  - channel: '#alerts'
    title: '{{ .GroupLabels.alertname }}'
    text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

- name: pagerduty
  pagerduty_configs:
  - routing_key: ''          # placeholder: set to your PagerDuty Events API v2 integration key
    severity: '{{ .CommonLabels.severity }}'

- name: slack-backend
  slack_configs:
  - channel: '#backend-alerts'
    send_resolved: true
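
Route order and the continue flag decide which receivers see an alert. The sketch below is a simplified model of the routing tree above, not Alertmanager's implementation: it supports only equality matchers, and the names ROUTE_TREE and receivers_for are hypothetical. It assumes the documented behavior that child routes are tried in order, matching stops at the first match unless continue is true, and the parent's receiver is used only when no child matches.

```python
# Simplified stand-in for the routing tree in alertmanager.yaml above.
ROUTE_TREE = {
    "receiver": "default-slack",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty", "continue": False},
        {"match": {"team": "backend"}, "receiver": "slack-backend", "continue": True},
    ],
}


def receivers_for(labels: dict, node: dict = ROUTE_TREE) -> list[str]:
    """Return the receivers an alert with these labels would reach."""
    matched: list[str] = []
    for child in node.get("routes", []):
        if all(labels.get(k) == v for k, v in child["match"].items()):
            matched.extend(receivers_for(labels, child))
            if not child.get("continue", False):
                break  # first match wins unless continue is set
    # Fall back to this node's own receiver only when no child matched.
    return matched or [node["receiver"]]


print(receivers_for({"severity": "critical", "team": "backend"}))
print(receivers_for({"severity": "warning", "team": "backend"}))
print(receivers_for({"severity": "warning", "team": "frontend"}))
```

Under this model a critical alert from the backend team reaches only PagerDuty, because the first route matches and stops evaluation. If the backend channel should also see critical alerts, the critical route would need continue: true.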

Inhibition & Silencing

Inhibition automatically suppresses a set of target alerts while a related source alert is firing, typically lower-severity symptoms of a higher-severity cause. If ClusterDown is firing, suppress all pod-level alerts, because they are all consequences of the same root cause.

inhibit_rules — suppress child alerts during cluster outage
inhibit_rules:
- source_matchers:
  - alertname = "NodeNotReady"
  target_matchers:
  - severity =~ "warning|info"
  equal: [node]             # only inhibit alerts on the same node

- source_matchers:
  - alertname = "KubeAPIServerDown"
  target_matchers:
  - alertname =~ "Kube.*"   # suppress all Kube* alerts — API is down, everything looks broken
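
How these rules decide whether an alert is muted can be sketched in a few lines. This is a simplified model under stated assumptions, not Alertmanager's code: matchers are (label, operator, value) tuples with only = and =~ supported, regexes are treated as fully anchored, missing labels compare equal to empty strings, and an alert that matches both sides of a rule is not inhibited by another alert that also matches both sides (which covers the self-match in the second rule). The names RULES, matches, and is_inhibited are hypothetical.

```python
import re

# Mirror of the two inhibit_rules above.
RULES = [
    {"source": [("alertname", "=", "NodeNotReady")],
     "target": [("severity", "=~", "warning|info")],
     "equal": ["node"]},
    {"source": [("alertname", "=", "KubeAPIServerDown")],
     "target": [("alertname", "=~", "Kube.*")]},
]


def matches(labels: dict, matchers: list) -> bool:
    """True if every matcher holds for the given label set."""
    for name, op, value in matchers:
        actual = labels.get(name, "")
        if op == "=" and actual != value:
            return False
        if op == "=~" and re.fullmatch(value, actual) is None:
            return False
    return True


def is_inhibited(target: dict, firing: list, rules: list = RULES) -> bool:
    """True if some firing alert inhibits `target` under any rule."""
    for rule in rules:
        if not matches(target, rule["target"]):
            continue
        for source in firing:
            if not matches(source, rule["source"]):
                continue
            # Alerts matching both sides do not inhibit each other (or themselves).
            if matches(target, rule["source"]) and matches(source, rule["target"]):
                continue
            if all(source.get(l, "") == target.get(l, "") for l in rule.get("equal", [])):
                return True
    return False


node_down = {"alertname": "NodeNotReady", "severity": "critical", "node": "node-1"}
mem_same_node = {"alertname": "HighMemoryUsage", "severity": "warning", "node": "node-1"}
mem_other_node = {"alertname": "HighMemoryUsage", "severity": "warning", "node": "node-2"}
print(is_inhibited(mem_same_node, [node_down]), is_inhibited(mem_other_node, [node_down]))
```

The equal list is what keeps the first rule from muting warnings on healthy nodes; the second rule has no equal list, so a single KubeAPIServerDown alert mutes every other Kube* alert.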

Silencing is manual — mute alerts during maintenance windows. Create silences via the Alertmanager UI or API:

# Create a silence via amtool (Alertmanager CLI)
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --duration=2h \
  --comment="Scheduled maintenance window" \
  alertname="HighMemoryUsage" namespace="staging"

SLO-Based Alerting

Threshold-based alerts ("CPU > 80%") page frequently and often don't correlate with user impact. SLO-based alerting pages when the error budget is burning faster than sustainable.

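A common model: a 99.9% availability SLO has an error budget of 0.1% of the window, which is about 43 minutes over 30 days. If errors are consuming the budget at 14.4× the sustainable rate, 2% of the budget disappears every hour and the whole budget is gone in roughly 2 days. That warrants a page.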
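
The arithmetic behind error budgets and burn rates can be checked with a short calculation. This is a minimal sketch assuming a 30-day SLO window; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def hours_to_exhaustion(burn_rate: float, window_days: float = 30) -> float:
    """Hours until the budget is gone at a constant burn rate.

    A burn rate of 1 spends the budget exactly over the whole window.
    """
    return window_days * 24 / burn_rate


def budget_consumed(burn_rate: float, hours: float, window_days: float = 30) -> float:
    """Fraction of the budget spent after `hours` at a constant burn rate."""
    return burn_rate * hours / (window_days * 24)


print(f"budget for 99.9% over 30 days: {error_budget_minutes(0.999):.1f} min")
print(f"14.4x burn exhausts it in:     {hours_to_exhaustion(14.4):.0f} h")
print(f"14.4x burn for 1 h consumes:   {budget_consumed(14.4, 1):.0%}")
print(f"6x burn for 6 h consumes:      {budget_consumed(6, 6):.0%}")
```

The 14.4× and 6× factors used in burn-rate alerts correspond to spending 2% of the budget in one hour and 5% in six hours, respectively.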

multiwindow multi-burn-rate alert (Google SRE pattern)
groups:
- name: slo.myapp
  rules:
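  # Fires only when both a short and a long window exceed the burn-rate threshold.
  # 0.001 is the error budget of a 99.9% SLO; the job:http_error_rate:ratio* recording rules must be defined separately.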
  - alert: HighErrorBudgetBurn
    expr: |
      (
        job:http_error_rate:ratio5m > (14.4 * 0.001)   # 14.4x burn rate
        AND
        job:http_error_rate:ratio1h > (14.4 * 0.001)
      )
      OR
      (
        job:http_error_rate:ratio30m > (6 * 0.001)     # 6x burn rate
        AND
        job:http_error_rate:ratio6h > (6 * 0.001)
      )
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High error budget burn rate on myapp"
      runbook: "https://wiki.example.com/runbooks/myapp-high-error-rate"
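
The reason for pairing a short and a long window is easier to see with concrete numbers. The sketch below mirrors the boolean structure of the rule above in plain Python; the error ratios are made-up inputs, and TIERS and should_page are hypothetical names.

```python
SLO = 0.999
BUDGET = 1 - SLO  # allowed error ratio, 0.001 for a 99.9% SLO

# (burn rate, short window, long window), matching the alert expression above.
TIERS = [(14.4, "5m", "1h"), (6.0, "30m", "6h")]


def should_page(error_ratio: dict[str, float], budget: float = BUDGET) -> bool:
    """True if any tier exceeds its threshold in both of its windows."""
    return any(
        error_ratio[short_w] > rate * budget and error_ratio[long_w] > rate * budget
        for rate, short_w, long_w in TIERS
    )


# A brief spike: the 5m window is hot, but the 1h window has barely moved.
spike = {"5m": 0.05, "1h": 0.002, "30m": 0.004, "6h": 0.001}
# A sustained fast burn: both the 5m and the 1h window are over 1.44%.
fast = {"5m": 0.02, "1h": 0.02, "30m": 0.02, "6h": 0.004}
# A slow burn: under the 14.4x threshold, over the 6x threshold everywhere.
slow = {"5m": 0.007, "1h": 0.007, "30m": 0.007, "6h": 0.007}
# Already recovered: long windows still high, short windows back to normal.
recovered = {"5m": 0.0, "1h": 0.02, "30m": 0.001, "6h": 0.01}

for name, ratios in [("spike", spike), ("fast", fast), ("slow", slow), ("recovered", recovered)]:
    print(name, should_page(ratios))
```

The long window filters out short spikes, and the short window lets the alert stop firing soon after the error rate returns to normal.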

Runbooks

Every alert should link to a runbook. A runbook is a structured document answering: what does this alert mean, what are the likely causes, and what are the first 3 things to check?

runbook structure
# Alert: HighErrorRate

## What this means
HTTP 5xx rate exceeds 1% over 5 minutes for myapp in production.

## Likely causes
1. Bad deploy — check if a rollout happened recently (kubectl rollout history)
2. Downstream dependency down — check db/cache connectivity
3. OOM kills — check pod restarts (kubectl get pods -n production)

## Triage steps
1. kubectl get pods -n production -l app=myapp          # check pod status
2. kubectl logs -l app=myapp -n production --tail=50    # find the error
3. kubectl top pods -n production                       # check resource saturation
4. Check the Grafana dashboard: https://grafana/d/myapp

## Escalation
If not resolved in 15 minutes: page the backend lead.
Rollback: kubectl rollout undo deployment/myapp -n production

Incident Triage Patterns

A repeatable triage sequence reduces time-to-diagnosis regardless of which engineer is on call:

  1. What changed? Check kubectl rollout history, recent deploys, config changes, infrastructure events.
  2. Where is it broken? Scope to namespace, deployment, node. Use kubectl get pods -A to find unhealthy pods cluster-wide.
  3. What is the pod doing? Use kubectl describe pod for events; kubectl logs --previous for crash logs.
  4. Is it resources? Use kubectl top pods for CPU/memory; check for OOMKilled in describe output.
  5. Is it connectivity? Exec into the pod and curl the downstream service; check NetworkPolicy with kubectl get netpol.

kubectl Commands

# Find pods whose phase is not Running or Succeeded (CrashLoopBackOff pods still report Running)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Pods with high restart counts
kubectl get pods -A | awk 'NR>1 && $5 > 3 {print}'

# Check current CPU/memory usage (requires metrics-server; limits are shown by kubectl describe)
kubectl top pods -n production

# Recent events for a namespace
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

# Describe a pod — events, resource limits, probe status
kubectl describe pod mypod -n production

# Roll back a deployment
kubectl rollout undo deployment/myapp -n production

# Check rollout history
kubectl rollout history deployment/myapp -n production
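
Column-based parsing of kubectl output is fragile when the table layout changes, so the restart check can also be done against JSON. The following is a minimal sketch, assuming input shaped like kubectl get pods -A -o json; the function name high_restart_pods and the SAMPLE data are hypothetical. It sums restartCount across a pod's regular containers and reports the last termination reason where one is recorded.

```python
def high_restart_pods(pod_list: dict, threshold: int = 3) -> list[tuple[str, str, int, list[str]]]:
    """Return (namespace, name, restarts, last termination reasons) for noisy pods."""
    result = []
    for pod in pod_list.get("items", []):
        # containerStatuses is absent for pods that have not started yet
        statuses = pod.get("status", {}).get("containerStatuses") or []
        restarts = sum(cs.get("restartCount", 0) for cs in statuses)
        if restarts <= threshold:
            continue
        reasons = sorted({
            cs["lastState"]["terminated"]["reason"]
            for cs in statuses
            if cs.get("lastState", {}).get("terminated", {}).get("reason")
        })
        meta = pod["metadata"]
        result.append((meta["namespace"], meta["name"], restarts, reasons))
    # Most restarts first
    return sorted(result, key=lambda row: row[2], reverse=True)


# Hypothetical sample in the shape of `kubectl get pods -A -o json` output.
SAMPLE = {
    "items": [
        {"metadata": {"namespace": "production", "name": "myapp-1"},
         "status": {"containerStatuses": [
             {"restartCount": 7, "lastState": {"terminated": {"reason": "OOMKilled"}}}]}},
        {"metadata": {"namespace": "production", "name": "myapp-2"},
         "status": {"containerStatuses": [{"restartCount": 0, "lastState": {}}]}},
        {"metadata": {"namespace": "staging", "name": "pending-pod"},
         "status": {}},
    ]
}

for row in high_restart_pods(SAMPLE):
    print(row)
```

A reason of OOMKilled in the output points at the resource step of the triage sequence; other reasons such as Error usually mean the logs from the previous container are the next thing to read.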