Events, Alerts & Incident Response
Alerts that page without context, route to the wrong team, or fire so often that they become noise are worse than no alerts. This guide covers Kubernetes Events, Alertmanager routing, SLO-based alerting that pages on user impact rather than arbitrary thresholds, and triage patterns that shorten mean time to resolution.
Kubernetes Events
Kubernetes Events are API objects written by cluster components (the scheduler, the kubelet, and controllers) when something notable happens: a pod scheduled, a container restarted, a probe failed, a PVC bound. They are the first place to look when diagnosing cluster-level issues.
# List all events in a namespace, newest first
kubectl get events -n production --sort-by='.lastTimestamp'
# Watch events in real time
kubectl get events -n production -w
# Events for a specific object
kubectl describe pod mypod -n production | grep -A20 "Events:"
# Events recorded on a Deployment object itself (events on its ReplicaSets and Pods are separate)
kubectl get events -n production \
  --field-selector involvedObject.kind=Deployment,involvedObject.name=myapp
# Events cluster-wide — all namespaces
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
The API server keeps Events for only 1 hour by default (the kube-apiserver --event-ttl flag). Incident post-mortems need longer retention, so ship events to a persistent backend, for example with kubernetes-event-exporter or a log pipeline that writes to Loki.
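When events are available as JSON (for instance the output of `kubectl get events -n production -o json`), a small script can rank the noisiest warnings before they expire. The following is a minimal sketch, not part of any Kubernetes tooling: the function name `summarize_warnings` and the sample data are hypothetical, and it reads only the standard Event fields `type`, `reason`, `count`, and `involvedObject`.

```python
from collections import Counter


def summarize_warnings(event_list, top=5):
    """Count Warning events per "Kind/name: reason", most frequent first."""
    counts = Counter()
    for ev in event_list.get("items", []):
        if ev.get("type") != "Warning":
            continue
        obj = ev.get("involvedObject", {})
        key = f"{obj.get('kind', '?')}/{obj.get('name', '?')}: {ev.get('reason', '?')}"
        # `count` is the number of occurrences folded into this Event object;
        # it can be missing or null, so fall back to 1.
        counts[key] += ev.get("count") or 1
    return counts.most_common(top)


# Hypothetical sample in the shape of `kubectl get events -o json`.
SAMPLE_EVENTS = {
    "items": [
        {"type": "Warning", "reason": "BackOff", "count": 7,
         "involvedObject": {"kind": "Pod", "name": "myapp-abc"}},
        {"type": "Warning", "reason": "Unhealthy", "count": 2,
         "involvedObject": {"kind": "Pod", "name": "myapp-abc"}},
        {"type": "Normal", "reason": "Scheduled", "count": 1,
         "involvedObject": {"kind": "Pod", "name": "myapp-abc"}},
    ]
}
```

In practice the input would come from `json.load(sys.stdin)` with the kubectl output piped in, rather than from the inline sample.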
Alertmanager Architecture
Prometheus fires alerts to Alertmanager. Alertmanager handles grouping, routing, deduplication, and silencing before sending notifications. It is not an alert definition tool — alert rules live in Prometheus (or PrometheusRule CRs).
Alert Routing
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/...'
route:
  receiver: default-slack
  group_by: [alertname, namespace, severity]
  group_wait: 30s       # wait before sending the first notification for a new group
  group_interval: 5m    # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h   # re-notify if still firing
  routes:
    # Critical alerts → PagerDuty (page on-call)
    - matchers:
        - severity = critical
      receiver: pagerduty
      continue: false   # stop here; later sibling routes are not evaluated
    # Backend team's alerts → their Slack channel
    - matchers:
        - team = backend
      receiver: slack-backend
      continue: true    # keep evaluating later sibling routes as well
receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        # Double quotes so YAML turns \n into a real newline
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
  - name: pagerduty
    pagerduty_configs:
      - routing_key: ''
        severity: '{{ .CommonLabels.severity }}'
  - name: slack-backend
    slack_configs:
      - channel: '#backend-alerts'
        send_resolved: true
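The way `continue` interacts with route order is easy to misread, so it can help to model the tree walk. The sketch below is a simplified stand-in for Alertmanager's matching, assuming equality matchers only; the `Route` class and `resolve` function are hypothetical names for illustration. It encodes three rules: child routes are tried in order, a matching child stops the walk unless it sets `continue: true`, and the parent's receiver is used only when no child matches.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    matchers: dict = field(default_factory=dict)  # label -> required value
    continue_: bool = False
    routes: list = field(default_factory=list)


def matches(route, labels):
    """True when every matcher on the route equals the alert's label value."""
    return all(labels.get(k) == v for k, v in route.matchers.items())


def resolve(route, labels):
    """Return the receivers that would be notified for an alert's labels."""
    if not matches(route, labels):
        return []
    receivers = []
    for child in route.routes:
        found = resolve(child, labels)
        receivers.extend(found)
        if found and not child.continue_:
            break  # first match wins unless the child sets continue
    # No child matched: this node handles the alert itself.
    return receivers or [route.receiver]


# The routing tree from the config above.
ROOT = Route(
    receiver="default-slack",
    routes=[
        Route("pagerduty", {"severity": "critical"}, continue_=False),
        Route("slack-backend", {"team": "backend"}, continue_=True),
    ],
)
```

With this tree, a critical alert from the backend team resolves to `pagerduty` only, because the first route stops the walk. If the backend channel should also see critical alerts, the critical route would need `continue: true`.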
Inhibition & Silencing
Inhibition automatically suppresses lower-severity alerts when a higher-severity alert fires. If ClusterDown is firing, suppress all pod-level alerts — they are all consequences of the same root cause.
inhibit_rules:
  - source_matchers:
      - alertname = "NodeNotReady"
    target_matchers:
      - severity =~ "warning|info"
    equal: [node]   # only inhibit alerts on the same node
  - source_matchers:
      - alertname = "KubeAPIServerDown"
    target_matchers:
      - alertname =~ "Kube.*"   # suppress all Kube* alerts: the API is down, so everything looks broken
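The semantics of an inhibit rule can be sketched in a few lines. This is an illustrative model rather than Alertmanager's implementation: the rule dictionaries and the `is_inhibited` function are hypothetical, matchers are treated as fully anchored regular expressions, and a missing label is treated as an empty value. It also mirrors the documented safeguard that an alert matching both sides of a rule cannot be inhibited by another alert that also matches both sides (including itself).

```python
import re


def label_matches(labels, matchers):
    """True when every label fully matches its regex (missing label = "")."""
    return all(
        re.fullmatch(rx, labels.get(name, "")) is not None
        for name, rx in matchers.items()
    )


def is_inhibited(target, firing, rule):
    """Return True if some firing alert inhibits `target` under `rule`."""
    if not label_matches(target, rule["target"]):
        return False
    target_is_source = label_matches(target, rule["source"])
    for source in firing:
        if not label_matches(source, rule["source"]):
            continue
        # Alerts that match both sides cannot inhibit each other.
        if target_is_source and label_matches(source, rule["target"]):
            continue
        # All labels listed in `equal` must carry the same value.
        if all(source.get(l, "") == target.get(l, "") for l in rule.get("equal", [])):
            return True
    return False


NODE_RULE = {
    "source": {"alertname": "NodeNotReady"},
    "target": {"severity": "warning|info"},
    "equal": ["node"],
}
API_RULE = {
    "source": {"alertname": "KubeAPIServerDown"},
    "target": {"alertname": "Kube.*"},
}
```

Under `API_RULE`, `KubeAPIServerDown` matches the `Kube.*` target pattern as well, and the safeguard is what keeps it from silencing itself.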
Silencing is manual — mute alerts during maintenance windows. Create silences via the Alertmanager UI or API:
# Create a silence via amtool (Alertmanager CLI)
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --duration=2h \
  --comment="Scheduled maintenance window" \
  alertname="HighMemoryUsage" namespace="staging"
SLO-Based Alerting
Threshold-based alerts ("CPU > 80%") page frequently and often don't correlate with user impact. SLO-based alerting pages when the error budget is burning faster than sustainable.
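A common model: a 99.9% availability SLO leaves an error budget of 0.1%, about 43.8 minutes per month. If errors consume the budget at 14.4× the sustainable rate, the whole budget for a 30-day window is gone in about 2 days (30 / 14.4 ≈ 2.1), and 2% of it is spent in a single hour. That warrants a page.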
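The arithmetic behind these figures is short enough to check directly. The helper names below are hypothetical, and the sketch assumes a 30-day SLO window unless stated otherwise; the 43.8-minute figure corresponds to an average calendar month of about 30.44 days.

```python
def error_budget_minutes(slo, window_days=30.0):
    """Minutes of full outage allowed by an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def days_to_exhaustion(burn_rate, window_days=30.0):
    """Days until the budget is gone if the burn rate stays constant."""
    return window_days / burn_rate


def budget_consumed(burn_rate, hours, window_days=30.0):
    """Fraction of the window's budget spent after `hours` at `burn_rate`."""
    return burn_rate * hours / (window_days * 24)
```

For a 99.9% SLO, `error_budget_minutes(0.999)` gives roughly 43.2 minutes for a 30-day window, `days_to_exhaustion(14.4)` gives about 2.1 days, and `budget_consumed(6, 6)` shows that a 6× burn sustained for six hours spends 5% of the budget.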
groups:
  - name: slo.myapp
    rules:
      # Error rate over short + long windows
      - alert: HighErrorBudgetBurn
        expr: |
          (
            job:http_error_rate:ratio5m > (14.4 * 0.001)   # 14.4x burn rate
            and
            job:http_error_rate:ratio1h > (14.4 * 0.001)
          )
          or
          (
            job:http_error_rate:ratio30m > (6 * 0.001)     # 6x burn rate
            and
            job:http_error_rate:ratio6h > (6 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn rate on myapp"
          runbook: "https://wiki.example.com/runbooks/myapp-high-error-rate"
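The reason each burn rate is checked over two windows is that the long window shows the burn is significant while the short window shows it is still happening, so the alert clears soon after recovery. The sketch below restates the rule's condition as a plain function to make that behavior testable; `should_page` and the dictionary keys are hypothetical names, and the ratios stand in for the recording rules referenced in the expression.

```python
SLO_ERROR_BUDGET = 0.001  # 1 - 0.999

# (burn rate, short window, long window): the pairs used in the rule above.
WINDOWS = [(14.4, "5m", "1h"), (6.0, "30m", "6h")]


def should_page(error_ratio, budget=SLO_ERROR_BUDGET):
    """error_ratio maps a window name to the error ratio observed over it."""
    return any(
        error_ratio[short] > rate * budget and error_ratio[long] > rate * budget
        for rate, short, long in WINDOWS
    )


# A brief spike: the 5m window is hot but the 1h window is not yet.
SPIKE = {"5m": 0.05, "1h": 0.002, "30m": 0.004, "6h": 0.0005}
# A sustained outage: both fast-burn windows exceed 1.44%.
SUSTAINED = {"5m": 0.02, "1h": 0.02, "30m": 0.02, "6h": 0.004}
# Already recovered: long windows are still high, short windows are clean.
RECOVERED = {"5m": 0.0, "1h": 0.03, "30m": 0.001, "6h": 0.01}
# A slow burn: below the fast threshold but above 0.6% on 30m and 6h.
SLOW_BURN = {"5m": 0.007, "1h": 0.007, "30m": 0.007, "6h": 0.007}
```

Of the four sample states, only `SUSTAINED` and `SLOW_BURN` satisfy the condition; the brief spike and the recovered service do not page.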
Runbooks
Every alert should link to a runbook. A runbook is a structured document answering: what does this alert mean, what are the likely causes, and what are the first 3 things to check?
# Alert: HighErrorRate
## What this means
HTTP 5xx rate exceeds 1% over 5 minutes for myapp in production.
## Likely causes
1. Bad deploy — check if a rollout happened recently (kubectl rollout history)
2. Downstream dependency down — check db/cache connectivity
3. OOM kills — check pod restarts (kubectl get pods -n production)
## Triage steps
1. kubectl get pods -n production -l app=myapp # check pod status
2. kubectl logs -l app=myapp -n production --tail=50 # find the error
3. kubectl top pods -n production # check resource saturation
4. Check the Grafana dashboard: https://grafana/d/myapp
## Escalation
If not resolved in 15 minutes: page the backend lead.
Rollback: kubectl rollout undo deployment/myapp -n production
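The requirement that every alert links to a runbook can be checked mechanically, for example in CI. The following sketch assumes the rule file has already been parsed into a dictionary (with whatever YAML loader is in use) and accepts either a `runbook` or a `runbook_url` annotation; the function name and the sample data are hypothetical.

```python
def missing_runbooks(rule_file, keys=("runbook", "runbook_url")):
    """Return the names of alerting rules that have no runbook annotation."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules use `record`, not `alert`
            annotations = rule.get("annotations") or {}
            if not any(annotations.get(k) for k in keys):
                missing.append(rule["alert"])
    return missing


# Hypothetical parsed rule file.
SAMPLE_RULES = {
    "groups": [
        {
            "name": "slo.myapp",
            "rules": [
                {"record": "job:http_error_rate:ratio5m", "expr": "..."},
                {
                    "alert": "HighErrorBudgetBurn",
                    "annotations": {
                        "runbook": "https://wiki.example.com/runbooks/myapp-high-error-rate"
                    },
                },
                {"alert": "HighMemoryUsage", "annotations": {"summary": "Memory high"}},
            ],
        }
    ]
}
```

A CI job could fail the build whenever the returned list is non-empty.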
Incident Triage Patterns
A repeatable triage sequence reduces time-to-diagnosis regardless of which engineer is on call:
- What changed? kubectl rollout history, recent deploys, config changes, infrastructure events.
- Where is it broken? Scope to namespace, deployment, node. Use kubectl get pods -A to find unhealthy pods cluster-wide.
- What is the pod doing? kubectl describe pod for events; kubectl logs --previous for crash logs.
- Is it resources? kubectl top pods for CPU/memory; check for OOMKilled in describe output.
- Is it connectivity? Exec into the pod and curl the downstream service; check NetworkPolicy with kubectl get netpol.
kubectl Commands
# Find pods whose phase is not Running or Succeeded (CrashLoopBackOff pods still report Running)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Pods with high restart counts
kubectl get pods -A | awk 'NR>1 && $5 > 3 {print}'
# Check current CPU/memory usage (needs metrics-server); compare with the limits shown by describe
kubectl top pods -n production
# Recent events for a namespace
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
# Describe a pod — events, resource limits, probe status
kubectl describe pod mypod -n production
# Rollback a deployment
kubectl rollout undo deployment/myapp -n production
# Check rollout history
kubectl rollout history deployment/myapp -n production
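Column-based filters such as the awk one-liner depend on the table layout, so for scripted triage it can be more robust to read the JSON form of the same data. The sketch below assumes input in the shape of `kubectl get pods -A -o json` and uses the standard `containerStatuses` fields `restartCount` and `lastState.terminated.reason`; the function name, threshold, and sample data are hypothetical.

```python
def flag_pods(pod_list, max_restarts=3):
    """Return (namespace/pod, container, reason) for containers worth a look."""
    flagged = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        pod_name = f"{meta.get('namespace', '?')}/{meta.get('name', '?')}"
        for cs in pod.get("status", {}).get("containerStatuses") or []:
            reasons = []
            restarts = cs.get("restartCount", 0)
            if restarts > max_restarts:
                reasons.append(f"{restarts} restarts")
            last = (cs.get("lastState") or {}).get("terminated") or {}
            if last.get("reason") == "OOMKilled":
                reasons.append("last exit OOMKilled")
            if reasons:
                flagged.append((pod_name, cs.get("name", "?"), ", ".join(reasons)))
    return flagged


# Hypothetical sample in the shape of `kubectl get pods -A -o json`.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"namespace": "production", "name": "myapp-abc"},
            "status": {"containerStatuses": [
                {"name": "app", "restartCount": 5,
                 "lastState": {"terminated": {"reason": "OOMKilled"}}},
            ]},
        },
        {
            "metadata": {"namespace": "production", "name": "myapp-def"},
            "status": {"containerStatuses": [
                {"name": "app", "restartCount": 0, "lastState": {}},
            ]},
        },
    ]
}
```

Pods that are still Pending have no `containerStatuses` yet, which is why the loop tolerates a missing or null list.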