Grafana Dashboards for Kubernetes
A dashboard nobody looks at is worse than no dashboard — it creates false confidence. Grafana is the standard visualization layer for Kubernetes observability, but building dashboards that engineers actually use during incidents requires deliberate design. This guide covers the essential panels, how to ship dashboards as code, and what the on-call dashboard needs to answer in under 30 seconds.
Built-in Dashboards
The kube-prometheus-stack Helm chart ships ~30 pre-built dashboards. The most useful out of the box:
| Dashboard | What to look at |
|---|---|
| Kubernetes / Compute Resources / Cluster | CPU/memory requests vs limits cluster-wide. Spot over-committed nodes. |
| Kubernetes / Compute Resources / Namespace | Per-namespace resource breakdown. Find the namespace eating quota. |
| Kubernetes / Compute Resources / Pod | Container-level CPU throttling and OOM history for a specific pod. |
| Node Exporter / Nodes | Disk I/O, network saturation, CPU steal — the OS layer below Kubernetes. |
| Kubernetes / Persistent Volumes | PVC usage percentage. Alert before disks fill. |
Essential K8s Panels
For a custom service dashboard, these panels cover the four golden signals (latency, traffic, errors, saturation):
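As a minimal sketch, the four signals can be expressed as one time-series panel each. The snippet below builds the panel JSON in Python. Only `http_requests_total` comes from this guide. The `status` label, the `http_request_duration_seconds_bucket` histogram, and the use of cAdvisor CPU throttling as the saturation signal are assumptions; swap in whatever your service actually exports.

```python
import json

# Label selector shared by the application-level queries. Having namespace and
# deployment labels on application metrics is an assumption.
SELECTOR = 'namespace="$namespace", deployment="$deployment"'


def sel(extra=""):
    """Return a PromQL label matcher block, optionally with extra matchers."""
    return "{" + SELECTOR + extra + "}"


# One PromQL expression per golden signal (metric names partly assumed).
GOLDEN_SIGNALS = [
    ("Traffic (req/s)",
     "sum(rate(http_requests_total" + sel() + "[5m]))"),
    ("Error ratio",
     "sum(rate(http_requests_total" + sel(', status=~"5.."') + "[5m]))"
     " / sum(rate(http_requests_total" + sel() + "[5m]))"),
    ("Latency p99 (s)",
     "histogram_quantile(0.99, sum by (le) "
     "(rate(http_request_duration_seconds_bucket" + sel() + "[5m])))"),
    ("CPU throttling ratio",
     'sum(rate(container_cpu_cfs_throttled_periods_total{namespace="$namespace"}[5m]))'
     ' / sum(rate(container_cpu_cfs_periods_total{namespace="$namespace"}[5m]))'),
]


def build_panels():
    """Build four time-series panels, two per row on Grafana's 24-column grid."""
    panels = []
    for i, (title, expr) in enumerate(GOLDEN_SIGNALS):
        panels.append({
            "id": i + 1,
            "type": "timeseries",
            "title": title,
            "targets": [{"refId": "A", "expr": expr}],
            "gridPos": {"h": 8, "w": 12, "x": (i % 2) * 12, "y": (i // 2) * 8},
        })
    return panels


print(json.dumps(build_panels(), indent=2))
```

The output can be pasted into the `panels` array of a dashboard JSON file. No data source is set on the targets, so the panels fall back to the dashboard's default data source.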
Dashboard Variables
Variables make a single dashboard reusable across namespaces, services, and clusters. Define them in Dashboard Settings → Variables.
# Namespace variable
Name: namespace
Query: label_values(kube_pod_info, namespace)
Refresh: On time range change
# Deployment variable (filtered to selected namespace)
Name: deployment
Query: label_values(kube_deployment_spec_replicas{namespace="$namespace"}, deployment)
Refresh: On time range change
# Use in panel queries (assumes the application metric carries namespace and deployment labels):
rate(http_requests_total{namespace="$namespace", deployment="$deployment"}[5m])
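When the dashboard is kept as JSON, the same two variables live under the `templating.list` key. The sketch below shows roughly how the UI fields above map to that JSON. It is a simplified assumption of the shape: a real export carries more fields (data source, sort order, current value), and newer Grafana versions may store `query` as an object rather than a string.

```python
import json

# Grafana's dashboard JSON encodes "Refresh: On time range change" as 2
# (1 means "On dashboard load").
ON_TIME_RANGE_CHANGE = 2


def query_variable(name, query):
    """Return a minimal query-type dashboard variable."""
    return {
        "name": name,
        "type": "query",
        "query": query,
        "refresh": ON_TIME_RANGE_CHANGE,
    }


templating = {
    "list": [
        query_variable("namespace", "label_values(kube_pod_info, namespace)"),
        # Chained variable: only lists deployments in the selected namespace.
        query_variable(
            "deployment",
            'label_values(kube_deployment_spec_replicas{namespace="$namespace"}, deployment)',
        ),
    ]
}

print(json.dumps({"templating": templating}, indent=2))
```

Order matters in the list: `deployment` references `$namespace`, so `namespace` is defined first.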
Annotations
Annotations overlay events on time-series panels — deployments, restarts, config changes. This makes it immediately obvious whether a latency spike correlates with a recent deploy.
# In Grafana: Dashboard Settings → Annotations → Add annotation query
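# Data source: Prometheus
# Note: both series below also change when a Deployment is scaled, so HPA scale events are annotated as "Deploy" too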
Query:
changes(kube_deployment_spec_replicas{namespace="$namespace",deployment="$deployment"}[2m]) > 0
OR
changes(kube_deployment_status_observed_generation{namespace="$namespace",deployment="$deployment"}[2m]) > 0
Title: Deploy
Tags: deployment
Dashboard as Code
Dashboards clicked together in the Grafana UI are fragile: they live in Grafana's database, can't be code-reviewed, and are lost on a pod restart when Grafana runs without persistent storage. Manage dashboards as code using JSON files or Grafonnet (a Jsonnet library).
# Export from Grafana UI: Dashboard → Share → Export → Save to file
# Commit the JSON to git under monitoring/dashboards/
# The JSON is the source of truth — never edit in the UI and forget to export
git add monitoring/dashboards/myapp.json
git commit -m "feat(monitoring): add myapp four-golden-signals dashboard"
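Once the JSON lives in git, a small CI check can catch the most common mistakes before they reach Grafana. The following is a sketch of one possible check, not an official tool: it verifies that every file parses, has a `title` and a `uid`, and that no two files share a `uid`. The function name and the choice of rules are illustrative.

```python
import json


def check_dashboards(files):
    """files maps file name -> JSON text. Returns a list of problem strings."""
    problems = []
    seen_uids = {}  # uid -> first file that used it
    for name, text in sorted(files.items()):
        try:
            dash = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append(f"{name}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(dash, dict):
            problems.append(f"{name}: top level is not a JSON object")
            continue
        for key in ("title", "uid"):
            if not dash.get(key):
                problems.append(f"{name}: missing {key}")
        uid = dash.get("uid")
        if uid in seen_uids:
            problems.append(f"{name}: uid {uid!r} already used by {seen_uids[uid]}")
        elif uid:
            seen_uids[uid] = name
    return problems


# In-memory demo; in CI the dict would be built from monitoring/dashboards/*.json.
demo = {
    "myapp.json": '{"title": "MyApp", "uid": "myapp-golden"}',
    "copy.json": '{"title": "Copy", "uid": "myapp-golden"}',
    "broken.json": '{"title": ',
}
for problem in check_dashboards(demo):
    print(problem)
```

Duplicate `uid` values are worth checking because two provisioned dashboards with the same `uid` compete for the same dashboard in Grafana.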
Provisioning via ConfigMap
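Grafana's provisioning system loads dashboards from files on disk, at startup and on periodic rescans of the provisioned directory. Putting the dashboard JSON in a ConfigMap is how it reaches that disk in Kubernetes, so no manual import is needed after a pod restart or cluster rebuild.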
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"  # kube-prometheus-stack sidecar watches this label
data:
  myapp.json: |
    {
      "title": "MyApp — Golden Signals",
      "uid": "myapp-golden",
      "tags": ["myapp", "production"],
      ...
    }
The dashboard sidecar container in the kube-prometheus-stack Grafana pod (named grafana-sc-dashboard in recent chart versions) watches for ConfigMaps with the grafana_dashboard: "1" label and loads them into the running Grafana, so no pod restart is needed.
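Writing the ConfigMap by hand means pasting JSON into YAML, which is easy to get wrong. A minimal sketch of generating the manifest from the dashboard JSON instead is shown below. It emits the manifest as JSON, which `kubectl apply -f` accepts as well as YAML. The naming scheme (`<uid>-dashboard`) is an assumption made for illustration, and it only works if the `uid` is a valid Kubernetes resource name.

```python
import json


def dashboard_configmap(dashboard, namespace="monitoring"):
    """Wrap a dashboard dict in a ConfigMap manifest the sidecar will pick up."""
    uid = dashboard["uid"]
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"{uid}-dashboard",  # hypothetical naming convention
            "namespace": namespace,
            "labels": {"grafana_dashboard": "1"},
        },
        # The sidecar writes each data key to disk as a file of that name.
        "data": {f"{uid}.json": json.dumps(dashboard, indent=2)},
    }


dashboard = {
    "title": "MyApp Golden Signals",
    "uid": "myapp-golden",
    "tags": ["myapp", "production"],
    "panels": [],
}
print(json.dumps(dashboard_configmap(dashboard), indent=2))
```

A ConfigMap is limited to 1 MiB of data, so very large dashboards are better split into several ConfigMaps, one dashboard each.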
Dashboard Organization
At more than a dozen dashboards, discoverability becomes the problem. A useful folder structure:
Grafana folders:
├── Kubernetes/ ← built-in cluster dashboards (from kube-prometheus-stack)
│ ├── Cluster Overview
│ ├── Node Exporter
│ └── Persistent Volumes
├── Services/ ← per-service dashboards (one per team/app)
│ ├── order-svc
│ ├── inventory-svc
│ └── api-gateway
└── On-Call/ ← triage dashboards, always visible on the NOC screen
├── Cluster Health
└── SLO Overview
On-Call Dashboard Design
The on-call dashboard is the one opened during a 3am page. Design it for triage, not exploration:
- One row per service — error rate (stat + sparkline), p99 latency (stat), replica count (stat). Green/red at a glance.
- Time range defaults to 1h — long enough to see the incident start, short enough to load fast.
- No variables required — all namespaces and critical services are always visible. Variables are for exploration, not triage.
- Link to runbooks — panel titles or descriptions link to the runbook for that alert. The dashboard is the entry point.
- Refresh every 30s — the incident is live; stale data is worse than no data.
Dense exploration dashboards with many variables and panels are useful for investigation. They are not useful when paged at 3am. Keep the on-call dashboard simple and fast-loading. Use separate investigation dashboards linked from the alert annotations.
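The triage rules above can be checked mechanically against the dashboard JSON. The sketch below is one hypothetical way to do that: it inspects the standard `time`, `refresh`, `templating`, and `panels` keys of the dashboard model, and treats a panel as linked to a runbook if it has either `links` or a `description`. The exact thresholds mirror this guide's recommendations and are not Grafana requirements. Panels nested inside collapsed rows are not inspected.

```python
def lint_oncall_dashboard(dash):
    """Return a list of ways the dashboard deviates from the triage rules."""
    problems = []
    if dash.get("time", {}).get("from") != "now-1h":
        problems.append("default time range should start at now-1h")
    if dash.get("refresh") != "30s":
        problems.append("refresh should be 30s")
    if dash.get("templating", {}).get("list"):
        problems.append("on-call dashboards should not require variables")
    for panel in dash.get("panels", []):
        if panel.get("type") == "row":
            continue  # row headers carry no data and need no runbook
        if not (panel.get("links") or panel.get("description")):
            title = panel.get("title", "?")
            problems.append(f"panel {title!r} has no runbook link or description")
    return problems


good = {
    "title": "Cluster Health",
    "time": {"from": "now-1h", "to": "now"},
    "refresh": "30s",
    "templating": {"list": []},
    "panels": [
        {"type": "row", "title": "order-svc"},
        {"type": "stat", "title": "order-svc error rate",
         "description": "Runbook: see the order-svc on-call page"},
    ],
}
print(lint_oncall_dashboard(good))
```

Run against the investigation dashboards, the same check is expected to complain, which is fine: those are meant to have variables and longer time ranges.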