Grafana Dashboards for Kubernetes

A dashboard nobody looks at is worse than no dashboard — it creates false confidence. Grafana is the standard visualization layer for Kubernetes observability, but building dashboards that engineers actually use during incidents requires deliberate design. This guide covers the essential panels, how to ship dashboards as code, and what the on-call dashboard needs to answer in under 30 seconds.

Built-in Dashboards

The kube-prometheus-stack Helm chart ships ~30 pre-built dashboards. The most useful out of the box:

Dashboard	What to look at
Kubernetes / Compute Resources / Cluster	CPU/memory requests vs limits cluster-wide. Spot over-committed nodes.
Kubernetes / Compute Resources / Namespace	Per-namespace resource breakdown. Find the namespace eating quota.
Kubernetes / Compute Resources / Pod	Container-level CPU throttling and OOM history for a specific pod.
Node Exporter / Nodes	Disk I/O, network saturation, CPU steal — the OS layer below Kubernetes.
Kubernetes / Persistent Volumes	PVC usage percentage. Alert before disks fill.

Essential K8s Panels

For a custom service dashboard, these panels cover the four golden signals (latency, traffic, errors, saturation):

Service dashboard layout — four golden signals

REQUEST RATE (traffic)

sum(rate(http_requests_total[5m]))

Panel type: time series

ERROR RATE

sum(rate(…{status=~"5.."}[5m])) / sum(rate(…[5m]))

Panel type: time series + threshold

LATENCY (p50 / p95 / p99)

histogram_quantile(0.99, sum by (le) (rate(…bucket[5m])))

Panel type: time series

SATURATION (CPU throttle)

rate(container_cpu_cfs_throttled…[5m])

Panel type: gauge + stat

Add a Deployment replicas stat panel and a Recent events table below — these are the first things on-call looks at.

Put the four golden signals in the first row. Below them, add further saturation detail (memory utilization) and K8s-specific panels (replica count, pod restarts).
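The panel queries in the layout above are abbreviated. The Python sketch below writes one full PromQL expression per signal so they can be pasted into panels or fed to a dashboard generator. Several names are assumptions rather than anything this guide specifies: the helper `golden_signal_queries` is hypothetical, the latency histogram is assumed to be called `http_request_duration_seconds`, the application metrics are assumed to carry `namespace`, `deployment`, and `status` labels, and pods are assumed to be named `<deployment>-<suffix>`.

```python
def golden_signal_queries(namespace: str, deployment: str, window: str = "5m") -> dict[str, str]:
    """Build one PromQL expression per golden signal for a single service."""
    # Assumed: application metrics carry namespace and deployment labels.
    app = f'namespace="{namespace}", deployment="{deployment}"'
    # Assumed: pods are named <deployment>-<suffix>; cAdvisor metrics have no deployment label.
    pods = f'namespace="{namespace}", pod=~"{deployment}-.*"'
    return {
        # Requests per second across all pods of the service.
        "traffic": f"sum(rate(http_requests_total{{{app}}}[{window}]))",
        # Share of requests answered with a 5xx status; both sides are summed
        # so the division is not matched series by series.
        "errors": (
            f'sum(rate(http_requests_total{{{app}, status=~"5.."}}[{window}]))'
            f" / sum(rate(http_requests_total{{{app}}}[{window}]))"
        ),
        # p99 latency; buckets are aggregated by le before taking the quantile.
        "latency_p99": (
            "histogram_quantile(0.99, sum by (le) "
            f"(rate(http_request_duration_seconds_bucket{{{app}}}[{window}])))"
        ),
        # Fraction of CPU scheduling periods in which the containers were throttled.
        "saturation": (
            f"sum(rate(container_cpu_cfs_throttled_periods_total{{{pods}}}[{window}]))"
            f" / sum(rate(container_cpu_cfs_periods_total{{{pods}}}[{window}]))"
        ),
    }
```

Calling `golden_signal_queries("shop", "order-svc")` returns a dict keyed by signal name. Swap 0.99 for 0.5 or 0.95 to get the other latency percentiles.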

Dashboard Variables

Variables make a single dashboard reusable across namespaces, services, and clusters. Define them in Dashboard Settings → Variables.

common Grafana variables for K8s dashboards

# Namespace variable
Name: namespace
Query: label_values(kube_pod_info, namespace)
Refresh: On time range change

# Deployment variable (filtered to selected namespace)
Name: deployment
Query: label_values(kube_deployment_spec_replicas{namespace="$namespace"}, deployment)
Refresh: On time range change

# Use in panel queries:
rate(http_requests_total{namespace="$namespace", deployment="$deployment"}[5m])
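When the dashboard is kept as JSON, the same two variables sit under `templating.list` in the dashboard model. The Python sketch below generates that fragment. It is illustrative only: `query_variable` is a hypothetical helper, the data source UID `prometheus` is a placeholder, and the field set is a minimal subset, so compare it with an export from your own Grafana version before relying on it.

```python
import json


def query_variable(name: str, query: str, datasource_uid: str = "prometheus") -> dict:
    """Return one entry for the dashboard's templating.list."""
    return {
        "name": name,
        "type": "query",
        "datasource": {"type": "prometheus", "uid": datasource_uid},  # placeholder UID
        "query": query,
        "refresh": 2,  # 2 = on time range change, 1 = on dashboard load
        "multi": False,
        "includeAll": False,
    }


templating = {
    "list": [
        query_variable("namespace", "label_values(kube_pod_info, namespace)"),
        # The deployment list is filtered by whatever namespace is selected.
        query_variable(
            "deployment",
            'label_values(kube_deployment_spec_replicas{namespace="$namespace"}, deployment)',
        ),
    ]
}

if __name__ == "__main__":
    print(json.dumps({"templating": templating}, indent=2))
```

Order matters: `namespace` comes first because the `deployment` query references `$namespace`.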

Annotations

Annotations overlay events on time-series panels — deployments, restarts, config changes. This makes it immediately obvious whether a latency spike correlates with a recent deploy.

deployment annotation — mark deploys on all panels

# In Grafana: Dashboard Settings → Annotations → Add annotation query
# Data source: Prometheus

# Note: both expressions also fire on scale events (for example from an HPA), not only on rollouts
Query:
  changes(kube_deployment_spec_replicas{namespace="$namespace",deployment="$deployment"}[2m]) > 0
  or
  changes(kube_deployment_status_observed_generation{namespace="$namespace",deployment="$deployment"}[2m]) > 0

Title: Deploy
Tags: deployment

Dashboard as Code

Dashboards clicked together in the Grafana UI are fragile: they live only in Grafana's database, can't be code-reviewed, and are lost when the pod restarts unless Grafana has persistent storage. Manage dashboards as code using JSON files or Grafonnet (a Jsonnet library).

export and commit a dashboard

# Export from Grafana UI: Dashboard → Share → Export → Save to file
# Commit the JSON to git under monitoring/dashboards/

# The JSON is the source of truth — never edit in the UI and forget to export
git add monitoring/dashboards/myapp.json
git commit -m "feat(monitoring): add myapp four-golden-signals dashboard"
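Exported JSON tends to produce noisy diffs because it carries instance-specific fields. The Python sketch below normalizes a dashboard before it is committed. `normalize_dashboard` is a hypothetical helper, and which fields to strip is a judgment call rather than a Grafana requirement. It accepts either a UI export or the wrapped response from Grafana's `GET /api/dashboards/uid/:uid` endpoint.

```python
import json


def normalize_dashboard(payload: dict) -> str:
    """Return a stable JSON string for a dashboard, ready to commit."""
    # The HTTP API wraps the model as {"dashboard": ..., "meta": ...};
    # a UI export is the bare model. Accept both.
    dashboard = dict(payload.get("dashboard", payload))
    dashboard["id"] = None          # the numeric id is specific to one Grafana instance
    dashboard.pop("version", None)  # bumps on every save, which only adds diff noise
    # Sorted keys and fixed indentation keep diffs limited to real changes.
    return json.dumps(dashboard, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
```

The `uid` is left untouched on purpose: it is what keeps links and provisioning pointed at the same dashboard across instances.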

Provisioning via ConfigMap

Grafana's provisioning system loads dashboards from disk on startup. Mount a ConfigMap containing the dashboard JSON — no manual import needed after a pod restart or cluster rebuild.

ConfigMap — auto-provision a dashboard

apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"    # kube-prometheus-stack sidecar watches this label
data:
  myapp.json: |
    {
      "title": "MyApp — Golden Signals",
      "uid": "myapp-golden",
      "tags": ["myapp", "production"],
      ...
    }

The dashboard sidecar container in the kube-prometheus-stack Grafana pod watches for ConfigMaps with the grafana_dashboard: "1" label and loads them into the running Grafana, so no pod restart is needed.
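Pasting dashboard JSON into a manifest by hand does not scale past a few dashboards. The Python sketch below builds the ConfigMap from a dashboard model instead. `dashboard_configmap` is a hypothetical helper, and the label key and `monitoring` namespace simply mirror the manifest above. The output is JSON, which `kubectl apply -f -` accepts as a manifest.

```python
import json


def dashboard_configmap(name: str, filename: str, dashboard: dict,
                        namespace: str = "monitoring") -> dict:
    """Wrap a dashboard model in a ConfigMap manifest the sidecar will pick up."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"grafana_dashboard": "1"},  # label the sidecar watches
        },
        # ConfigMap values must be strings, so the dashboard is embedded as JSON text.
        "data": {filename: json.dumps(dashboard, indent=2)},
    }


if __name__ == "__main__":
    # Placeholder model; in practice, load the committed dashboard file instead.
    model = {"title": "MyApp: Golden Signals", "uid": "myapp-golden"}
    print(json.dumps(dashboard_configmap("myapp-dashboard", "myapp.json", model), indent=2))
```

Run in CI, a script like this keeps the committed JSON file as the single source of truth and treats the ConfigMap as a build artifact.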

Dashboard Organization

At more than a dozen dashboards, discoverability becomes the problem. A useful folder structure:

Grafana folders:
├── Kubernetes/           ← built-in cluster dashboards (from kube-prometheus-stack)
│   ├── Cluster Overview
│   ├── Node Exporter
│   └── Persistent Volumes
├── Services/             ← per-service dashboards (one per team/app)
│   ├── order-svc
│   ├── inventory-svc
│   └── api-gateway
└── On-Call/              ← triage dashboards, always visible on the NOC screen
    ├── Cluster Health
    └── SLO Overview

On-Call Dashboard Design

The on-call dashboard is the one opened during a 3am page. Design it for triage, not exploration:

One row per service — error rate (stat + sparkline), p99 latency (stat), replica count (stat). Green/red at a glance.
Time range defaults to 1h — long enough to see the incident start, short enough to load fast.
No variables required — all namespaces and critical services are always visible. Variables are for exploration, not triage.
Link to runbooks — panel titles or descriptions link to the runbook for that alert. The dashboard is the entry point.
Refresh every 30s — the incident is live; stale data is worse than no data.
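Because on-call dashboards are stored as JSON, part of this checklist can be enforced in CI. The Python sketch below checks a dashboard model against the rules above. `oncall_violations` is a hypothetical helper, and it makes a simplifying assumption: any panel description or panel link counts as the runbook pointer, since the code cannot tell whether the link is the right one.

```python
def oncall_violations(dashboard: dict) -> list[str]:
    """List the ways a dashboard model departs from the on-call checklist."""
    problems = []
    if dashboard.get("time", {}).get("from") != "now-1h":
        problems.append("default time range is not the last 1h")
    if dashboard.get("refresh") != "30s":
        problems.append("auto-refresh is not 30s")
    if dashboard.get("templating", {}).get("list"):
        problems.append("dashboard defines variables")
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            continue  # row headers carry no data, so no runbook is expected
        # Assumption: a description or a panel link is where the runbook URL lives.
        if not panel.get("description") and not panel.get("links"):
            problems.append(f"panel {panel.get('title', '?')!r} has no runbook link")
    return problems
```

An empty list means the dashboard passes. A CI job could fail the build whenever the list is non-empty for anything under the On-Call folder.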

Separate exploration from triage dashboards

Dense exploration dashboards with many variables and panels are useful for investigation. They are not useful when paged at 3am. Keep the on-call dashboard simple and fast-loading. Use separate investigation dashboards linked from the alert annotations.