Metrics Server & Prometheus Integration
Kubernetes has two separate metrics pipelines that are often confused. Metrics Server serves the near-real-time resource snapshot that the HPA and kubectl top need. Prometheus is a time-series database that scrapes metrics from the targets you configure across the cluster and stores them for querying and alerting. You almost always need both.
Two Metric Systems
| | Metrics Server | Prometheus |
|---|---|---|
| What it is | In-memory aggregator of kubelet resource stats | Time-series database with pull-based scraping |
| Retention | Most recent samples only (in memory, no history) | Configurable, from days to years |
| Used by | HPA, VPA, kubectl top | Grafana dashboards, Alertmanager, custom tooling |
| Install | Single deployment, 1–2 replicas | Full stack: prometheus, alertmanager, exporters |
| Query | Kubernetes Metrics API | PromQL |
Metrics Server
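Metrics Server scrapes CPU and memory usage from the kubelet on each node and serves it through the metrics.k8s.io API group. The scrape interval is set by --metric-resolution: the flag defaults to 60 seconds, and the official manifest sets it to 15 seconds. Releases from v0.6 onward read the kubelet's /metrics/resource endpoint; older releases used the Summary API. Install it with the official manifest: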
# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify it's running
kubectl get apiservice v1beta1.metrics.k8s.io
# Use it
kubectl top nodes
kubectl top pods -n production --sort-by=memory
On local clusters such as kind or minikube, the kubelet serving certificates are usually not signed by a CA that Metrics Server trusts, so it cannot verify them and never becomes ready. Adding --kubelet-insecure-tls to the Metrics Server container args skips that verification. Use it in dev environments only.
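The metrics.k8s.io API reports usage as Kubernetes quantity strings rather than plain numbers. The sketch below converts a few of them to cores and bytes. The sample item is hypothetical (shaped like a NodeMetrics entry), and only the suffixes listed in the two tables are handled, not the full quantity grammar.

```python
# Minimal sketch: convert Metrics API quantity strings to plain numbers.
# Assumption: only the suffixes in the two tables below are supported.

CPU_SUFFIXES = {"n": 1e-9, "u": 1e-6, "m": 1e-3}               # fractions of one core
MEMORY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}   # binary multiples of a byte


def cpu_to_cores(quantity: str) -> float:
    """Return CPU usage in cores, e.g. '250m' is a quarter of a core."""
    for suffix, factor in CPU_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return float(quantity)  # no suffix: the value is already in cores


def memory_to_bytes(quantity: str) -> int:
    """Return memory usage in bytes, e.g. '2048Ki' is 2 MiB."""
    for suffix, factor in MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: the value is already in bytes


# Hypothetical item, shaped like one entry of the node metrics list
sample_node = {
    "metadata": {"name": "node-1"},
    "usage": {"cpu": "156340000n", "memory": "2048Ki"},
}

cores = cpu_to_cores(sample_node["usage"]["cpu"])
mem = memory_to_bytes(sample_node["usage"]["memory"])
print(f"{sample_node['metadata']['name']}: {cores:.3f} cores, {mem} bytes")
```

kubectl top performs a similar conversion before printing its table, which is why it shows millicores and Mi rather than the raw strings.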
Prometheus Architecture
# Install kube-prometheus-stack (bundles the Prometheus Operator, Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter)
# "changeme" is a placeholder: set your own Grafana admin password
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=changeme \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
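Whether 50Gi is enough for 30 days of retention depends on how many samples Prometheus ingests. The Prometheus storage documentation gives a rough capacity formula: retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes). The sketch below applies it. The ingestion rate of 10,000 samples per second is a hypothetical figure; measure your own, for example with rate(prometheus_tsdb_head_samples_appended_total[1h]).

```python
# Rough disk estimate for Prometheus local storage.
# Assumptions: 2 bytes per sample (upper end of the documented 1-2 byte range)
# and a hypothetical ingestion rate; neither value comes from a real cluster.

def estimate_disk_bytes(retention_days: int,
                        samples_per_second: int,
                        bytes_per_sample: int = 2) -> int:
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


needed = estimate_disk_bytes(retention_days=30, samples_per_second=10_000)
print(f"~{needed / 1024**3:.1f} GiB needed")  # prints "~48.3 GiB needed"
```

At this assumed rate the 50Gi volume is only just large enough, so a busier cluster would need a larger claim or a shorter retention.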
kube-state-metrics
The kubelet exposes container CPU and memory usage, but it knows nothing about Deployment replicas, Pod phase, HPA targets, or node conditions. kube-state-metrics fills that gap: it watches the Kubernetes API and exposes object state as Prometheus metrics, mostly gauges.
| Metric | What it tells you |
|---|---|
| kube_deployment_status_replicas_available | Available replicas; compare with kube_deployment_spec_replicas to spot degraded Deployments. |
| kube_pod_status_phase | One 0/1 gauge per pod and phase (Pending/Running/Succeeded/Failed/Unknown); sum it to count pods per phase and namespace. |
| kube_node_status_condition | Node Ready, DiskPressure, MemoryPressure conditions. |
| kube_persistentvolumeclaim_status_phase | Pending/Bound/Lost PVCs. |
| kube_job_status_failed | Number of failed pods per Job; useful for CronJob alerting. |
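kube_pod_status_phase emits one series per pod and phase, set to 1 for the pod's current phase and 0 for the others, so pod counts come from summing values rather than counting series. The sketch below shows this on a hypothetical, abridged scrape (two pods, two phases each; real output carries more labels and every phase).

```python
import re
from collections import defaultdict

# Hypothetical, abridged kube-state-metrics output
EXPOSITION = """\
kube_pod_status_phase{namespace="production",pod="api-1",phase="Pending"} 0
kube_pod_status_phase{namespace="production",pod="api-1",phase="Running"} 1
kube_pod_status_phase{namespace="production",pod="api-2",phase="Pending"} 1
kube_pod_status_phase{namespace="production",pod="api-2",phase="Running"} 0
"""

LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')   # metric{labels} value
LABEL = re.compile(r'(\w+)="([^"]*)"')          # key="value" pairs


def pods_per_phase(text: str) -> dict:
    """Sum the 0/1 gauge by (namespace, phase), like sum by (namespace, phase)."""
    totals = defaultdict(int)
    for line in text.splitlines():
        match = LINE.match(line)
        if not match or match.group(1) != "kube_pod_status_phase":
            continue
        labels = dict(LABEL.findall(match.group(2)))
        totals[(labels["namespace"], labels["phase"])] += int(float(match.group(3)))
    return dict(totals)


series_count = len(EXPOSITION.splitlines())
counts = pods_per_phase(EXPOSITION)
print(f"{series_count} series, but per-phase pod counts are:")
for (namespace, phase), pods in sorted(counts.items()):
    print(f"  {namespace} {phase}: {pods}")
```

Counting series here would report four pods; summing the values reports one Pending and one Running.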
Scrape Configs & ServiceMonitor
The Prometheus Operator introduces ServiceMonitor and PodMonitor CRDs. Instead of editing Prometheus config files, you declare what to scrape in a Kubernetes object.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    release: kube-prometheus-stack # must match Prometheus selector
spec:
  selector:
    matchLabels:
      app: myapp # select Services with this label
  endpoints:
    - port: http # named port on the Service
      path: /metrics
      interval: 15s
      scheme: http
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp # must match ServiceMonitor selector
spec:
  ports:
    - name: http # port name must match ServiceMonitor
      port: 8080
      targetPort: 8080
  selector:
    app: myapp
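Most "target never shows up" problems come down to one of the matches in the two manifests above failing silently. The sketch below spells those checks out on plain dictionaries that stand in for the manifests. The field names are simplified for illustration, and it assumes the ServiceMonitor has no namespaceSelector, so it only looks at Services in its own namespace.

```python
# Plain-dict stand-ins for the ServiceMonitor and Service (illustrative field names,
# not the real object schema).
service_monitor = {
    "namespace": "production",
    "match_labels": {"app": "myapp"},
    "endpoint_ports": ["http"],
}
service = {
    "namespace": "production",
    "labels": {"app": "myapp"},
    "port_names": ["http"],
}


def scrape_problems(sm: dict, svc: dict) -> list:
    """Return the reasons this ServiceMonitor would not scrape this Service."""
    problems = []
    # Assumption: no namespaceSelector, so both objects must share a namespace.
    if sm["namespace"] != svc["namespace"]:
        problems.append("namespace differs")
    # Every matchLabels entry must be present on the Service with the same value.
    for key, value in sm["match_labels"].items():
        if svc["labels"].get(key) != value:
            problems.append(f"label {key}={value} missing on Service")
    # Endpoints reference Service ports by name, not by number.
    for port in sm["endpoint_ports"]:
        if port not in svc["port_names"]:
            problems.append(f"no Service port named {port!r}")
    return problems


print(scrape_problems(service_monitor, service) or "ServiceMonitor selects this Service")
```

Renaming the Service port to, say, web while leaving port: http in the ServiceMonitor makes the function report the missing port name, which mirrors how the target would drop out of Prometheus.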
PromQL Basics
# CPU usage per pod (cores)
sum by (pod, namespace) (
rate(container_cpu_usage_seconds_total{container!=""}[5m])
)
# Memory usage per pod (bytes)
sum by (pod, namespace) (
container_memory_working_set_bytes{container!=""}
)
# Deployment availability ratio
kube_deployment_status_replicas_available /
kube_deployment_spec_replicas
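# HTTP error ratio (requires the app to expose http_requests_total).
# Aggregate both sides first: dividing the raw series matches each 5xx series only with itself and always returns 1.
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Pods not running. kube_pod_status_phase has one 0/1 series per pod and phase, so sum the values instead of counting series.
sum by (namespace, phase) (kube_pod_status_phase{phase!~"Running|Succeeded"}) > 0
# Fraction of node memory still available (node-exporter metrics); low values signal memory pressure
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes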
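The error-rate query divides the per-second rate of 5xx responses by the per-second rate of all responses. The sketch below works through that arithmetic on hypothetical counter values for a single 5-minute window. It uses a simplified rate, (last - first) / window, whereas the real rate() function also handles counter resets and extrapolates to the window edges.

```python
# Worked example of the error-ratio arithmetic behind the PromQL query.
WINDOW_SECONDS = 300  # matches the [5m] range in the query

# Hypothetical http_requests_total samples per status code:
# status -> (counter value at window start, counter value at window end)
samples = {
    "200": (1000, 1600),
    "500": (10, 40),
}


def simple_rate(first: float, last: float, window: int = WINDOW_SECONDS) -> float:
    """Per-second increase over the window (no reset handling, no extrapolation)."""
    return (last - first) / window


rates = {status: simple_rate(first, last) for status, (first, last) in samples.items()}

# Aggregate across series before dividing: 5xx rate over the rate of everything.
error_rate = sum(rate for status, rate in rates.items() if status.startswith("5"))
total_rate = sum(rates.values())
error_ratio = error_rate / total_rate

print(f"errors: {error_rate:.2f}/s, total: {total_rate:.2f}/s, ratio: {error_ratio:.2%}")
```

With these numbers the service handles 2.1 requests per second, 0.1 of which fail, giving a ratio of about 4.76%.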
Recording Rules & Alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: myapp
      interval: 30s
      rules:
        # Recording rule: pre-compute the per-job error ratio.
        # Both sides are aggregated by job so the division matches and the job label survives for the alert.
        - record: job:http_error_rate:ratio5m
          expr: |
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
        # Alert rule: fire when the error rate exceeds 1%
        - alert: HighErrorRate
          expr: job:http_error_rate:ratio5m > 0.01
          for: 5m # must be true for 5 min before firing
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "High error rate on {{ $labels.job }}"
            description: "Error rate is {{ $value | humanizePercentage }} over 5m."
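The for: 5m clause means the alert sits in a pending state until the expression has been true at every evaluation for five minutes, and any false evaluation resets the clock. The sketch below simulates that state machine for a 30-second evaluation interval. It is a simplified model with hypothetical input values, not Prometheus's implementation, and it ignores features such as keep_firing_for.

```python
# Simplified model of how an alert with a "for" duration moves between states.

def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None  # time at which the expression first became true
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held_long_enough = now - active_since >= for_seconds
            states.append("firing" if held_long_enough else "pending")
        else:
            active_since = None  # one false evaluation resets the timer
            states.append("inactive")
    return states


# Hypothetical error ratios: 2% at every one of 12 evaluations
steady = alert_states([0.02] * 12, threshold=0.01, for_seconds=300, interval_seconds=30)
print(steady)

# Same, but with a single healthy evaluation in the middle
with_dip = alert_states([0.02] * 5 + [0.0] + [0.02] * 11,
                        threshold=0.01, for_seconds=300, interval_seconds=30)
print(with_dip.index("firing"))
```

In the steady case the alert is pending for ten evaluations and fires on the eleventh, five minutes after the first breach. The single dip in the second case pushes the first firing evaluation out to index 16, because the five-minute clock restarts.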
kubectl Commands
# Check Metrics Server API availability
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml | grep -A5 status
# Top nodes/pods
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl top pods -n production --sort-by=memory
# Port-forward Prometheus UI
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090 -n monitoring
# Port-forward Grafana (the chart's Service listens on port 80 by default, so map local 3000 to it)
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
# List all ServiceMonitors
kubectl get servicemonitor -A
# Check Prometheus targets through the HTTP API after port-forwarding (or open /targets in the UI)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job:.labels.job, health:.health}'