GPU Workloads on Kubernetes
GPUs are expensive, and idle GPUs are wasted money. Kubernetes manages GPU allocation through the device plugin framework — GPUs are exposed as schedulable resources just like CPU and memory. This guide covers the full stack: device plugin installation, resource requests, multi-instance GPU partitioning, and monitoring GPU utilisation to catch idle allocations.
NVIDIA Device Plugin
The NVIDIA device plugin runs as a DaemonSet on GPU nodes. It discovers GPUs on each node via the NVIDIA Container Toolkit, registers them with the kubelet as nvidia.com/gpu resources, and configures containers to access the GPU device files.
# Prerequisites on GPU nodes: NVIDIA drivers + nvidia-container-toolkit installed
# Verify on a GPU node:
nvidia-smi
# Install device plugin via Helm
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set runtimeClassName=nvidia   # only if containerd exposes the NVIDIA runtime through a RuntimeClass
# Verify: GPU nodes now advertise nvidia.com/gpu capacity
kubectl get nodes -o json | jq '.items[] | select(.status.capacity["nvidia.com/gpu"] != null) | {name:.metadata.name, gpus:.status.capacity["nvidia.com/gpu"]}'
Requesting GPUs
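Specify GPUs under limits. For extended resources such as nvidia.com/gpu, Kubernetes sets the request equal to the limit when only the limit is given; if you specify both, they must be equal. GPUs cannot be overcommitted or requested in fractions. A container that is allocated 1 GPU gets exclusive access to it: by default, GPUs are not shared between containers or pods.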
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
  namespace: ml
spec:
  runtimeClassName: nvidia # use NVIDIA container runtime
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 1 # request 1 GPU; must be in limits
        memory: "32Gi"
        cpu: "8"
      requests:
        memory: "32Gi"
        cpu: "8"
    env:
    # Do not set NVIDIA_VISIBLE_DEVICES here. The device plugin injects it for the
    # allocated GPU; overriding it with "all" can expose every GPU on the node.
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: compute,utility
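One way to confirm the allocation is to count the devices the container can actually see. The sketch below assumes the gpu-training pod from the manifest above is running and that nvidia-smi is available in the image. `count_visible_gpus` and `pod_gpu_count` are hypothetical helper names, not part of any NVIDIA or Kubernetes tooling.

```shell
# Hypothetical helper: count physical GPUs in `nvidia-smi -L` output read from stdin.
# Each physical GPU is listed on a line that starts with "GPU <index>:".
count_visible_gpus() {
  grep -c '^GPU [0-9][0-9]*:' || true   # grep exits non-zero on a count of 0; keep going
}

# Hypothetical wrapper: run nvidia-smi inside a pod and count what it sees.
# Arguments: namespace, pod name.
pod_gpu_count() {
  kubectl exec -n "$1" "$2" -- nvidia-smi -L | count_visible_gpus
}
```

For the pod above, `pod_gpu_count ml gpu-training` should print 1. A larger number suggests the container can see more devices than it was allocated.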
GPU Node Pools
Isolate GPU nodes to prevent non-GPU workloads from consuming expensive GPU instances:
# Taint GPU nodes — non-GPU pods won't schedule here
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# Label GPU nodes for affinity targeting
kubectl label nodes gpu-node-1 \
  accelerator=nvidia-a100 \
  gpu-memory=80Gi
# GPU pod must tolerate the taint AND use affinity to land on GPU nodes
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator
            operator: In
            values: ["nvidia-a100", "nvidia-h100"]
Time-Slicing
By default, one pod owns one GPU entirely. GPU time-slicing allows multiple pods to share a GPU by multiplexing access in time. This cuts cost for inference workloads that don't need 100% GPU utilisation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4 # 4 pods can share each physical GPU
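With replicas set to 4, a node with 2 physical GPUs advertises nvidia.com/gpu: 8. Each replica is a shared slot on a physical GPU, not a guaranteed quarter of it: the GPU alternates between whichever processes are active, so a pod's share of compute time depends on what its neighbours are doing. There is no memory isolation either. Pods share VRAM, and one pod can exhaust it and cause out-of-memory errors in the others.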
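The advertised count is simply physical GPUs multiplied by replicas, which makes it easy to check that a node picked up the time-slicing config. A minimal sketch follows. `advertised_gpus` and `node_allocatable_gpus` are hypothetical helper names, and the second one assumes kubectl access to the cluster.

```shell
# Expected nvidia.com/gpu count a node advertises under time-slicing.
# Arguments: physical GPUs on the node, replicas from the ConfigMap.
advertised_gpus() {
  echo $(( $1 * $2 ))
}

# Hypothetical wrapper: read what a node actually advertises.
# Argument: node name.
node_allocatable_gpus() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

`advertised_gpus 2 4` prints 8, matching the example above. If `node_allocatable_gpus gpu-node-1` still prints the physical count, the device plugin has probably not loaded the sharing config.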
MIG Partitioning
NVIDIA Multi-Instance GPU (MIG) partitions an A100 or H100 into up to 7 isolated GPU instances with dedicated compute and memory. Unlike time-slicing, MIG provides true hardware isolation — one pod's CUDA errors don't affect another.
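# A100 80GB MIG profiles (partial list). The name encodes compute slices (out of 7) and memory:
# nvidia.com/mig-1g.10gb: 1 compute slice, 10 GB VRAM
# nvidia.com/mig-2g.20gb: 2 compute slices, 20 GB VRAM
# nvidia.com/mig-4g.40gb: 4 compute slices, 40 GB VRAM
# nvidia.com/mig-7g.80gb: all 7 compute slices, 80 GB VRAM (the full A100 as one instance)
# These resource names are exposed when the device plugin uses the "mixed" MIG strategy.
resources:
  limits:
    nvidia.com/mig-2g.20gb: 1 # request one 2g.20gb MIG instance (20 GB VRAM)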
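MIG resource names follow a "<N>g.<M>gb" pattern, where N is the number of compute slices and M is the memory in GB. The sketch below splits a name into those two numbers, which can help when sizing requests in scripts. `parse_mig_profile` is a hypothetical helper, and it assumes the plain naming shown above with no suffix after "gb".

```shell
# Split a MIG resource name into its compute-slice count and memory size in GB.
# Argument: resource name such as nvidia.com/mig-2g.20gb.
parse_mig_profile() {
  mig_name=${1##*mig-}          # "nvidia.com/mig-2g.20gb" -> "2g.20gb"
  mig_slices=${mig_name%%g.*}   # "2g.20gb" -> "2"
  mig_mem=${mig_name#*g.}       # "2g.20gb" -> "20gb"
  mig_mem=${mig_mem%gb}         # "20gb"    -> "20"
  echo "$mig_slices $mig_mem"
}
```

`parse_mig_profile nvidia.com/mig-2g.20gb` prints `2 20`: two compute slices and 20 GB of VRAM.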
Scheduling GPU Workloads
Best practices for scheduling:
| Workload type | Strategy |
|---|---|
| Training (large) | 1 pod per GPU, each requesting a full GPU. For multi-GPU training, use NCCL across pods, coordinated by an operator or framework (Kubeflow, Ray). |
| Inference (latency) | Dedicated GPU node pool. Request full GPU per replica for predictable latency. HPA on custom metrics (queue depth). |
| Inference (batch) | Time-slicing or MIG to share GPU across replicas. Cheaper but higher latency variance. |
| Notebooks (interactive) | Time-slicing or MIG slices. Idle notebooks waste expensive GPU time, so enforce resource quotas and a time limit on notebook pods. |
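The resource quotas mentioned for notebooks can be expressed as a ResourceQuota on the namespace. The sketch below prints such a manifest so it can be piped to kubectl. `gpu_quota_manifest` and the quota name gpu-quota are hypothetical; the namespace and GPU count are whatever you pass in.

```shell
# Hypothetical helper: print a ResourceQuota that caps GPU requests in a namespace.
# Arguments: namespace, maximum number of nvidia.com/gpu units.
# Quotas on extended resources use the "requests." prefix.
gpu_quota_manifest() {
  cat <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: $1
spec:
  hard:
    requests.nvidia.com/gpu: "$2"
EOF
}
```

A possible invocation is `gpu_quota_manifest ml 4 | kubectl apply -f -`. The quota counts advertised nvidia.com/gpu units, so on time-sliced nodes it limits shared slots, not physical GPUs.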
Monitoring GPU Utilisation
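# NVIDIA DCGM Exporter exposes GPU metrics to Prometheus
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace kube-system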
# Key metrics:
# DCGM_FI_DEV_GPU_UTIL — GPU compute utilisation (%)
# DCGM_FI_DEV_MEM_COPY_UTIL — memory bandwidth utilisation (%)
# DCGM_FI_DEV_FB_USED — framebuffer (VRAM) used (MiB)
# DCGM_FI_DEV_POWER_USAGE — power draw (W)
# DCGM_FI_DEV_SM_CLOCK — SM clock frequency (MHz)
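# Alert: GPU sitting idle while pod is running (wasted money)
# Assumes the DCGM series carry the namespace and pod labels of the workload using the GPU;
# matching on both labels avoids collisions between same-named pods in different namespaces.
alert: GPUUnderutilised
expr: DCGM_FI_DEV_GPU_UTIL < 10 and on(namespace, pod) kube_pod_status_phase{phase="Running"} == 1
for: 30m
labels:
  severity: warning
annotations:
  summary: "GPU allocated but utilisation < 10% for 30 minutes"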
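For a quick spot check on a single node, without Prometheus, the same threshold can be applied to nvidia-smi output. The sketch below filters one sample; it does not cover the 30-minute window the alert uses. `idle_gpus` is a hypothetical helper name.

```shell
# Print the index of every GPU whose utilisation is below a threshold.
# Reads "index, utilisation" CSV lines from stdin; argument: threshold in percent.
idle_gpus() {
  awk -F', *' -v threshold="$1" '$2 + 0 < threshold + 0 { print $1 }'
}
```

On a GPU node, one possible invocation is `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | idle_gpus 10`, which lists the GPUs currently under 10% utilisation.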
GPU Container Images
| Registry | Image | Use |
|---|---|---|
| NGC | nvcr.io/nvidia/pytorch:24.01-py3 | PyTorch training — includes CUDA, cuDNN, NCCL |
| NGC | nvcr.io/nvidia/tensorflow:24.01-tf2-py3 | TensorFlow training |
| NGC | nvcr.io/nvidia/tritonserver:24.01-py3 | Production inference server — multi-framework |
| Docker Hub | nvidia/cuda:12.3.2-runtime-ubuntu22.04 | Minimal CUDA runtime — build your own image on top |
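The CUDA version in the container must be ≤ the highest CUDA version supported by the GPU driver on the node. NVIDIA drivers are backward compatible, so the usual failure runs the other way round: an image built against a newer CUDA toolkit fails to start on nodes whose driver is too old. Pin container image versions and node driver versions together, and upgrade node drivers before rolling out images that need a newer CUDA release.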
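The version rule can be checked in a script before an image is rolled out. The sketch below compares major and minor versions only, and it deliberately applies the conservative rule stated above without accounting for any compatibility packages. `cuda_image_fits_driver` is a hypothetical helper that expects versions in MAJOR.MINOR or MAJOR.MINOR.PATCH form.

```shell
# Succeeds when the image's CUDA version is <= the highest CUDA version the driver supports.
# Arguments: image CUDA version, driver-supported CUDA version.
cuda_image_fits_driver() {
  img_major=${1%%.*}; img_rest=${1#*.}; img_minor=${img_rest%%.*}
  drv_major=${2%%.*}; drv_rest=${2#*.}; drv_minor=${drv_rest%%.*}
  [ "$img_major" -lt "$drv_major" ] ||
    { [ "$img_major" -eq "$drv_major" ] && [ "$img_minor" -le "$drv_minor" ]; }
}
```

For example, `cuda_image_fits_driver 12.3.2 12.4 && echo ok` prints ok, while an image at 12.4 on a driver that supports 12.3 fails the check. The driver-supported version appears as "CUDA Version" in the header that nvidia-smi prints on the node.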