GPU Workloads on Kubernetes

GPUs are expensive, and idle GPUs are wasted money. Kubernetes manages GPU allocation through the device plugin framework, which exposes GPUs as schedulable resources alongside CPU and memory, with the difference that GPUs are requested in whole units and cannot be overcommitted. This guide covers the full stack: device plugin installation, resource requests, node pool isolation, time-slicing, Multi-Instance GPU (MIG) partitioning, and monitoring GPU utilisation to catch idle allocations.

NVIDIA Device Plugin

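The NVIDIA device plugin runs as a DaemonSet on GPU nodes. It discovers the GPUs on each node through the driver's management library (NVML), registers them with the kubelet as nvidia.com/gpu resources, and tells the kubelet which devices each container was allocated. The NVIDIA Container Toolkit then makes the matching device files and driver libraries available inside the container.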

install NVIDIA device plugin

# Prerequisites on GPU nodes: NVIDIA drivers + nvidia-container-toolkit installed
# Verify on a GPU node:
nvidia-smi

# Install device plugin via Helm
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set runtimeClassName=nvidia    # if using containerd with nvidia runtime class

# Verify: GPU nodes now advertise nvidia.com/gpu capacity
kubectl get nodes -o json | jq '.items[] | select(.status.capacity["nvidia.com/gpu"] != null) | {name:.metadata.name, gpus:.status.capacity["nvidia.com/gpu"]}'

Requesting GPUs

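Declare GPUs under limits. Kubernetes copies the limit into the request when no request is given, rejects a GPU request that has no matching limit, and requires the two values to be equal if both are set. GPUs are counted in whole units, so fractional values are invalid. A container that is granted 1 GPU gets exclusive access: by default, GPUs are not shared between containers or pods.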

pod requesting a GPU

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
  namespace: ml
spec:
  runtimeClassName: nvidia           # use NVIDIA container runtime
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 1            # request 1 GPU — must be in limits
        memory: "32Gi"
        cpu: "8"
      requests:
        memory: "32Gi"
        cpu: "8"
    env:
    # NVIDIA_VISIBLE_DEVICES is deliberately not set here: the device plugin injects it
    # for the allocated GPU, and forcing it to "all" can expose every GPU on the node.
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: compute,utility
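
The limits rule described above can be checked mechanically before a manifest is applied. The following is a minimal sketch in Python, not part of any Kubernetes tooling: the function name check_gpu_resources and the plain-dict input shape are assumptions made for illustration, and the checks only mirror the three rules stated in this section.

```python
def check_gpu_resources(resources: dict, name: str = "nvidia.com/gpu") -> list[str]:
    """Return a list of problems with a container's GPU settings; empty means OK.

    `resources` is the container's `resources` mapping as a plain dict
    (hypothetical input shape, e.g. loaded from a YAML manifest).
    """
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    problems = []

    # A GPU request without a limit is not accepted for extended resources.
    if name in requests and name not in limits:
        problems.append(f"{name} is set in requests but not in limits")

    # If both are present they must be equal (no overcommit).
    if name in limits and name in requests and str(limits[name]) != str(requests[name]):
        problems.append(f"{name} request and limit must be equal")

    # Extended resources are whole numbers only.
    for section, values in (("limits", limits), ("requests", requests)):
        if name in values and not str(values[name]).isdigit():
            problems.append(f"{name} in {section} must be a whole number")

    return problems


spec = {
    "limits": {"nvidia.com/gpu": 1, "memory": "32Gi", "cpu": "8"},
    "requests": {"memory": "32Gi", "cpu": "8"},
}
print(check_gpu_resources(spec))  # []
```

The resources block from the pod above passes because the GPU appears only under limits, which is the simplest valid form.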

GPU allocation — device plugin flow

Scheduler: finds a node with nvidia.com/gpu ≥ 1 available

↓

Kubelet → Device Plugin: allocates a GPU device, injects /dev/nvidia0 + env vars

↓

Container: CUDA code runs; GPU exclusively owned until pod exits

The device plugin picks a specific GPU for the container and returns its device files and environment variables (such as NVIDIA_VISIBLE_DEVICES) to the kubelet, which passes them on to the container runtime. The GPU stays reserved for that pod until the pod terminates.
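
To make the exclusive-ownership behaviour concrete, here is a toy model of the flow in Python. It is a sketch under simplifying assumptions, not the device plugin's implementation: the GpuNode class is hypothetical, devices are identified by index (the real plugin typically identifies GPUs by UUID), and only whole-GPU allocation is modelled.

```python
class GpuNode:
    """Toy model of a node's GPU bookkeeping (hypothetical, for illustration)."""

    def __init__(self, name: str, device_ids: list[int]):
        self.name = name
        self.free = list(device_ids)        # GPUs nobody holds
        self.held: dict[str, list[int]] = {}  # pod name -> GPUs it holds

    def allocate(self, pod: str, count: int):
        """Give `count` GPUs to `pod`, or return None if too few are free."""
        if count > len(self.free):
            return None  # the scheduler would not have placed the pod here
        devices, self.free = self.free[:count], self.free[count:]
        self.held[pod] = devices
        return {
            "devices": [f"/dev/nvidia{d}" for d in devices],
            "env": {"NVIDIA_VISIBLE_DEVICES": ",".join(str(d) for d in devices)},
        }

    def release(self, pod: str) -> None:
        """Return the pod's GPUs to the free pool when the pod terminates."""
        self.free.extend(self.held.pop(pod, []))


node = GpuNode("gpu-node-1", [0, 1])
first = node.allocate("trainer-a", 1)
print(first["devices"])                  # ['/dev/nvidia0']
print(node.allocate("trainer-b", 2))     # None
node.release("trainer-a")
print(node.allocate("trainer-b", 2) is not None)  # True
```

The second pod cannot be placed while the first one holds a GPU, even if that GPU is sitting idle, which is why idle allocations are costly.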

GPU Node Pools

Isolate GPU nodes to prevent non-GPU workloads from consuming expensive GPU instances:

# Taint GPU nodes — non-GPU pods won't schedule here
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

# Label GPU nodes for affinity targeting
kubectl label nodes gpu-node-1 \
  accelerator=nvidia-a100 \
  gpu-memory=80Gi

# GPU pod must tolerate the taint AND use affinity to land on GPU nodes
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator
            operator: In
            values: ["nvidia-a100", "nvidia-h100"]
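
The taint and toleration pairing above follows a small matching rule, sketched here in Python. This is an illustrative reimplementation of the documented matching semantics, not scheduler code; the function names tolerates and can_schedule are assumptions, and node affinity is left out.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty or missing effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")  # Equal is the default
    if operator == "Exists":
        # Exists ignores the value; an empty key tolerates every taint.
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def can_schedule(tolerations: list[dict], taints: list[dict]) -> bool:
    """A pod can land on a node only if every hard taint is tolerated."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
gpu_pod = [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]

print(can_schedule(gpu_pod, [gpu_taint]))  # True
print(can_schedule([], [gpu_taint]))       # False
```

The toleration only permits scheduling on the tainted node; it is the node affinity rule that actively steers the pod there, which is why the spec uses both.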

Time-Slicing

By default, one pod owns one GPU entirely. GPU time-slicing allows multiple pods to share a GPU by multiplexing access in time. This cuts cost for inference workloads that don't need 100% GPU utilisation.

device plugin ConfigMap — enable time-slicing

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4              # 4 pods can share each physical GPU

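With replicas set to 4, a node with 2 physical GPUs advertises nvidia.com/gpu: 8. Each advertised replica is a share of one physical GPU, so up to four pods take turns on the same device. The split is not guaranteed to be equal; a pod gets roughly a quarter of the compute time only when all four are busy. There is no memory or fault isolation: pods on the same GPU share its VRAM and can cause out-of-memory failures for each other.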
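
The capacity arithmetic and the shared-VRAM risk can be worked through in a few lines of Python. This is a back-of-the-envelope sketch: the helper names are hypothetical, and the 80 GiB figure is an assumption taken from the gpu-memory label used earlier in this guide.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """nvidia.com/gpu capacity a node reports with time-slicing enabled."""
    return physical_gpus * replicas


def fits_in_vram(vram_gib: float, per_pod_gib: list[float]) -> bool:
    """True if the pods sharing one physical GPU fit in its memory together.

    Time-slicing does not enforce this; it is a check you have to do yourself.
    """
    return sum(per_pod_gib) <= vram_gib


print(advertised_gpus(2, 4))                 # 8
print(fits_in_vram(80, [15, 15, 15, 15]))    # True
print(fits_in_vram(80, [30, 30, 30, 30]))    # False
```

In the last case the scheduler would still place all four pods, because it only counts replicas, and the failure would show up at runtime as an out-of-memory error.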

MIG Partitioning

NVIDIA Multi-Instance GPU (MIG) partitions an A100 or H100 into up to 7 isolated GPU instances with dedicated compute and memory. Unlike time-slicing, MIG provides true hardware isolation — one pod's CUDA errors don't affect another.

request a MIG instance

# A100 80 GB MIG profiles (partial list), exposed under these names with the "mixed" MIG strategy:
# nvidia.com/mig-1g.10gb   : 1 compute slice (1/7 of the GPU), 10 GB VRAM
# nvidia.com/mig-2g.20gb   : 2 compute slices, 20 GB VRAM
# nvidia.com/mig-4g.40gb   : 4 compute slices, 40 GB VRAM
# nvidia.com/mig-7g.80gb   : all 7 compute slices, the full A100

resources:
  limits:
    nvidia.com/mig-2g.20gb: 1      # request a 20 GB MIG slice
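
The profile names encode the compute and memory share directly, which makes rough capacity planning easy to script. The Python sketch below parses a resource name and checks whether a set of profiles stays within one GPU's budget. The function names are hypothetical, the default budget of 7 compute slices and 80 GB assumes an A100 80 GB, and the check is only a necessary condition: real MIG placement rules are stricter than a simple sum.

```python
import re

# Matches names such as nvidia.com/mig-2g.20gb
PROFILE = re.compile(r"^nvidia\.com/mig-(\d+)g\.(\d+)gb$")


def parse_mig_resource(name: str) -> tuple[int, int]:
    """Return (compute_slices, memory_gb) for a MIG resource name."""
    match = PROFILE.match(name)
    if match is None:
        raise ValueError(f"not a MIG resource name: {name}")
    return int(match.group(1)), int(match.group(2))


def may_fit(resource_names: list[str], compute_slices: int = 7, memory_gb: int = 80) -> bool:
    """Necessary (not sufficient) check that the profiles fit on one GPU."""
    parsed = [parse_mig_resource(n) for n in resource_names]
    total_compute = sum(c for c, _ in parsed)
    total_memory = sum(m for _, m in parsed)
    return total_compute <= compute_slices and total_memory <= memory_gb


print(parse_mig_resource("nvidia.com/mig-2g.20gb"))  # (2, 20)
print(may_fit(["nvidia.com/mig-4g.40gb", "nvidia.com/mig-2g.20gb", "nvidia.com/mig-1g.10gb"]))  # True
print(may_fit(["nvidia.com/mig-4g.40gb", "nvidia.com/mig-4g.40gb"]))  # False
```

Two 4g.40gb instances fit the memory budget but need 8 compute slices, one more than the GPU has, so the combination is rejected.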

Scheduling GPU Workloads

Best practices for scheduling:

Workload type	Strategy
Training (large)	One pod per GPU, each requesting a full GPU. Scale out to multiple GPUs with NCCL across pods, coordinated by an operator such as Kubeflow or Ray.
Inference (latency)	Dedicated GPU node pool. Request a full GPU per replica for predictable latency. Autoscale with the HorizontalPodAutoscaler (HPA) on custom metrics such as queue depth.
Inference (batch)	Time-slicing or MIG to share a GPU across replicas. Cheaper, but with higher latency variance.
Notebooks (interactive)	Time-slicing or MIG slices. Idle notebooks waste expensive GPU time, so enforce resource quotas and idle timeouts.

Monitoring GPU Utilisation

DCGM Exporter — GPU metrics for Prometheus

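# NVIDIA DCGM Exporter exposes GPU metrics to Prometheus
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace kube-system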

# Key metrics:
# DCGM_FI_DEV_GPU_UTIL          — GPU compute utilisation (%)
# DCGM_FI_DEV_MEM_COPY_UTIL     — memory bandwidth utilisation (%)
# DCGM_FI_DEV_FB_USED           — framebuffer (VRAM) used (MiB)
# DCGM_FI_DEV_POWER_USAGE       — power draw (W)
# DCGM_FI_DEV_SM_CLOCK          — SM clock frequency (MHz)

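# Alert: GPU sitting idle while pod is running (wasted money)
# Join on namespace and pod so same-named pods in different namespaces don't match.
# Label names depend on your scrape config (they may appear as exported_namespace / exported_pod).
alert: GPUUnderutilised
expr: DCGM_FI_DEV_GPU_UTIL < 10 and on(namespace, pod) kube_pod_status_phase{phase="Running"} == 1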
for: 30m
labels:
  severity: warning
annotations:
  summary: "GPU allocated but utilisation < 10% for 30 minutes"
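
The for: 30m clause means the expression has to stay true for the whole window before the alert fires, so a single burst of activity resets the clock. The Python sketch below approximates that behaviour on a list of utilisation samples. It is a simplified model rather than Prometheus's evaluator, and the function name would_fire and the sample format are assumptions.

```python
def would_fire(samples, threshold=10.0, hold_seconds=30 * 60):
    """Approximate `expr < threshold` combined with `for: 30m`.

    `samples` is a time-ordered list of (unix_seconds, utilisation_percent).
    Returns True once utilisation has stayed below the threshold continuously
    for at least `hold_seconds`.
    """
    below_since = None
    for timestamp, utilisation in samples:
        if utilisation < threshold:
            if below_since is None:
                below_since = timestamp  # start of the current idle stretch
            if timestamp - below_since >= hold_seconds:
                return True
        else:
            below_since = None  # any busy sample resets the window
    return False


# One sample per minute for 30 minutes, GPU at 3% throughout.
idle = [(minute * 60, 3.0) for minute in range(31)]
# Same, but with one busy sample at minute 15.
spiky = [(minute * 60, 80.0 if minute == 15 else 3.0) for minute in range(31)]

print(would_fire(idle))   # True
print(would_fire(spiky))  # False
```

Workloads that poll a queue and touch the GPU briefly every few minutes will therefore never trip this alert; a rule over an averaged utilisation is one way to catch them.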

GPU Container Images

Registry	Image	Use
NGC	`nvcr.io/nvidia/pytorch:24.01-py3`	PyTorch training — includes CUDA, cuDNN, NCCL
NGC	`nvcr.io/nvidia/tensorflow:24.01-tf2-py3`	TensorFlow training
NGC	`nvcr.io/nvidia/tritonserver:24.01-py3`	Production inference server — multi-framework
Docker Hub	`nvidia/cuda:12.3.2-runtime-ubuntu22.04`	Minimal CUDA runtime — build your own image on top

💡

Pin CUDA driver version

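The CUDA version in the container must be ≤ the highest CUDA version supported by the GPU driver on the node. Drivers are backward compatible, so the usual failure runs the other way: an image built on a newer CUDA toolkit lands on a node whose driver is too old. Pin container image versions and node driver versions together, and upgrade node drivers before rolling out images that need a newer CUDA release.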
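
The compatibility rule is a version comparison, which can be added to a deployment check. The Python sketch below compares major and minor versions only. It is a simplification under stated assumptions: the function names are hypothetical, the driver's maximum supported CUDA version is taken as an input (nvidia-smi reports it in its header), and CUDA's minor-version and forward-compatibility mechanisms are ignored.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '12.3.2' into (12, 3, 2)."""
    return tuple(int(part) for part in version.split("."))


def image_runs_on_node(image_cuda: str, driver_max_cuda: str) -> bool:
    """True if the image's CUDA toolkit is not newer than the driver supports.

    Compares major.minor only; patch versions are ignored.
    """
    return parse_version(image_cuda)[:2] <= parse_version(driver_max_cuda)[:2]


print(image_runs_on_node("12.3.2", "12.4"))  # True
print(image_runs_on_node("12.3.2", "12.2"))  # False
print(image_runs_on_node("11.8", "12.2"))    # True
```

The third case shows the backward-compatible direction: an older toolkit keeps working after the node driver is upgraded.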