4. NVIDIA-GPU

4.1. Cluster Configuration

Note

This cluster configuration applies only to administrators; regular users do not need to follow this step.

Enable the nvidia-container-toolkit (previously nvidia-docker2) by editing /etc/docker/daemon.json to make nvidia the default runtime, since Kubernetes does not support Docker's --gpus option:

https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#install-nvidia-container-toolkit-previously-nvidia-docker2

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
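
After editing daemon.json, restart the Docker daemon so the new default runtime takes effect (assuming a systemd-based host):

sudo systemctl restart docker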

Then install the NVIDIA device plugin via Helm:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
 && helm repo update \
 && helm install --generate-name nvdp/nvidia-device-plugin
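
Once the device plugin DaemonSet is running, GPU nodes should advertise the nvidia.com/gpu resource in their capacity. A quick sanity check:

kubectl describe nodes | grep nvidia.com/gpu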

4.2. Pod Specification
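
Pods request GPUs through the extended resource nvidia.com/gpu under resources.limits; the scheduler then places the pod on a node with an unallocated GPU. Note that GPUs can only be specified in limits (requests, if given, must equal limits). A minimal container spec fragment, with illustrative names:

containers:
- name: gpu-app
  image: nvidia/cuda:11.0-base
  resources:
    limits:
      nvidia.com/gpu: 1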

4.3. Test Pods

Test pod for CUDA functionality

Manifest example

apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda10.2"
    resources:
      limits:
        nvidia.com/gpu: 1
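
To run the test, apply the manifest and check the pod logs; the vectoradd sample should report a passed test (the filename here is illustrative):

kubectl apply -f gpu-operator-test.yaml
kubectl logs gpu-operator-test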

Test job for nvidia-smi

Manifest example

apiVersion: batch/v1
kind: Job
metadata:
  name: smi
spec:
  template:
    spec:
      containers:
      - name: smi
        image: docker.io/nvidia/cuda:11.0-base
        command: ['nvidia-smi']
      restartPolicy: OnFailure
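
Apply the job and read its logs; the output should be the familiar nvidia-smi table showing the driver version and the GPU visible to the container (the filename is illustrative):

kubectl apply -f smi-job.yaml
kubectl logs job/smi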