4. NVIDIA-GPU

4.1. Cluster Configuration

Note

This cluster configuration applies only to administrators; regular users do not need to follow this step.

Enable the nvidia-container-toolkit (previously nvidia-docker2) by editing /etc/docker/daemon.json to make nvidia the default runtime, since Kubernetes does not support Docker's --gpus option:

https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#install-nvidia-container-toolkit-previously-nvidia-docker2

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
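
After editing daemon.json, restart the Docker daemon so the new default runtime takes effect (assuming a systemd-based host):

sudo systemctl restart docker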

Then install the NVIDIA device plugin via Helm:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
 && helm repo update \
 && helm install --generate-name nvdp/nvidia-device-plugin
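
Once the device plugin DaemonSet is running, GPU nodes should advertise the nvidia.com/gpu resource in their capacity. A quick sanity check:

kubectl describe nodes | grep nvidia.com/gpu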

4.2. Pod Specification
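
Pods request GPUs through the extended resource nvidia.com/gpu under resources.limits; the scheduler then places the pod on a node with an unallocated GPU. Note that GPUs can only be specified in limits (requests, if given, must equal limits). A minimal container spec fragment, with illustrative names:

containers:
- name: gpu-app
  image: nvidia/cuda:11.0-base
  resources:
    limits:
      nvidia.com/gpu: 1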

4.3. Test Pods

Test pod for CUDA functionality

Manifest example

apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda10.2"
    resources:
      limits:
        nvidia.com/gpu: 1
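
To run the test, apply the manifest and check the pod logs; the vectoradd sample should report a passed test (the filename here is illustrative):

kubectl apply -f gpu-operator-test.yaml
kubectl logs gpu-operator-test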

Test job for nvidia-smi

Manifest example

apiVersion: batch/v1
kind: Job
metadata:
  name: smi
spec:
  template:
    spec:
      containers:
      - name: smi
        image: docker.io/nvidia/cuda:11.0-base
        command: ['nvidia-smi']
      restartPolicy: OnFailure
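
Apply the job and read its logs; the output should be the familiar nvidia-smi table showing the driver version and the GPU visible to the container (the filename is illustrative):

kubectl apply -f smi-job.yaml
kubectl logs job/smi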