nvidia: Update addon helm charts to the latest versions #342

Merged: 1 commit merged into canonical:main on Apr 14, 2025

Conversation

@claudiubelu (Contributor) commented on Apr 10, 2025:

Updates the default version of the GPU operator to "v25.3.0". Updates the default version of the network operator to "25.1.0".

Also cleans up the CRDs installed by the Helm charts when the addon is disabled.

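For context, the cleanup boils down to deleting the CRDs that helm uninstall leaves behind. A minimal sketch of the idea, using the six CRD names that appear in the disable output further down this thread (the actual disable hook in the addon may be structured differently):

# Sketch only: remove the CRDs left behind by the GPU and network operator charts.
microk8s kubectl delete crd \
    clusterpolicies.nvidia.com \
    nvidiadrivers.nvidia.com \
    hostdevicenetworks.mellanox.com \
    ipoibnetworks.mellanox.com \
    macvlannetworks.mellanox.com \
    nicclusterpolicies.mellanox.com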
@claudiubelu (Contributor, author) commented:

Enable the addon:

sudo snap install microk8s --classic --channel=1.32/stable
microk8s (1.32/stable) v1.32.3 from Canonical✓ installed

sudo microk8s status --wait-ready --timeout 600
microk8s is running
high-availability: no
  datastore master nodes: 127.0.0.1:19001
  datastore standby nodes: none
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
  disabled:
    cert-manager         # (core) Cloud native certificate management
    cis-hardening        # (core) Apply CIS K8s hardening
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    gpu                  # (core) Alias to nvidia add-on
    host-access          # (core) Allow Pods connecting to Host services smoothly
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    ingress              # (core) Ingress controller for external access
    kube-ovn             # (core) An advanced network fabric for Kubernetes
    mayastor             # (core) OpenEBS MayaStor
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    minio                # (core) MinIO object storage
    nvidia               # (core) NVIDIA hardware (GPU and network) support
    observability        # (core) A lightweight observability stack for logs, traces and metrics
    prometheus           # (core) Prometheus operator for monitoring and logging
    rbac                 # (core) Role-Based Access Control for authorisation
    registry             # (core) Private image registry exposed on localhost:32000
    rook-ceph            # (core) Distributed Ceph storage using Rook
    storage              # (core) Alias to hostpath-storage add-on, deprecated

sudo microk8s addons repo remove core
Removing /var/snap/microk8s/common/addons/core

sudo microk8s addons repo add core .
Cloning into '/var/snap/microk8s/common/addons/core'...
done.

sudo microk8s enable nvidia --network-operator
Infer repository core for addon nvidia
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
"nvidia" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Deploy NVIDIA GPU operator
Using operator GPU driver
W0410 13:12:50.764125 1241577 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W0410 13:12:50.772073 1241577 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
NAME: gpu-operator
LAST DEPLOYED: Thu Apr 10 13:12:49 2025
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
Deployed NVIDIA GPU operator
Deploy NVIDIA Network operator
WARNING: Extra configuration might be needed for network-operator
Please refer to the docs for more details
W0410 13:12:54.188336 1241745 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W0410 13:13:27.299517 1241745 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W0410 13:13:31.867921 1241745 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W0410 13:13:31.907226 1241745 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
NAME: network-operator
LAST DEPLOYED: Thu Apr 10 13:12:52 2025
NAMESPACE: nvidia-network-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get Network Operator deployed resources by running the following commands:

$ kubectl -n nvidia-network-operator get pods
Deployed NVIDIA Network operator

Pods:

microk8s.kubectl -n gpu-operator-resources get all -o wide
NAME                                                             READY   STATUS    RESTARTS   AGE     IP             NODE     NOMINATED NODE   READINESS GATES
pod/gpu-operator-666bbffcd-rx4zh                                 1/1     Running   0          3m25s   10.1.243.198   ubuntu   <none>           <none>
pod/gpu-operator-node-feature-discovery-gc-7c7f68d5f4-hb56l      1/1     Running   0          3m25s   10.1.243.196   ubuntu   <none>           <none>
pod/gpu-operator-node-feature-discovery-master-57dc4d868-5ksdm   1/1     Running   0          3m25s   10.1.243.197   ubuntu   <none>           <none>
pod/gpu-operator-node-feature-discovery-worker-4cpl7             1/1     Running   0          3m25s   10.1.243.195   ubuntu   <none>           <none>

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE     SELECTOR
service/gpu-operator           ClusterIP   10.152.183.232   <none>        8080/TCP   2m36s   app=gpu-operator
service/nvidia-dcgm-exporter   ClusterIP   10.152.183.165   <none>        9400/TCP   2m36s   app=nvidia-dcgm-exporter

NAME                                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE     CONTAINERS   IMAGES                                               SELECTOR
daemonset.apps/gpu-operator-node-feature-discovery-worker   1         1         1       1            1           <none>          3m25s   worker       registry.k8s.io/nfd/node-feature-discovery:v0.17.2   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=worker

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS     IMAGES                                               SELECTOR
deployment.apps/gpu-operator                                 1/1     1            1           3m25s   gpu-operator   nvcr.io/nvidia/gpu-operator:v25.3.0                  app=gpu-operator,app.kubernetes.io/component=gpu-operator
deployment.apps/gpu-operator-node-feature-discovery-gc       1/1     1            1           3m25s   gc             registry.k8s.io/nfd/node-feature-discovery:v0.17.2   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=gc
deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           3m25s   master         registry.k8s.io/nfd/node-feature-discovery:v0.17.2   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=master

NAME                                                                   DESIRED   CURRENT   READY   AGE     CONTAINERS     IMAGES                                               SELECTOR
replicaset.apps/gpu-operator-666bbffcd                                 1         1         1       3m25s   gpu-operator   nvcr.io/nvidia/gpu-operator:v25.3.0                  app=gpu-operator,app.kubernetes.io/component=gpu-operator,pod-template-hash=666bbffcd
replicaset.apps/gpu-operator-node-feature-discovery-gc-7c7f68d5f4      1         1         1       3m25s   gc             registry.k8s.io/nfd/node-feature-discovery:v0.17.2   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,pod-template-hash=7c7f68d5f4,role=gc
replicaset.apps/gpu-operator-node-feature-discovery-master-57dc4d868   1         1         1       3m25s   master         registry.k8s.io/nfd/node-feature-discovery:v0.17.2   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,pod-template-hash=57dc4d868,role=master


microk8s.kubectl -n nvidia-network-operator get all -o wide
NAME                                                                  READY   STATUS    RESTARTS   AGE     IP             NODE     NOMINATED NODE   READINESS GATES
pod/network-operator-6d5b756846-shwqn                                 1/1     Running   0          3m13s   10.1.243.203   ubuntu   <none>           <none>
pod/network-operator-node-feature-discovery-gc-5549bd5db-d4t9s        1/1     Running   0          3m13s   10.1.243.202   ubuntu   <none>           <none>
pod/network-operator-node-feature-discovery-master-865bfff66d-zxp7s   1/1     Running   0          3m13s   10.1.243.204   ubuntu   <none>           <none>
pod/network-operator-node-feature-discovery-worker-sqg99              1/1     Running   0          3m13s   10.1.243.201   ubuntu   <none>           <none>

NAME                                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE     CONTAINERS   IMAGES                                               SELECTOR
daemonset.apps/network-operator-node-feature-discovery-worker   1         1         1       1            1           <none>          3m13s   worker       registry.k8s.io/nfd/node-feature-discovery:v0.17.0   app.kubernetes.io/instance=network-operator,app.kubernetes.io/name=node-feature-discovery,role=worker

NAME                                                             READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS         IMAGES                                                 SELECTOR
deployment.apps/network-operator                                 1/1     1            1           3m13s   network-operator   nvcr.io/nvidia/cloud-native/network-operator:v25.1.0   app.kubernetes.io/instance=network-operator,app.kubernetes.io/name=network-operator
deployment.apps/network-operator-node-feature-discovery-gc       1/1     1            1           3m13s   gc                 registry.k8s.io/nfd/node-feature-discovery:v0.17.0     app.kubernetes.io/instance=network-operator,app.kubernetes.io/name=node-feature-discovery,role=gc
deployment.apps/network-operator-node-feature-discovery-master   1/1     1            1           3m13s   master             registry.k8s.io/nfd/node-feature-discovery:v0.17.0     app.kubernetes.io/instance=network-operator,app.kubernetes.io/name=node-feature-discovery,role=master

NAME                                                                        DESIRED   CURRENT   READY   AGE     CONTAINERS         IMAGES                                                 SELECTOR
replicaset.apps/network-operator-6d5b756846                                 1         1         1       3m13s   network-operator   nvcr.io/nvidia/cloud-native/network-operator:v25.1.0   app.kubernetes.io/instance=network-operator,app.kubernetes.io/name=network-operator,pod-template-hash=6d5b756846
replicaset.apps/network-operator-node-feature-discovery-gc-5549bd5db        1         1         1       3m13s   gc                 registry.k8s.io/nfd/node-feature-discovery:v0.17.0     app.kubernetes.io/instance=network-operator,app.kubernetes.io/name=node-feature-discovery,pod-template-hash=5549bd5db,role=gc
replicaset.apps/network-operator-node-feature-discovery-master-865bfff66d   1         1         1       3m13s   master             registry.k8s.io/nfd/node-feature-discovery:v0.17.0     app.kubernetes.io/instance=network-operator,app.kubernetes.io/name=node-feature-discovery,pod-template-hash=865bfff66d,role=master

Disabling the addon:

sudo microk8s disable nvidia
Infer repository core for addon nvidia
Disabling NVIDIA support
W0410 13:17:11.982359 1251675 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
release "gpu-operator" uninstalled
W0410 13:17:17.299840 1252653 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
release "network-operator" uninstalled
customresourcedefinition.apiextensions.k8s.io "clusterpolicies.nvidia.com" deleted
customresourcedefinition.apiextensions.k8s.io "nvidiadrivers.nvidia.com" deleted
customresourcedefinition.apiextensions.k8s.io "hostdevicenetworks.mellanox.com" deleted
customresourcedefinition.apiextensions.k8s.io "ipoibnetworks.mellanox.com" deleted
customresourcedefinition.apiextensions.k8s.io "macvlannetworks.mellanox.com" deleted
customresourcedefinition.apiextensions.k8s.io "nicclusterpolicies.mellanox.com" deleted
NVIDIA support disabled
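
A quick way to confirm the new CRD cleanup, assuming the six CRDs above are the full set the two charts install:

# Expect no output once the addon has been disabled.
microk8s kubectl get crd | grep -E 'nvidia.com|mellanox.com'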

@claudiubelu marked this pull request as ready for review on April 10, 2025 at 14:34
@berkayoz (Member) left a comment:

LGTM.

Listing all resources in the gpu-operator-resources namespace:

ubuntu@ip-172-31-52-18:~$ sudo microk8s kubectl get all -n gpu-operator-resources -o wide
NAME                                                             READY   STATUS      RESTARTS   AGE   IP            NODE              NOMINATED NODE   READINESS GATES
pod/gpu-feature-discovery-pl2t9                                  1/1     Running     0          20m   10.1.165.12   ip-172-31-52-18   <none>           <none>
pod/gpu-operator-666bbffcd-wfnxn                                 1/1     Running     0          21m   10.1.165.4    ip-172-31-52-18   <none>           <none>
pod/gpu-operator-node-feature-discovery-gc-7c7f68d5f4-kj9cm      1/1     Running     0          21m   10.1.165.5    ip-172-31-52-18   <none>           <none>
pod/gpu-operator-node-feature-discovery-master-57dc4d868-2vlfg   1/1     Running     0          21m   10.1.165.6    ip-172-31-52-18   <none>           <none>
pod/gpu-operator-node-feature-discovery-worker-8h8bl             1/1     Running     0          21m   10.1.165.3    ip-172-31-52-18   <none>           <none>
pod/nvidia-container-toolkit-daemonset-ghcx2                     1/1     Running     0          20m   10.1.165.9    ip-172-31-52-18   <none>           <none>
pod/nvidia-cuda-validator-hsh9g                                  0/1     Completed   0          16m   10.1.165.14   ip-172-31-52-18   <none>           <none>
pod/nvidia-dcgm-exporter-5lsgk                                   1/1     Running     0          20m   10.1.165.10   ip-172-31-52-18   <none>           <none>
pod/nvidia-device-plugin-daemonset-ppbxt                         1/1     Running     0          20m   10.1.165.13   ip-172-31-52-18   <none>           <none>
pod/nvidia-driver-daemonset-8b2mr                                1/1     Running     0          21m   10.1.165.7    ip-172-31-52-18   <none>           <none>
pod/nvidia-operator-validator-5xnht                              1/1     Running     0          20m   10.1.165.11   ip-172-31-52-18   <none>           <none>

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE   SELECTOR
service/gpu-operator           ClusterIP   10.152.183.183   <none>        8080/TCP   21m   app=gpu-operator
service/nvidia-dcgm-exporter   ClusterIP   10.152.183.154   <none>        9400/TCP   21m   app=nvidia-dcgm-exporter

NAME                                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                          AGE   CONTAINERS                     IMAGES                                                            SELECTOR
daemonset.apps/gpu-feature-discovery                        1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true                       21m   gpu-feature-discovery          nvcr.io/nvidia/k8s-device-plugin:v0.17.1                          app=gpu-feature-discovery,app.kubernetes.io/part-of=nvidia-gpu
daemonset.apps/gpu-operator-node-feature-discovery-worker   1         1         1       1            1           <none>                                                                 21m   worker                         registry.k8s.io/nfd/node-feature-discovery:v0.17.2                app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=worker
daemonset.apps/nvidia-container-toolkit-daemonset           1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true                           21m   nvidia-container-toolkit-ctr   nvcr.io/nvidia/k8s/container-toolkit:v1.17.5-ubuntu20.04          app=nvidia-container-toolkit-daemonset
daemonset.apps/nvidia-dcgm-exporter                         1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true                               21m   nvidia-dcgm-exporter           nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.4-ubuntu22.04          app=nvidia-dcgm-exporter
daemonset.apps/nvidia-device-plugin-daemonset               1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true                               21m   nvidia-device-plugin           nvcr.io/nvidia/k8s-device-plugin:v0.17.1                          app=nvidia-device-plugin-daemonset
daemonset.apps/nvidia-device-plugin-mps-control-daemon      0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true   21m   mps-control-daemon-ctr         nvcr.io/nvidia/k8s-device-plugin:v0.17.1                          app=nvidia-device-plugin-mps-control-daemon
daemonset.apps/nvidia-driver-daemonset                      1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                                      21m   nvidia-driver-ctr              nvcr.io/nvidia/driver:570.124.06-ubuntu24.04                      app=nvidia-driver-daemonset
daemonset.apps/nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                 21m   nvidia-mig-manager             nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.12.1-ubuntu20.04   app=nvidia-mig-manager
daemonset.apps/nvidia-operator-validator                    1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true                          21m   nvidia-operator-validator      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0        app=nvidia-operator-validator,app.kubernetes.io/part-of=gpu-operator

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS     IMAGES                                               SELECTOR
deployment.apps/gpu-operator                                 1/1     1            1           21m   gpu-operator   nvcr.io/nvidia/gpu-operator:v25.3.0                  app=gpu-operator,app.kubernetes.io/component=gpu-operator
deployment.apps/gpu-operator-node-feature-discovery-gc       1/1     1            1           21m   gc             registry.k8s.io/nfd/node-feature-discovery:v0.17.2   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=gc
deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           21m   master         registry.k8s.io/nfd/node-feature-discovery:v0.17.2   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=master

NAME                                                                   DESIRED   CURRENT   READY   AGE   CONTAINERS     IMAGES                                               SELECTOR
replicaset.apps/gpu-operator-666bbffcd                                 1         1         1       21m   gpu-operator   nvcr.io/nvidia/gpu-operator:v25.3.0                  app=gpu-operator,app.kubernetes.io/component=gpu-operator,pod-template-hash=666bbffcd
replicaset.apps/gpu-operator-node-feature-discovery-gc-7c7f68d5f4      1         1         1       21m   gc             registry.k8s.io/nfd/node-feature-discovery:v0.17.2   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,pod-template-hash=7c7f68d5f4,role=gc
replicaset.apps/gpu-operator-node-feature-discovery-master-57dc4d868   1         1         1       21m   master         registry.k8s.io/nfd/node-feature-discovery:v0.17.2   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,pod-template-hash=57dc4d868,role=master

CUDA vector-add test:
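
The logs below come from a CUDA vectorAdd test pod. A minimal manifest for such a pod might look like this (the image is an assumption, not taken from this PR; any CUDA vectorAdd sample image requesting nvidia.com/gpu would do):

# Hypothetical test pod; the image name/tag are illustrative.
microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF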

ubuntu@ip-172-31-52-18:~$ sudo microk8s kubectl logs pod/cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

@berkayoz merged commit 91faf85 into canonical:main on Apr 14, 2025
4 checks passed