
Commit 4f10463

Merge pull request NVIDIA#416 from jgehrcke/jp/readme

Update README to reflect current state of project

2 parents: 69a2e71 + 9ed3f1d

File tree

2 files changed: +78 additions, −149 deletions


CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -1,6 +1,6 @@
-# Contribute to the NVIDIA Kubernetes DRA Driver
+# Contribute to the NVIDIA DRA Driver for GPUs
 
-Want to hack on the NVIDIA Kubernetes DRA Driver Project? Awesome!
+Want to hack on the NVIDIA DRA Driver for GPUs? Awesome!
 We only require you to sign your work, the below section describes this!
 
 ## Sign your work
```

README.md

Lines changed: 76 additions & 147 deletions
````diff
@@ -1,178 +1,107 @@
-# Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
-
-This DRA resource driver is currently under active development and not yet
-designed for production use.
-We may (at times) decide to push commits over `main` until we have something more stable.
-Use at your own risk.
-
-A document and demo of the DRA support for GPUs provided by this repo can be found below:
-| Document | Demo |
-|:---:|:---:|
-| [<img width="300" alt="Dynamic Resource Allocation (DRA) for GPUs in Kubernetes" src="https://drive.google.com/uc?export=download&id=12EwdvHHI92FucRO2tuIqLR33OC8MwCQK">](https://docs.google.com/document/d/1BNWqgx_SmZDi-va_V31v3DnuVwYnF2EmN7D-O_fB6Oo) | [<img width="300" alt="Demo of Dynamic Resource Allocation (DRA) for GPUs in Kubernetes" src="https://drive.google.com/uc?export=download&id=1UzB-EBEVwUTRF7R0YXbGe9hvTjuKaBlm">](https://drive.google.com/file/d/1iLg2FEAEilb1dcI27TnB19VYtbcvgKhS/view?usp=sharing "Demo of Dynamic Resource Allocation (DRA) for GPUs in Kubernetes") |
-
-## Demo
-
-This section describes using `kind` to demo the functionality of the NVIDIA GPU DRA Driver.
-
-First since we'll launch kind with GPU support, ensure that the following prerequisites are met:
-1. `kind` is installed. See the official documentation [here](https://kind.sigs.k8s.io/docs/user/quick-start/#installation).
-1. Ensure that the NVIDIA Container Toolkit is installed on your system. This
-   can be done by following the instructions
-   [here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
-1. Configure the NVIDIA Container Runtime as the **default** Docker runtime:
-   ```console
-   sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
-   ```
-1. Restart Docker to apply the changes:
-   ```console
-   sudo systemctl restart docker
-   ```
-1. Set the `accept-nvidia-visible-devices-as-volume-mounts` option to `true` in
-   the `/etc/nvidia-container-runtime/config.toml` file to configure the NVIDIA
-   Container Runtime to use volume mounts to select devices to inject into a
-   container.
-   ```console
-   sudo nvidia-ctk config --in-place --set accept-nvidia-visible-devices-as-volume-mounts=true
-   ```
-
-1. Show the current set of GPUs on the machine:
-   ```console
-   nvidia-smi -L
-   ```
-
-We start by first cloning this repository and `cd`ing into it.
-All of the scripts and example Pod specs used in this demo are in the `demo`
-subdirectory, so take a moment to browse through the various files and see
-what's available:
+# NVIDIA DRA Driver for GPUs
 
-```console
-git clone https://github.com/NVIDIA/k8s-dra-driver-gpu.git
-```
-```console
-cd k8s-dra-driver-gpu
-```
+Enables
 
-### Setting up the infrastructure
+* flexible and powerful allocation and dynamic reconfiguration of GPUs as well as
+* allocation of ComputeDomains for robust and secure Multi-Node NVLink.
 
-First, create a `kind` cluster to run the demo:
-```bash
-./demo/clusters/kind/create-cluster.sh
-```
+For Kubernetes 1.32 or newer, with Dynamic Resource Allocation (DRA) [enabled](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#enabling-dynamic-resource-allocation).
 
-From here we will build the image for the example resource driver:
-```console
-./demo/clusters/kind/build-dra-driver-gpu.sh
-```
+## Overview
 
-This also makes the built images available to the `kind` cluster.
+DRA is a novel concept in Kubernetes for flexibly requesting, configuring, and sharing specialized devices like GPUs.
+To learn more about DRA in general, good starting points are: [Kubernetes docs](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/), [GKE docs](https://cloud.google.com/kubernetes-engine/docs/concepts/about-dynamic-resource-allocation), [Kubernetes blog](https://kubernetes.io/blog/2025/05/01/kubernetes-v1-33-dra-updates/).
 
-We now install the NVIDIA GPU DRA driver:
-```console
-./demo/clusters/kind/install-dra-driver-gpu.sh
-```
+Most importantly, DRA puts resource configuration and scheduling in the hands of 3rd-party vendors.
 
-This should show two pods running in the `nvidia-dra-driver-gpu` namespace:
-```console
-kubectl get pods -n nvidia-dra-driver-gpu
-```
-```
-$ kubectl get pods -n nvidia-dra-driver-gpu
-NAME                                                READY   STATUS    RESTARTS   AGE
-nvidia-dra-driver-gpu-controller-697898fc6b-g85zx   1/1     Running   0          40s
-nvidia-dra-driver-gpu-kubelet-plugin-kkwf7          2/2     Running   0          40s
-```
+The NVIDIA DRA Driver for GPUs manages two types of resources: **GPUs** and **ComputeDomains**. Correspondingly, it contains two DRA kubelet plugins: [gpu-kubelet-plugin](https://github.com/NVIDIA/k8s-dra-driver-gpu/tree/main/cmd/gpu-kubelet-plugin), [compute-domain-kubelet-plugin](https://github.com/NVIDIA/k8s-dra-driver-gpu/tree/main/cmd/compute-domain-kubelet-plugin). Upon driver installation, each of these two parts can be enabled or disabled separately.
 
-### Run the examples by following the steps in the demo script
-Finally, you can run the various examples contained in the `demo/specs/quickstart` folder.
-With the most recent updates for Kubernetes v1.31, only the first 3 examples in
-this folder are currently functional.
+The two sections below provide a brief overview for each of these two parts of this DRA driver.
 
-You can run them as follows:
-```console
-kubectl apply --filename=demo/specs/quickstart/gpu-test{1,2,3}.yaml
-```
+### `ComputeDomain`s
 
-Get the pods' statuses. Depending on which GPUs are available, running the first three examples will produce output similar to the following...
+An abstraction for robust and secure Multi-Node NVLink (MNNVL). Officially supported.
+
+An individual `ComputeDomain` (CD) guarantees MNNVL-reachability between pods that are _in_ the CD, and secure isolation from other pods that are _not in_ the CD.
+
+In terms of placement, a CD follows the workload. In terms of lifetime, a CD is ephemeral: its lifetime is bound to the lifetime of the consuming workload.
+For more background on how `ComputeDomain`s facilitate orchestrating MNNVL workloads on Kubernetes (and on NVIDIA GB200 systems in particular), see [this](https://docs.google.com/document/d/1PrdDofsPFVJuZvcv-vtlI9n2eAh-YVf_fRQLIVmDwVY/edit?tab=t.0#heading=h.qkogm924v5so) doc and [this](https://docs.google.com/presentation/d/1Xupr8IZVAjs5bNFKJnYaK0LE7QWETnJjkz6KOfLu87E/edit?pli=1&slide=id.g28ac369118f_0_1647#slide=id.g28ac369118f_0_1647) slide deck.
+For an outlook and specific plans for improvements, please refer to [these](https://github.com/NVIDIA/k8s-dra-driver-gpu/releases/tag/v25.3.0-rc.3) release notes.
+
+If you've heard about IMEX: this DRA driver orchestrates IMEX primitives (daemons, domains, channels) under the hood.
````
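For illustration of the new `ComputeDomain` concept described above, a request might be sketched roughly as below. This is a hand-written approximation, not content from this commit: the API group, kind, and field names (`numNodes`, the channel's `ResourceClaimTemplate`) are assumptions and should be checked against the driver's current CRD documentation.

```yaml
# Hypothetical sketch: a ComputeDomain, plus the name of the
# ResourceClaimTemplate through which workload pods would join it.
# All names and fields here are assumptions, not taken from this commit.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: example-compute-domain
spec:
  numNodes: 2                # nodes expected to join this domain
  channel:
    resourceClaimTemplate:
      name: example-compute-domain-channel
```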
````diff
+
+### `GPU`s
+
+The GPU allocation side of this DRA driver [will enable powerful features](https://docs.google.com/document/d/1BNWqgx_SmZDi-va_V31v3DnuVwYnF2EmN7D-O_fB6Oo) (such as dynamic allocation of MIG devices).
+To learn about what we're planning to build, please have a look at [these](https://github.com/NVIDIA/k8s-dra-driver-gpu/releases/tag/v25.3.0-rc.3) release notes.
+
+While some GPU allocation features can be tried out, they are not yet officially supported.
+Hence, the GPU kubelet plugin is currently disabled by default in the Helm chart installation.
+
+For exploration and demonstration purposes, see the "demo" section below, and also browse the `demo/specs/quickstart` directory in this repository.
````
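As a sketch of what requesting a GPU through DRA looks like, the following follows the upstream `resource.k8s.io` API shape. The `gpu.nvidia.com` device class name and the `single-gpu` name are assumptions for illustration; the quickstart specs in `demo/specs/quickstart` are the authoritative form.

```yaml
# Sketch: a ResourceClaimTemplate asking for one GPU via a DRA device
# class. The device class name is an assumption, not copied from this
# commit; compare against the demo/specs/quickstart examples.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
```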
````diff
+
+## Installation
+
+As of today, the recommended installation method is via Helm.
+Detailed instructions can (for now) be found [here](https://github.com/NVIDIA/k8s-dra-driver-gpu/discussions/249).
+In the future, this driver will be included in the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) and will no longer need to be installed separately.
+
+## A (kind) demo
+
+Below, we demonstrate a basic use case: sharing a single GPU across two containers running in the same Kubernetes pod.
+
+**Step 1: install dependencies**
+
+Running this demo requires
+* kind (follow the official [installation docs](https://kind.sigs.k8s.io/docs/user/quick-start/#installation))
+* NVIDIA Container Toolkit & Runtime (follow a [previous version](https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/5a4717f1ea613ad47bafccb467582bf2425f20f1/README.md#demo) of this readme for setup instructions)
+
+**Step 2: create kind cluster with the DRA driver installed**
+
+Start by cloning this repository, and `cd`ing into it:
 
-**Note:** there is a [known issue with kind](https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files). You may see an error while trying to tail the log of a running pod in the kind cluster: `failed to create fsnotify watcher: too many open files.` The issue may be resolved by increasing the value for `fs.inotify.max_user_watches`.
-```console
-kubectl get pod -A -l app=pod
-```
-```
-NAMESPACE   NAME   READY   STATUS    RESTARTS   AGE
-gpu-test1   pod1   1/1     Running   0          34s
-gpu-test1   pod2   1/1     Running   0          34s
-gpu-test2   pod    2/2     Running   0          34s
-gpu-test3   pod1   1/1     Running   0          34s
-gpu-test3   pod2   1/1     Running   0          34s
-```
-```console
-kubectl logs -n gpu-test1 -l app=pod
-```
-```
-GPU 0: A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
-GPU 0: A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
-```
 ```console
-kubectl logs -n gpu-test2 pod --all-containers
-```
-```
-GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
-GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
+git clone https://github.com/NVIDIA/k8s-dra-driver-gpu.git
+cd k8s-dra-driver-gpu
 ```
 
+Next up, build this driver's container image and create a kind-based Kubernetes cluster:
+
 ```console
-kubectl logs -n gpu-test3 -l app=pod
-```
-```
-GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
-GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
+export KIND_CLUSTER_NAME="kind-dra-1"
+./demo/clusters/kind/build-dra-driver-gpu.sh
+./demo/clusters/kind/create-cluster.sh
 ```
 
-### Cleaning up the environment
+Now you can install the DRA driver's Helm chart into the Kubernetes cluster:
 
-Remove the cluster created in the preceding steps:
 ```console
-./demo/clusters/kind/delete-cluster.sh
+./demo/clusters/kind/install-dra-driver-gpu.sh
 ```
 
-<!--
-TODO: This README should be extended with additional content including:
+**Step 3: run workload**
 
-## Information for "real" deployment including prerequesites
+Submit workload:
 
-This may include the following content from the original scripts:
+```console
+kubectl apply -f ./demo/specs/quickstart/gpu-test2.yaml
 ```
-set -e
 
-export VERSION=v25.2.0
+If you're curious, have a look at [the `ResourceClaimTemplate`](https://github.com/jgehrcke/k8s-dra-driver-gpu/blob/526130fbaa3c8f5b1f6dcfd9ef01c9bdd5c229fe/demo/specs/quickstart/gpu-test2.yaml#L12) definition in this spec, and how the corresponding _single_ `ResourceClaim` is [being referenced](https://github.com/jgehrcke/k8s-dra-driver-gpu/blob/526130fbaa3c8f5b1f6dcfd9ef01c9bdd5c229fe/demo/specs/quickstart/gpu-test2.yaml#L46) by both containers.
````
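The sharing pattern described there can be sketched roughly as follows. This is a hand-written approximation of the gpu-test2 idea, not the file from the repository: image, command, and all names are illustrative, and the referenced `single-gpu` template is an assumed name.

```yaml
# Rough approximation of the gpu-test2 pattern: one pod whose two
# containers reference the same generated ResourceClaim, so both see
# the same GPU. Names here are illustrative, not copied from the repo.
apiVersion: v1
kind: Pod
metadata:
  name: pod
  namespace: gpu-test2
spec:
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu   # one claim -> one GPU
  containers:
  - name: ctr0
    image: ubuntu:24.04
    command: ["bash", "-c", "nvidia-smi -L; sleep infinity"]
    resources:
      claims:
      - name: shared-gpu    # same claim ...
  - name: ctr1
    image: ubuntu:24.04
    command: ["bash", "-c", "nvidia-smi -L; sleep infinity"]
    resources:
      claims:
      - name: shared-gpu    # ... referenced by both containers
```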
````diff
 
-REGISTRY=nvcr.io/nvidia
-IMAGE=k8s-dra-driver-gpu
-PLATFORM=ubi9
+Container log inspection then indeed reveals that both containers operate on the same GPU device:
 
-sudo true
-make -f deployments/container/Makefile build-${PLATFORM}
-docker tag ${REGISTRY}/${IMAGE}:${VERSION}-${PLATFORM} ${REGISTRY}/${IMAGE}:${VERSION}
-docker save ${REGISTRY}/${IMAGE}:${VERSION} > image.tgz
-sudo ctr -n k8s.io image import image.tgz
+```bash
+$ kubectl logs pod -n gpu-test2 --all-containers --prefix
+[pod/pod/ctr0] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
+[pod/pod/ctr1] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
 ```
 
-## Information on advanced usage such as MIG.
+## Contributing
 
-This includes setting configuring MIG on the host using mig-parted. Some of the demo scripts included
-in ./demo/ require this.
+Contributions require a Developer Certificate of Origin (DCO, see [CONTRIBUTING.md](https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/CONTRIBUTING.md)).
 
-```
-cat <<EOF | sudo -E nvidia-mig-parted apply -f -
-version: v1
-mig-configs:
-  half-half:
-    - devices: [0,1,2,3]
-      mig-enabled: false
-    - devices: [4,5,6,7]
-      mig-enabled: true
-      mig-devices: {}
-EOF
-```
--->
+## Support
+
+Please open an issue on the GitHub project for questions and for reporting problems.
+Your feedback is appreciated!
````
