Releases: NVIDIA/k8s-dra-driver-gpu
v25.3.0
This release marks the general availability of NVIDIA's DRA Driver for GPUs. It focuses on ComputeDomains, for robust and secure Multi-Node NVLink (MNNVL) for NVIDIA GB200 and similar systems. Official support for GPU allocation will be part of the next release.
No functional changes were added compared to the last release candidate.
Documentation for this DRA driver can be found here.
For background on how ComputeDomains enable support for MNNVL workloads on Kubernetes (and on NVIDIA GB200 systems in particular), see this doc, this slide deck, and this conference talk.
An outlook on upcoming changes can be found here, and our GitHub milestones provide a more fine-grained view into features planned for upcoming releases.
v25.3.0-rc.5
Commits since last release: v25.3.0-rc.4...v25.3.0-rc.5.
Fixes
- A race condition in the `ComputeDomain` controller was fixed which allowed for uncontrolled creation of objects under informer lag (#440, #441).
- The device name used to represent an individual GPU device in a `ResourceSlice` has been stabilized (#427, #428).
- Processing of device requests has been consolidated to not overlap with other DRA drivers (#435).
Improvements
- The GPU allocation side of the driver now supports the `pcieRoot` attribute, allowing for topologically aligned resource allocation across drivers (e.g. GPU-NIC alignment; see #213, #400, #401, #429).
- The `nvidiaCDIHookPath` Helm chart parameter has been added to allow using a pre-installed `nvidia-cdi-hook` executable instead of the embedded one (#430).
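As a rough sketch of how cross-driver alignment can be expressed with the upstream DRA API: a `matchAttribute` constraint asks the scheduler to pick devices whose attribute values are equal across requests. The device-class and attribute names below are illustrative assumptions, not confirmed by this release:

```yaml
# Hypothetical ResourceClaim requesting a GPU and a NIC behind the same
# PCIe root complex; the constraint spans both requests.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-nic-aligned
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com
    - name: nic
      exactly:
        deviceClassName: nic.example.com   # placeholder class from another DRA driver
    constraints:
    - requests: ["gpu", "nic"]
      matchAttribute: resource.kubernetes.io/pcieRoot   # assumed standardized attribute name
```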
Notable changes
- Updated the bundled NVIDIA Container Toolkit to `v1.18.0-rc.2` (see release notes for rc.1, rc.2).
- Driver internals are now based on the `k8s.io/api/resource/v1` types and use the latest components from the upcoming Kubernetes 1.34 release (#429).
- Other minor dependency updates were included (#422, #403, #437).
v25.3.0-rc.4
Release notes
Commits since last release: v25.3.0-rc.3...v25.3.0-rc.4. Changes are summarized below.
Fixes
- The logic for removing stale `ComputeDomain` node labels has been fixed and consolidated, which is especially important when workload pods are created and then deleted again in rapid succession (#404).
- A `ComputeDomain` update (an IMEX daemon pod IP change) did not reliably lead to a daemon restart (#407).
- The IMEX daemon's liveness probe's stderr was not collected (#407).
- The IMEX daemon's log output was not reliably collected, especially around shutdown (#407).
- The controller pod and IMEX daemon pods now explicitly run with `NVIDIA_VISIBLE_DEVICES=void`, which addresses various error symptoms in some environments (#402).
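The `void` sentinel is part of the NVIDIA container runtime's environment-variable interface: it tells the runtime to expose no GPU devices to the container (as opposed to `all` or an explicit device list). As an illustrative container-spec fragment (not the driver's actual manifest):

```yaml
# Hide all GPUs from this container via the NVIDIA container runtime.
env:
- name: NVIDIA_VISIBLE_DEVICES
  value: "void"
```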
Notable changes
- Container images are now based on `nvcr.io/nvidia/cuda:12.9.1-base-ubi9`.
v25.3.0-rc.3
Release notes
This release is an important milestone towards the general availability of the NVIDIA DRA Driver for GPUs. It focuses on improving support for NVIDIA's Multi-Node NVLink (MNNVL) in Kubernetes by delivering a number of `ComputeDomain` improvements and bug fixes.
All commits since the last release can be seen here: v25.3.0-rc.2...v25.3.0-rc.3. The changes are summarized below.
For background on how `ComputeDomain`s enable support for MNNVL workloads on Kubernetes (and on NVIDIA GB200 systems in particular), see this doc and this slide deck.
Improvements
- More predictable `ComputeDomain` cleanup semantics: deletion of a `ComputeDomain` is now immediately followed by resource teardown (instead of waiting for the workload to complete).
- Troubleshooting improvement: a new init container helps users set a correct value for the `nvidiaDriverRoot` Helm chart variable and overcome common GPU driver setup issues.
- All driver components are now based on the same container image (configurable via a Helm chart variable). This removes a dependency on Docker Hub and generally helps with compliance and reliability.
- IMEX daemons orchestrated by a `ComputeDomain` now communicate via pod IP (using a virtual overlay network instead of `hostNetwork: true`) to improve robustness and security.
- The dependency on a pre-provisioned NVIDIA Container Toolkit was removed.
Fixes
- `ComputeDomain` teardown now works even after a corresponding `ResourceClaim` was removed from the API server (#342).
- Fixed an issue where the IMEX daemon startup probe failed with "family not supported by protocol" (#328).
- Pod labels were adjusted so that e.g. `kubectl logs ds/nvidia-dra-driver-gpu-kubelet-plugin` actually yields plugin logs (#355).
- The IMEX daemon startup probe is now less timing-sensitive (d1f7c).
- Other minor fixes: #321, #334.
Notable changes
- Introduced an IMEX daemon wrapper allowing for more robust and flexible daemon reconfiguration and monitoring.
- Added support for the NVIDIA GPU Driver 580.x releases.
- Added support for the Blackwell GPU architecture in the GPU plugin of the DRA driver.
- The DRA library was updated to `v0.33.0` (cf. changes) for various robustness improvements (such as more reliable rolling upgrades).
Breaking changes
- The `nvidiaCtkPath` Helm chart variable does not need to be provided anymore (see above); doing so now results in an error.
The path forward
ComputeDomains
Future versions of the NVIDIA GPU driver (580+) will include IMEX daemons with support for communicating using DNS names in addition to raw IP addresses. This feature allows us to overcome a number of limitations inherent to the existing `ComputeDomain` implementation, with no breaking changes to the user-facing API.
Highlights include:
- Removal of the `numNodes` field in the `ComputeDomain` abstraction. Users will no longer need to pre-calculate how many nodes their (static) multi-node workload will ultimately span.
- Support for elastic workloads. The number of pods associated with a multi-node workload will no longer need to be fixed and forced to match the value of the `numNodes` field in the `ComputeDomain` the workload is running in.
- Support for running more than one pod per `ComputeDomain` on a given node. As of now, all pods of a multi-node workload are artificially forced to run on different nodes, even if there are enough GPUs on a single node to service more than one of them. This new feature will remove that restriction.
- Support for running pods from different `ComputeDomain`s on the same node. As of now, only one pod from any multi-node workload is allowed to run on a given node associated with a `ComputeDomain` (even if there are enough GPUs available to service more than one of them). This new feature will remove that restriction.
- Improved tolerance to node failures within an IMEX domain. As of now, if one node of an IMEX domain goes down, the entire workload needs to be shut down and rescheduled. This new feature will allow the failed node to be swapped in place, without needing to shut down the entire IMEX domain (of course, many types of failures may still require the workload to restart anyway to explicitly recover from a loss of state).
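For reference, the user-facing API these changes are designed to preserve looks roughly like the following sketch of a `ComputeDomain` manifest (based on the current CRD; exact `apiVersion` and field names may differ between versions, so consult the project documentation for the authoritative schema):

```yaml
# Illustrative ComputeDomain spanning a four-node MNNVL workload.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-domain
spec:
  numNodes: 4          # the field slated for removal, per the highlights above
  channel:
    resourceClaimTemplate:
      name: imex-channel-0   # claim template that workload pods reference
```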
We also plan on adding improvements to the general debuggability and observability of `ComputeDomain`s, including:
- Proper definition of a set of high-level states that a `ComputeDomain` can be in, to allow for robust automation.
- Export of metrics to allow for monitoring health and performance.
- Actionable error messages and description strings, as well as improved component logging, to facilitate troubleshooting.
GPUs
The upcoming 25.3.0 release will not include official support for allocating GPUs (only `ComputeDomain`s). In the following release (25.8.0), we will add official support for allocating GPUs. The 25.8.0 release will be integrated with the NVIDIA GPU Operator and will no longer need to be installed as a standalone Helm chart.
Note: The DRA feature in upstream Kubernetes is slated to go GA in August. The 25.8.0 release of the NVIDIA DRA driver for GPUs is planned to coincide with that.
Features we plan to include in the 25.8.0 release:
- GPU selection via complex constraints
- Support for having multiple GPU types per node
- Controlled GPU sharing via ResourceClaims
- User-mediated Time-slicing across a subset of GPUs on a node
- User-mediated MPS sharing across a subset of GPUs on a node
- Allocation of statically partitioned MIG devices
- Custom policies to align multiple resource types (e.g. GPUs, CPUs, and NICs)
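As a rough sketch of what controlled sharing via `ResourceClaim`s looks like with the upstream DRA API (pod, claim, and image names here are illustrative assumptions): containers that reference the same claim share the device(s) allocated to it.

```yaml
# Hypothetical pod in which two containers share one allocated GPU
# by referencing the same pre-created ResourceClaim.
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: single-gpu-claim   # placeholder claim name
  containers:
  - name: worker-a
    image: nvcr.io/nvidia/cuda:12.9.1-base-ubi9
    resources:
      claims:
      - name: shared-gpu
  - name: worker-b
    image: nvcr.io/nvidia/cuda:12.9.1-base-ubi9
    resources:
      claims:
      - name: shared-gpu
```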
Features for future releases in the near term:
- Dynamic allocation of MIG devices
- System-mediated sharing of GPUs via Time-slicing and MPS
- “Management” pods with access to all GPUs / MIG devices without allocating them
- Dynamic swapping of NVIDIA driver with vfio driver depending on intended use of GPU
- Ability to use DRA to allocate GPUs with the "traditional" API (e.g. `nvidia.com/gpu: 2`)
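For comparison, the "traditional" request referred to above is the device-plugin-style extended-resource syntax; the compatibility feature would let such workloads keep this fragment unchanged while DRA satisfies the allocation underneath:

```yaml
# Classic extended-resource GPU request in a container spec.
resources:
  limits:
    nvidia.com/gpu: 2
```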