Collaborator

@klueska klueska commented Jul 15, 2025

The device minor is more stable than the index in that it does not change once a node has been brought online. The minor number may change across a node reboot, but it won't change if one of the GPUs falls off the bus or is explicitly drained / blocklisted from the driver (whereas the index shifts to close the gap left by any missing GPUs).
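To illustrate the difference, here is a minimal sketch, not the driver's actual code. The `gpu` type, the `namesByIndex` / `namesByMinor` helpers, the UUIDs, and the `gpu-%d` naming scheme are all assumptions made for this example. It simulates one GPU disappearing from enumeration and compares the name a surviving GPU ends up with under each scheme.

```go
package main

import "fmt"

// gpu is a hypothetical stand-in for a device as enumerated by the driver.
type gpu struct {
	UUID  string
	Minor int // device minor, e.g. 2 for /dev/nvidia2
}

// namesByIndex names each visible GPU by its position in the enumeration.
func namesByIndex(gpus []gpu) map[string]string {
	names := make(map[string]string, len(gpus))
	for i, g := range gpus {
		names[g.UUID] = fmt.Sprintf("gpu-%d", i)
	}
	return names
}

// namesByMinor names each visible GPU by its device minor.
func namesByMinor(gpus []gpu) map[string]string {
	names := make(map[string]string, len(gpus))
	for _, g := range gpus {
		names[g.UUID] = fmt.Sprintf("gpu-%d", g.Minor)
	}
	return names
}

func main() {
	all := []gpu{{"GPU-aaaa", 0}, {"GPU-bbbb", 1}, {"GPU-cccc", 2}}

	// Simulate GPU-bbbb falling off the bus: it no longer shows up in
	// enumeration, so the index of every later GPU shifts down by one.
	remaining := []gpu{all[0], all[2]}

	// Index-based name of GPU-cccc before and after the loss.
	fmt.Println(namesByIndex(all)["GPU-cccc"], namesByIndex(remaining)["GPU-cccc"]) // gpu-2 gpu-1

	// Minor-based name of GPU-cccc before and after the loss.
	fmt.Println(namesByMinor(all)["GPU-cccc"], namesByMinor(remaining)["GPU-cccc"]) // gpu-2 gpu-2
}
```

With index-based naming, the surviving device is silently renamed; with minor-based naming it keeps its name for as long as the node stays up.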

Fixes #427

Signed-off-by: Kevin Klues <[email protected]>
Collaborator

@jgehrcke jgehrcke left a comment

🚀 LGTM

Thanks for the exhaustive discussion on the corresponding issue.

@hase1128

When will this PR be merged?
I want to test the latest version of the NVIDIA DRA driver.

@elezar
Member

elezar commented Jul 31, 2025

@klueska should a similar change be made in the CDI spec generation logic?

Collaborator

@jgehrcke jgehrcke left a comment

Let's get this in.

@jgehrcke jgehrcke merged commit 1207db6 into NVIDIA:main Jul 31, 2025
7 checks passed
@klueska klueska added this to the v25.3.0 milestone Aug 13, 2025
@klueska klueska deleted the id-as-minor branch August 20, 2025 21:26
Successfully merging this pull request may close these issues.

The GPU Names within ResourceSlice change unintentionally