Skip to content

Conversation

jgehrcke
Copy link
Collaborator

@jgehrcke jgehrcke commented May 6, 2025

Generally -- as discussed with @klueska and @pohly -- resource cleanup (i.e., device unprepare) must not rely on ResourceClaim objects being available in the API server. Cleanup must built so that the only external input parameter is the ResourceClaim UID.

Any state that has to be maintained between Prepare and Unprepare (to ensure proper teardown) can of course be persisted node-locally. Checkpointing logic lends itself for maintaining such state.

Hence, this patch persists resource claim data upon Prepare() so that it can be re-discovered (by claim UID) during Unpreapre().

As such, this patch addresses #340.

Notably, we also don't want to fetch the claim object from the API server as an optimization or so. We have to figure out how to reliably do resource teardown using purely local state.

Initial testing looks good:

$ kubectl delete -f ./demo/specs/imex/channel-injection.yaml
computedomain.resource.nvidia.com "imex-channel-injection" deleted
pod "imex-channel-injection" deleted
± kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-vbv6v
...
I0506 10:48:01.503802       1 driver.go:149] UnprepareResourceClaims called with 1 claim(s)
I0506 10:48:01.510169       1 driver.go:222] unprepared devices for claim 'nvidia-dra-driver-gpu/imex-channel-injection-r4qtb-2ld9b-compute-domain-daemon-rzscg:36fa6b1d-ca97-4c72-813f-cc921ccfbe9d'
I0506 10:48:01.510206       1 nonblockinggrpcserver.go:159] "handling request succeeded" logger="dra" requestID=7 method="/k8s.io.kubelet.pkg.apis.dra.v1beta1.DRAPlugin/NodeUnprepareResources" response="&NodeUnprepareResourcesResponse{Claims:map[string]*NodeUnprepareResourceResponse{36fa6b1d-ca97-4c72-813f-cc921ccfbe9d: &NodeUnprepareResourceResponse{Error:,},},}"

Copy link

copy-pr-bot bot commented May 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@@ -12,14 +12,16 @@ type Checkpoint struct {
}

type CheckpointV1 struct {
PreparedClaims PreparedClaims `json:"preparedClaims,omitempty"`
PreparedClaims PreparedClaims `json:"preparedClaims,omitempty"`
PreparedRawClaims PreparedRawClaims `json:"preparedRawClaims,omitempty"`
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted for introducing a new top-level key for storing the ResourceClaim API objects. The type/name PreparedClaims was already taken.

Let's find a better name than PreparedRawClaims. PreparedResourceClaims? PreparedClaimObjects?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now repurposed the old name PreparedClaims.

What was previously called PreparedClaims is now called PreparedDevicesByClaimUID.

// device was never prepared (assume that Prepare+Checkpoint are done
// transactionally). Note that claimRef.String() contains namespace,
// name, UID.
klog.Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String())
return nil
Copy link
Collaborator Author

@jgehrcke jgehrcke May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first, I returned an error. Then I tested a fast creation/deletion cycle:

$ kubectl apply -f ./demo/specs/imex/channel-injection.yaml

followed a few seconds later by

$ kubectl delete -f ./demo/specs/imex/channel-injection.yaml
computedomain.resource.nvidia.com "imex-channel-injection" deleted
pod "imex-channel-injection" deleted
^C

The delete hang forever. ComputeDomain cleanup was stuck in

E0506 11:09:43.251572       1 workqueue.go:99] Failed to reconcile work item: error unpreparing devices for claim 'default/imex-c
hannel-injection-imex-channel-0-f2p49:cfcde21e-a350-496e-ba28-6a4b60aefc9c': unprepare failed: claim not found in checkpoint data
: default/imex-channel-injection-imex-channel-0-f2p49:cfcde21e-a350-496e-ba28-6a4b60aefc9c

I suppose I deleted objects so early that the node-local plugin never got to Prepare() the device.

In that state, the claim is expectedly not referred to in the checkpoint data.

Hence, this should not be treated as an error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hence, this should be treated as an error.

Do you mean that this should NOT be treated as an error?

Copy link
Collaborator Author

@jgehrcke jgehrcke May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that this should NOT be treated as an error?

Yes :) I live in an inverted world. Edited.

}

if claim.Status.Allocation == nil {
return true, fmt.Errorf("no allocation set in ResourceClaim %s in namespace %s", claim.Name, claim.Namespace)
Copy link
Collaborator Author

@jgehrcke jgehrcke May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it possible for Unprepare to be invoked for a claim with claim.Status.Allocation == nil?

As the patch currently stands, we only Unprepare actively when the claim has been previously checkpointed, and we only checkpoint a claim in Prepare, i.e. when Status.Allocation != nil.

When the claim has not been checkpointed we don't error out on Unprepare but pretend success via noop-unprepare. Does that make sense?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update, after confirming with Kevin: removal of the if claim.Status.Allocation == nil check makes sense.

About the noop-unprepare I am rather certain that's what we want but I want to check again with Kevin too.

@jgehrcke jgehrcke force-pushed the jp/unprepare-from-checkpoint branch from ec03632 to 9b7ac3c Compare May 7, 2025 09:19
@@ -12,14 +12,16 @@ type Checkpoint struct {
}

type CheckpointV1 struct {
PreparedClaims PreparedClaims `json:"preparedClaims,omitempty"`
PreparedDevicesByClaimUID PreparedDevicesByClaimUID `json:"PreparedDevicesByClaimUID,omitempty"`
PreparedClaims PreparedClaims `json:"preparedClaims,omitempty"`
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: watch out, I propose to use the old term PreparedClaims for something new (it is a great name for what I want to introduce here: a collection of resourceapi.ResourceClaim objects).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. This explains why the GPU plugin uses the NEW name -- since it doesn't have global claims like the compute domain driver.

@jgehrcke jgehrcke changed the title [WIP] Do not GET claim in CD unprepare (use local checkpoint) Do not GET claim in CD unprepare (use local checkpoint) May 7, 2025
@jgehrcke jgehrcke changed the title Do not GET claim in CD unprepare (use local checkpoint) Use only checkpoint data for CD unprepare (do not GET claim) May 7, 2025
@jgehrcke jgehrcke marked this pull request as ready for review May 7, 2025 09:41
Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple questions, LGTM

delete(preparedClaims, claimUID)

if err := s.checkpointManager.CreateCheckpoint(DriverPluginCheckpointFile, checkpoint); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious, should this be maybe called UpdateCheckpoint ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. However, we in fact create a new checkpoint here. A checkpoint is a self-contained snapshot of the entire state, atomically written.

That also means: persisting state after state mutation requires writing a new checkpoint.

The interface is given by https://github.com/kubernetes/kubernetes/blob/60a317eadfcb839692a68eab88b2096f4d708f4f/pkg/kubelet/checkpointmanager/checkpoint_manager.go :)

}
preparedDevicesByClaimUID := checkpoint.V1.PreparedDevicesByClaimUID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used much later on, does it make sense to move this further down, closer to when it's used?

err := s.cdi.DeleteClaimSpecFile(claimUID)
if err != nil {
return fmt.Errorf("unable to delete CDI spec file for claim: %w", err)
}

delete(preparedDevicesByClaimUID, claimUID)
delete(preparedClaims, claimUID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Are preparedDevicesByClaimUID and preparedClaims references to the maps in the checkpoint object? Meaning, does deleting them here affect the checkpoint object so that creating it below writes the mutated state?

Copy link
Collaborator Author

@jgehrcke jgehrcke May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. In preparedClaims := checkpoint.V1.PreparedClaims: is this a copy assignment operator (as we would call this in C++)? If it is then we mutate the copy and the mutation is not reflected in the checkpoint. I will test this, and also look this topic up in the Golang language spec.

Copy link
Collaborator Author

@jgehrcke jgehrcke May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test program:

package main

import (
	"encoding/json"
	"fmt"
)

type Checkpoint struct {
	V1 *CheckpointV1
}

type CheckpointV1 struct {
	SomeMap SomeMap
}

type SomeMap map[string]string

func main() {

	m := SomeMap{
		"eggs":  "yes",
		"bacon": "no",
	}

	c := Checkpoint{V1: &CheckpointV1{SomeMap: m}}

	out, _ := json.Marshal(c)
	fmt.Printf("%s\n", out)

	m2 := c.V1.SomeMap
	delete(m2, "eggs")

	out2, _ := json.Marshal(c)
	fmt.Printf("%s\n", out2)
}

}

Output:

$ go run assignment-test.go 
{"V1":{"SomeMap":{"bacon":"no","eggs":"yes"}}}
{"V1":{"SomeMap":{"bacon":"no"}}}

The mutation does apply to the checkpoint.

Why does it apply? I will read more.

Edit: I think I now understood that Go will not implicitly make a copy of your map payload when you don't very explicitly want to do so.

I looked at https://stackoverflow.com/a/31173039/145400 and also https://medium.com/@shanechristian03/lets-talk-about-reference-types-in-go-b44cf4938e2d.

It should be obvious by now that setting b[“four”] = 4 within function modify will also affect the original map b, because a map is essentially a pointer under the hood.

After that, we make a copy of a and assign it to b. Although b is a copy, both a and b still point to the same underlying array.

You will not find the term ‘reference types’ in the official Go language specification. The term is not officially supported in Go, most likely because it can cause confusion. However, it’s still widely used within the Go community.

if err := s.checkpointManager.CreateCheckpoint(DriverPluginCheckpointFile, checkpoint); err != nil {
return nil, fmt.Errorf("unable to sync to checkpoint: %v", err)
}

return preparedClaims[claimUID].GetDevices(), nil
return preparedDevicesByClaimUID[claimUID].GetDevices(), nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this changed between where we set it on line 153 and here? Is returning preparedDevices.GetDevices() clearer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My edit here was rather ignorant. I see. I think we can do what you propose. Will report back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also OK to leave it as is if we feel it's out of scope for this PR.

@@ -153,42 +156,57 @@ func (s *DeviceState) Prepare(ctx context.Context, claim *resourceapi.ResourceCl
return nil, fmt.Errorf("unable to create CDI spec file for claim: %w", err)
}

preparedClaims[claimUID] = preparedDevices
// Add ResourceClaim API object to checkpoint so that for Unprepare() only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there situations when we should also query the API server (e.g. if the local state is not sufficient?).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder the same. That's certainly a key question here.

There should not be such situations. Conceptually, we have to build proper cleanup w/o API server input (input other than ResourceClaim UID). I have edited the PR description a bit.

Comment on lines 140 to 141
if preparedDevicesByClaimUID[claimUID] != nil {
return preparedDevicesByClaimUID[claimUID].GetDevices(), nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could also write this as:

Suggested change
if preparedDevicesByClaimUID[claimUID] != nil {
return preparedDevicesByClaimUID[claimUID].GetDevices(), nil
if preparedDevices, ok := preparedDevicesByClaimUID[claimUID]; ok {
return preparedDevices.GetDevices(), nil

Not sure where I stand on readability here though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also could be considered out of scope).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment anyway. Made me read a bit about "nil maps" and the zero value. I like the current readability, but it's important to understand of course what the current zero value is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :) (we now return preparedDevices.GetDevices(), nil)


// key: stringified claim UUID
type PreparedDevicesByClaimUID map[string]PreparedDevices
type PreparedClaims map[string]resourceapi.ResourceClaim
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO after sync code review with Kevin: store only claim.Status field, not entire claim.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if preparedClaims[claimUID] != nil {
return preparedClaims[claimUID].GetDevices(), nil
// Has previously been prepared by us.
// (Under which circumstances does this get called again?)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: adjust comment (this may happen)


// Rely on local checkpoint state for ability to clean up.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reliance on only checkpointing data: of course, checkpointing data is vulnerable to node failures (data may be lost in rare cases). But device state can be assumed to get lost / reset 'in sync' (e.g., as part of recovery from a failure affecting the checkpoint data it is likely for a reboot to happen).

@jgehrcke
Copy link
Collaborator Author

I updated the patch after discussion / review feedback. I tested the current state on Starwars. A bit of log output from there:

$ kubectl logs -n nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-vf69c | grep checkpoint
I0513 11:52:37.713374       1 device_state.go:193] unprepare noop: claim not found in checkpoint data: default/imex-channel-injection-imex-channel-0-mzs5m:92aed46e-20c1-4f96-8c7b-4750988147df
I0513 12:22:30.999820       1 device_state.go:193] unprepare noop: claim not found in checkpoint data: nvidia-dra-driver-gpu/imex-channel-injection-qkt2n-fdx6m-compute-domain-daemon-pkrnb:057eb479-27ff-4f97-b1c5-85bb7b59bbc1
I0513 12:33:08.278210       1 device_state.go:169] checkpoint written for claim 02faa1a6-f7c1-473d-98f3-8f5a1378b5a4
I0513 12:33:09.182360       1 device_state.go:169] checkpoint written for claim c515582f-eb9d-40c4-b6f3-3729110b4755
I0513 12:34:30.665436       1 device_state.go:169] checkpoint written for claim 41e2a7ee-8c00-49ba-b02a-d15d4099d89e

@@ -115,13 +115,13 @@ func NewDeviceState(ctx context.Context, config *Config) (*DeviceState, error) {
}

for _, c := range checkpoints {
if c == DriverPluginCheckpointFile {
if c == DriverPluginCheckpointFileBasename {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change; just a variable rename. Was looking for the checkpoint file path by following code. Renamed this because a hard-learned lesson for me is to use the term "file" only for abstract file objects, and never for basenames or absolute paths. In this case, we store a basename.

@@ -147,7 +146,7 @@ func (d *driver) PrepareResourceClaims(ctx context.Context, claims []*resourceap
}

func (d *driver) UnprepareResourceClaims(ctx context.Context, claimRefs []kubeletplugin.NamespacedObject) (map[types.UID]error, error) {
klog.Infof("UnprepareResourceClaims called with %d claim(s)", len(claimRefs))
klog.V(6).Infof("UnprepareResourceClaims called with %d claim(s)", len(claimRefs))
Copy link
Collaborator Author

@jgehrcke jgehrcke May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we typically see pairs of these log messages:

I0513 12:35:44.198425       1 driver.go:149] UnprepareResourceClaims called with 1 claim(s)
I0513 12:35:44.204757       1 driver.go:210] unprepared devices for claim 'nvidia-dra-driver-gpu/imex-channel-injection-rmr8b-zh5nx-compute-domain-daemon-xhlq2:c515582f-eb9d-40c4-b6f3-3729110b4755'

Instead of removing the first msg, I for now just wanted to mark it as more 'debuggy' (lower severity).

return fmt.Errorf("unable to sync to checkpoint: %v", err)
// Write new checkpoint reflecting that all devices for this claim have been
// unprepared (by virtue of removing its UID from all mappings).
preparedDevicesByClaimUID := checkpoint.V1.PreparedDevicesByClaimUID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

@@ -12,14 +12,16 @@ type Checkpoint struct {
}

type CheckpointV1 struct {
PreparedClaims PreparedClaims `json:"preparedClaims,omitempty"`
PreparedDevicesByClaimUID PreparedDevicesByClaimUID `json:"PreparedDevicesByClaimUID,omitempty"`
PreparedClaimStates PreparedClaimStates `json:"PreparedClaimStates,omitempty"`
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(from meeting with Kevin) maybe just call this Statuses

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json property: camelCase 🐫

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json property: camelCase

done

status, found := preparedClaimStates[claimUID]
if !found {
// Not an error: if this claim UID is not in the checkpoint then this
// device was never prepared (assume that Prepare+Checkpoint are done
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or has already been unprepared

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// device was never prepared (assume that Prepare+Checkpoint are done
// transactionally). Note that claimRef.String() contains namespace,
// name, UID.
klog.Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review capitalization of log msgs

// unprepared (by virtue of removing its UID from all mappings).
preparedDevicesByClaimUID := checkpoint.V1.PreparedDevicesByClaimUID
delete(preparedDevicesByClaimUID, claimUID)
delete(preparedClaimStates, claimUID)
Copy link
Collaborator Author

@jgehrcke jgehrcke May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instead have just one top-level map, and every value in that map would be an object with a .Status and a .PreparedDevices property/field.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

func newCheckpoint() *Checkpoint {
pc := &Checkpoint{
Checksum: 0,
V1: &CheckpointV1{
PreparedClaims: make(PreparedClaims),
PreparedDevicesByClaimUID: make(PreparedDevicesByClaimUID),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the future: make gpu plugin and CD plugin code paths symmetric w.r.t. which data to checkpoint, and which structures to use during unprepare

@jgehrcke
Copy link
Collaborator Author

todo: restructure the checkpoint, to have one high-level map indexed by UID

@jgehrcke jgehrcke force-pushed the jp/unprepare-from-checkpoint branch from 355cea1 to f5f77a9 Compare May 15, 2025 13:44
@@ -12,14 +13,22 @@ type Checkpoint struct {
}

type CheckpointV1 struct {
PreparedClaims PreparedClaims `json:"preparedClaims,omitempty"`
PreparedClaimsByUID `json:"preparedClaims,omitempty"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The json field name should match the go field name (modulo the casing)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that. Of course.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 23 to 24
Status resourceapi.ResourceClaimStatus
PreparedDevices PreparedDevices
Copy link
Collaborator

@klueska klueska May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need json tags here to, otherwise the default to capitalized same as the go field name (this is true at every level of nesting)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Right.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jgehrcke jgehrcke force-pushed the jp/unprepare-from-checkpoint branch from f5aa817 to a025dd1 Compare May 15, 2025 15:05
@@ -12,14 +13,22 @@ type Checkpoint struct {
}

type CheckpointV1 struct {
PreparedClaims PreparedClaims `json:"preparedClaims,omitempty"`
PreparedClaimsByUID `json:"preparedClaimsByUID,omitempty"`
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I forgot a PreparedClaimsByUID type/name definition here (depending on how you look at this). But this compiles and works. Why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems I accidentally created an embedded field here.

A field declared with a type but no explicit field name is called an embedded field

The unqualified type name acts as the field name.

https://go.dev/ref/spec#Struct_types

This works, but we don't want this here.

@jgehrcke jgehrcke force-pushed the jp/unprepare-from-checkpoint branch from a025dd1 to a3391aa Compare May 15, 2025 15:29
@jgehrcke
Copy link
Collaborator Author

jgehrcke commented May 15, 2025

Thanks for great feedback and discussion. I manually tested the last commit on this branch on Starwars (for both, GPUs and CDs, and involving prepare and unprepare).

Log excerpt for CD:

I0515 15:34:16.810113       1 driver.go:119] PrepareResourceClaims called with 1 claim(s)
...
I0515 15:34:16.822785       1 device_state.go:170] checkpoint written for claim 73de0384-716f-480e-8caa-00bce7016114
I0515 15:34:16.822797       1 driver.go:198] Returning newly prepared devices for claim '73de0384-716f-480e-8caa-00bce7016114': [{[daemon] sc-starwars-mab6-b00 daemon-0 [k8s.compute-domain.nvidia.com/device=all k8s.compute-domain.nvidia.com/claim=73de0384-716f-480e-8caa-00bce7016114-daemon-0]}]
...
I0515 15:36:49.578057       1 driver.go:149] UnprepareResourceClaims called with 1 claim(s)
I0515 15:36:49.585093       1 driver.go:210] unprepared devices for claim 'nvidia-dra-driver-gpu/imex-channel-injection-psg26-rw7hw-compute-domain-daemon-85n8l:73de0384-716f-480e-8caa-00bce7016114'
...

(inconsistent capitalization of log msgs: will look into this in a separate patch).

Also had a brief look at the JSON anatomy in the checkpoint file. Looks sane (for the record, two bad screenshots:
https://github.com/user-attachments/assets/35ac0266-98cd-45e0-b227-42a3b5ffe76a, https://github.com/user-attachments/assets/ccdd6139-58a7-4646-b683-4727241afce3)

@@ -12,14 +13,22 @@ type Checkpoint struct {
}

type CheckpointV1 struct {
PreparedClaims PreparedClaims `json:"preparedClaims,omitempty"`
PreparedClaimsByUID PreparedClaimsByUID `json:"preparedClaimsByUID,omitempty"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we drop the ByUID in the field name (but keep it in the type)? If you feel strongly we can keep it, but it feels a bit awkward to have this detail exposed in the actual checkpointed file (where its contents will clearly be indexing here by uuid).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done! I like the shorter name.

@jgehrcke jgehrcke force-pushed the jp/unprepare-from-checkpoint branch from a3391aa to 35a4158 Compare May 16, 2025 09:28
Copy link
Collaborator

@klueska klueska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the back-on-forth on this.

Looking great now.

/lgtm
/approve

@jgehrcke jgehrcke merged commit 79e0546 into NVIDIA:main May 16, 2025
7 checks passed
@klueska klueska added this to the v25.3.0 milestone Aug 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Backlog
Development

Successfully merging this pull request may close these issues.

4 participants