Skip to content

Dmabuf IPC support for amdgpu plugin (v2) #2613

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: criu-dev
Choose a base branch
from

Conversation

fdavid-amd
Copy link
Contributor

This patch set, in combination with the accompanying kernel patch set, is the
second draft of work to support amdgpu dmabuf IPC. Since the first post, the
patches have been split to make them more readable and made
backwards-compatible.

Copy link

A friendly reminder that this PR had no activity for 30 days.

@avagin avagin added no-auto-close Don't auto-close as a stale issue and removed stale-pr labels May 19, 2025
@avagin
Copy link
Member

avagin commented May 19, 2025

@fdavid-amd how is it going? Let me know if you need help with this pr.

message criu_render_node {
required uint32 gpu_id = 1;
required uint32 id = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like these changes break compatibility - a newer version of CRIU would not be able to restore a checkpoint created with a previous version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true - I wasn't aware that was a type of backwards-compatibility we committed to.

@fdavid-amd fdavid-amd force-pushed the dmabuf-post branch 4 times, most recently from b057779 to 456978a Compare June 18, 2025 18:12
amdgpu represents allocated device memory as a memory mapping
of the device file. This is a non-standard VMA that must
be handled by the plugin, not the normal VMA code.

Ignore all VMAs on device files.

Signed-off-by: David Francis <[email protected]>
amdgpu dmabuf CRIU requires the ability of the amdgpu plugin
to retry.

Change files_ext.c to read a response of 1 from a plugin restore
function to mean retry.

Signed-off-by: David Francis <[email protected]>
@fdavid-amd fdavid-amd force-pushed the dmabuf-post branch 3 times, most recently from 45f796c to 878a313 Compare July 8, 2025 14:05
@fdavid-amd
Copy link
Contributor Author

@avagin
Sorry for leaving this untouched for so long, I was revising the kernel side of the changes. Could you take a look at this again?

@fdavid-amd
Copy link
Contributor Author

Some of the build failures I don't understand - they might be an issue with this patch set or they might be an issue with the CI. There's a lot of 'Error: ipv6: address already assigned.', which could possibly be a result of the way that I'm allocating the plugin socket but seems unlikely.
The linter errors I do understand - they're flagging real grammar and syntax errors in the amdgpu-drm.h and drm.h files - but the point of those files is that they're exact copies of the kernel header files of the same names, so I don't want to change them.

@fdavid-amd
Copy link
Contributor Author

@avagin sorry for the repeat ping, but there is something of a rush on my part to get kernel patch https://lists.freedesktop.org/archives/dri-devel/2025-July/514922.html merged in the next week or so (it adds the CHANGE_HANDLE ioctl). As part of merging that patch, I need to demonstrate that it has written and reasonably stable userspace code that uses it (which would be these CRIU patches). It would be a big help to me if you could say if the general approach taken by these patches is acceptable or unacceptable.
Thanks,
David

@avagin
Copy link
Member

avagin commented Jul 15, 2025

@fdavid-amd The general approach looks good to me. We just need to adjust how we manage file descriptors, but it is just a criu internal thing.


list_for_each_entry(p, &amdgpu_processes, l) {
if (p->pid == pid) {
dmabuf_socket_fd = find_unused_fd_pid(pid);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_unused_fd_pid will not work without adding a few file entry (look at add_fake_fle). We can use fdstore, but I think it might be unneeded. Pls read other comment first.

else if (ret)
return -1;

id = fdstore_add(fd);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you put a file descriptor in the fdstore, it becomes available to all other processes. This means you don't need to send it to each process separately; you can just put it in the fdstore once. You can use shared memory to synchronize/communicate between processes, so you probably don't need to create dmabuf_socket_fd sockets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. Is there an example of shared memory between CRIU processes without sockets that I could reference?

@avagin
Copy link
Member

avagin commented Jul 16, 2025

It would be a big help to me if you could say if the general approach taken by these patches is acceptable or unacceptable.

BTW we can merge the current version to the criu-dev branch and do all other adjustments on top of it.

@fdavid-amd
Copy link
Contributor Author

This code is currently not functional because it relies on ioctls that are not merged into any kernel branch; is is still okay to merge?

For amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs
the amdgpu drm header file, copied from the kernel's includes.

Signed-off-by: David Francis <[email protected]>
The amdgpu plugin usually calls drm ioctls through the libdrm
wrappers. However, amdgpu restore requires dealing with dmabufs
and gem handles directly, which means drm ioctls must be
called directly.

Add the drm.h header (from the kernel's uapi).

Signed-off-by: David Francis <[email protected]>
Buffer objects held by the amdgpu drm driver are checkpointed with
the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling
is in amdgpu_plugin_drm.h

Handling of imported buffer objects may require dmabuf fds to be
transferred between processes. These occur over fdstore, with the
handle-fstore id relationships kept in shread memory. There is a
new plugin callback: RESTORE_INIT to create the shared memory.

During checkpoint, track shared buffer objects, so that buffer objects
that are shared across processes can be identified.

During restore, track which buffer objects have been restored. Retry
restore of a drm file if a buffer object is imported and the
original has not been exported yet. Skip buffer objects that have
already been completed or cannot be completed in the current restore.

So drm code can use sdma_copy_bo, that function no longer requires
kfd bo structs

Update the protobuf messages with new amdgpu drm information.

Signed-off-by: David Francis <[email protected]>
The amdgpu plugin was counting how many files were checkpointed
to determine when it should close the device files.

The number of device files is not consistent; a process may
have multiple copies of the drm device files open.

Instead of doing this counting, add a new callback after all
files are checkpointed, so plugins can clean up their
resources at an appropriate time.

Signed-off-by: David Francis <[email protected]>
@fdavid-amd fdavid-amd force-pushed the dmabuf-post branch 2 times, most recently from 3bace7e to 84fb396 Compare August 7, 2025 13:59
@rst0git
Copy link
Member

rst0git commented Aug 8, 2025

This code is currently not functional because it relies on ioctls that are not merged into any kernel branch; is is still okay to merge?

The way we usually handle this problem in CRIU is by checking if the required kernel functionality is available and disabling features that are not supported. However, it might be better to wait until the ioctls are merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-auto-close Don't auto-close as a stale issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants