-
Notifications
You must be signed in to change notification settings - Fork 654
Dmabuf IPC support for amdgpu plugin (v2) #2613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: criu-dev
Are you sure you want to change the base?
Conversation
A friendly reminder that this PR had no activity for 30 days. |
@fdavid-amd how is it going? Let me know if you need help with this pr. |
message criu_render_node { | ||
required uint32 gpu_id = 1; | ||
required uint32 id = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like these changes break compatibility - a newer version of CRIU would not be able to restore a checkpoint created with a previous version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true - I wasn't aware that was a type of backwards-compatibility we committed to.
b057779
to
456978a
Compare
amdgpu represents allocated device memory as a memory mapping of the device file. This is a non-standard VMA that must be handled by the plugin, not the normal VMA code. Ignore all VMAs on device files. Signed-off-by: David Francis <[email protected]>
amdgpu dmabuf CRIU requires the ability of the amdgpu plugin to retry. Change files_ext.c to read a response of 1 from a plugin restore function to mean retry. Signed-off-by: David Francis <[email protected]>
45f796c
to
878a313
Compare
@avagin |
Some of the build failures I don't understand - they might be an issue with this patch set or they might be an issue with the CI. There's a lot of 'Error: ipv6: address already assigned.', which could possibly be a result of the way that I'm allocating the plugin socket but seems unlikely. |
@avagin sorry for the repeat ping, but there is something of a rush on my part to get kernel patch https://lists.freedesktop.org/archives/dri-devel/2025-July/514922.html merged in the next week or so (it adds the CHANGE_HANDLE ioctl). As part of merging that patch, I need to demonstrate that it has written and reasonably stable userspace code that uses it (which would be these CRIU patches). It would be a big help to me if you could say if the general approach taken by these patches is acceptable or unacceptable. |
@fdavid-amd The general approach looks good to me. We just need to adjust how we manage file descriptors, but it is just a criu internal thing. |
plugins/amdgpu/amdgpu_plugin.c
Outdated
|
||
list_for_each_entry(p, &amdgpu_processes, l) { | ||
if (p->pid == pid) { | ||
dmabuf_socket_fd = find_unused_fd_pid(pid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
find_unused_fd_pid will not work without adding a few file entry (look at add_fake_fle). We can use fdstore, but I think it might be unneeded. Pls read other comment first.
plugins/amdgpu/amdgpu_plugin.c
Outdated
else if (ret) | ||
return -1; | ||
|
||
id = fdstore_add(fd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once you put a file descriptor in the fdstore, it becomes available to all other processes. This means you don't need to send it to each process separately; you can just put it in the fdstore once. You can use shared memory to synchronize/communicate between processes, so you probably don't need to create dmabuf_socket_fd sockets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. Is there an example of shared memory between CRIU processes without sockets that I could reference?
BTW we can merge the current version to the criu-dev branch and do all other adjustments on top of it. |
This code is currently not functional because it relies on ioctls that are not merged into any kernel branch; is is still okay to merge? |
For amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs the amdgpu drm header file, copied from the kernel's includes. Signed-off-by: David Francis <[email protected]>
The amdgpu plugin usually calls drm ioctls through the libdrm wrappers. However, amdgpu restore requires dealing with dmabufs and gem handles directly, which means drm ioctls must be called directly. Add the drm.h header (from the kernel's uapi). Signed-off-by: David Francis <[email protected]>
Buffer objects held by the amdgpu drm driver are checkpointed with the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling is in amdgpu_plugin_drm.h Handling of imported buffer objects may require dmabuf fds to be transferred between processes. These occur over fdstore, with the handle-fstore id relationships kept in shread memory. There is a new plugin callback: RESTORE_INIT to create the shared memory. During checkpoint, track shared buffer objects, so that buffer objects that are shared across processes can be identified. During restore, track which buffer objects have been restored. Retry restore of a drm file if a buffer object is imported and the original has not been exported yet. Skip buffer objects that have already been completed or cannot be completed in the current restore. So drm code can use sdma_copy_bo, that function no longer requires kfd bo structs Update the protobuf messages with new amdgpu drm information. Signed-off-by: David Francis <[email protected]>
The amdgpu plugin was counting how many files were checkpointed to determine when it should close the device files. The number of device files is not consistent; a process may have multiple copies of the drm device files open. Instead of doing this counting, add a new callback after all files are checkpointed, so plugins can clean up their resources at an appropriate time. Signed-off-by: David Francis <[email protected]>
3bace7e
to
84fb396
Compare
The way we usually handle this problem in CRIU is by checking if the required kernel functionality is available and disabling features that are not supported. However, it might be better to wait until the ioctls are merged. |
This patch set, in combination with the accompanying kernel patch set, is the
second draft of work to support amdgpu dmabuf IPC. Since the first post, the
patches have been split to make them more readable and made
backwards-compatible.