@rst0git rst0git commented Jul 31, 2024

The plugin hook PAUSE_DEVICES was recently introduced in a85f488. This hook was intended to execute the cuda-checkpoint tool before the process tree is frozen. However, the run_plugins() call was placed immediately after freeze_processes(). As a result, the cuda-checkpoint tool hangs when checkpointing CUDA applications running in containers, until the timeout alarm terminates it.

This problem can be reproduced with the following example:

sudo podman run -d --rm \
        --device nvidia.com/gpu=all --security-opt=label=disable \
        quay.io/radostin/cuda-counter

sudo podman container checkpoint -l -e /tmp/checkpoint.tar
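
The ordering problem described above can be illustrated with a small toy model. This is only a sketch, not CRIU's C code and not a description of how cuda-checkpoint works internally: the `Task` class, `handle_pause_request()`, `run_pause_devices_hook()` and `checkpoint()` are hypothetical names, and the model assumes (based on the hook being intended to run before the freeze) that a frozen task cannot respond to a pause request.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Toy stand-in for one process in the tree being checkpointed."""
    pid: int
    frozen: bool = False
    device_paused: bool = False

    def handle_pause_request(self) -> bool:
        # Assumption: responding to the pause request requires the task
        # to be running. A frozen task never acknowledges it.
        if self.frozen:
            return False
        self.device_paused = True
        return True


def freeze_processes(tasks):
    """Toy freeze: mark every task as frozen."""
    for task in tasks:
        task.frozen = True


def run_pause_devices_hook(tasks):
    """Return True if every task acknowledged the pause request.

    False models the tool waiting until the timeout alarm fires.
    """
    results = [task.handle_pause_request() for task in tasks]
    return all(results)


def checkpoint(tasks, hook_before_freeze):
    """Run the hook and the freeze in the requested order."""
    if hook_before_freeze:
        ok = run_pause_devices_hook(tasks)
        freeze_processes(tasks)
    else:
        freeze_processes(tasks)
        ok = run_pause_devices_hook(tasks)
    return "ok" if ok else "timeout"


if __name__ == "__main__":
    # Hook after the freeze: no task can respond.
    print(checkpoint([Task(1), Task(2)], hook_before_freeze=False))  # timeout
    # Hook before the freeze: tasks respond, then get frozen.
    print(checkpoint([Task(1), Task(2)], hook_before_freeze=True))   # ok
```

In this model the only difference between the failing and the working case is the position of the hook call relative to the freeze, which is the change this pull request describes.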

a85f488
criu/plugin: Introduce new plugin hooks PAUSE_DEVICES and CHECKPOINT_DEVICES to be used during pstree collection

Signed-off-by: Radostin Stoyanov <[email protected]>
@rst0git rst0git requested review from jesus-ramos and avagin July 31, 2024 21:50
@jesus-ramos jesus-ramos left a comment

LGTM. I guess I missed running through podman checkpoint and was just running criu against the host pid. Good catch.

@avagin avagin merged commit 7a27427 into checkpoint-restore:criu-dev Aug 7, 2024
32 of 38 checks passed
@rst0git rst0git deleted the cuda-fix-pause-devices branch August 7, 2024 15:59