@mattkur mattkur commented Nov 22, 2025

  • user_driver: backoff 250ms in set_keep_alive, just like open_device
  • copilot pr feedback
  • nvme_manager: re-enable multi-device servicing test, add some more logs to try and track down the problem
  • more logs

Copilot AI review requested due to automatic review settings November 22, 2025 22:02
@mattkur mattkur requested review from a team as code owners November 22, 2025 22:02
@github-actions github-actions bot added the unsafe label (Related to unsafe code) Nov 22, 2025
@github-actions

⚠️ Unsafe Code Detected

This PR modifies files containing unsafe Rust code. Extra scrutiny is required during review.

For more on why we check whole files, instead of just diffs, check out the Rustonomicon

Copilot finished reviewing on behalf of mattkur November 22, 2025 22:04
Copilot AI left a comment

Pull request overview

This is a debugging PR (marked "DO NOT MERGE") aimed at investigating NVMe servicing CI test failures. The changes add comprehensive logging and implement async retry logic to handle race conditions in VFIO device initialization.

Key Changes:

  • Converted VFIO open_device and set_keep_alive functions from synchronous to async to enable proper async delays instead of blocking thread sleeps
  • Added retry logic with 250ms backoff to set_keep_alive to handle ENODEV race conditions, matching existing behavior in open_device
  • Re-enabled the many_nvme_devices_servicing_heavy test that was previously ignored due to reliability issues
  • Added debug/info logging at critical points in NVMe manager initialization and worker lifecycle
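
The retry-on-ENODEV behavior summarized in the second bullet can be sketched roughly as follows. This is not the code from vfio_sys: the helper name `retry_on_enodev`, the attempt cap, and the injected `sleep` closure are assumptions made for illustration. The sketch is also synchronous so that it depends only on the standard library, whereas the PR performs the delay with an async timer.

```rust
use std::io;
use std::time::Duration;

/// Linux errno for "No such device" (the value of `libc::ENODEV`),
/// hard-coded here to avoid an external dependency.
const ENODEV: i32 = 19;

/// Hypothetical helper: runs `op` and retries it while it fails with ENODEV,
/// calling `sleep(backoff)` between attempts. `op` is always tried at least
/// once and at most `max_attempts` times (when that cap is nonzero). Any other
/// error, or ENODEV on the final attempt, is returned to the caller unchanged.
fn retry_on_enodev<T>(
    max_attempts: u32,
    backoff: Duration,
    mut op: impl FnMut() -> io::Result<T>,
    mut sleep: impl FnMut(Duration),
) -> io::Result<T> {
    let mut attempt: u32 = 1;
    loop {
        match op() {
            // The device may not be ready yet: wait, then try again.
            Err(e) if e.raw_os_error() == Some(ENODEV) && attempt < max_attempts => {
                sleep(backoff);
                attempt += 1;
            }
            // Success, a different error, or retries exhausted.
            other => return other,
        }
    }
}
```

Passing the delay in as a closure keeps the sketch testable without real waiting. A blocking caller could pass `std::thread::sleep` with `Duration::from_millis(250)`, while an async version would await a timer at the same point instead.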

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| vmm_tests/vmm_tests/tests/tests/x86_64/openhcl_linux_direct.rs | Re-enables previously ignored multi-device NVMe servicing test |
| vm/devices/user_driver/vfio_sys/src/lib.rs | Converts open_device and set_keep_alive to async with retry logic for ENODEV handling; improves logging with PCI ID context |
| vm/devices/user_driver/vfio_sys/Cargo.toml | Adds pal_async dependency required for async timer functionality |
| vm/devices/user_driver/src/vfio.rs | Updates callers to use async open_device and set_keep_alive functions |
| openhcl/underhill_core/src/worker.rs | Adds debug logging after NVMe manager initialization |
| openhcl/underhill_core/src/nvme_manager/manager.rs | Adds info logging for manager startup parameters and debug logging before main loop |
| Cargo.lock | Reflects new pal_async dependency in vfio_sys package |

@mattkur mattkur added the release-ci-required label (Add to a PR to trigger PR gates in release mode) Nov 23, 2025
@benhillis benhillis changed the title from "DO NOT MERGE: more logs to debug nvme servicing CI failures" to "Add more logs to debug nvme servicing CI failures" Nov 25, 2025
Labels

  • release-ci-required: Add to a PR to trigger PR gates in release mode
  • unsafe: Related to unsafe code