Add more logs to debug nvme servicing CI failures #2482
base: main
mattkur
commented
Nov 22, 2025
- user_driver: backoff 250ms in set_keep_alive, just like open_device
- copilot pr feedback
- nvme_manager: re-enable multi-device servicing test, add some more logs to try and track down the problem
- more logs
This PR modifies files containing `unsafe` code. For more on why we check whole files, instead of just diffs, check out the Rustonomicon.
Pull request overview
This is a debugging PR (marked "DO NOT MERGE") aimed at investigating NVMe servicing CI test failures. The changes add comprehensive logging and implement async retry logic to handle race conditions in VFIO device initialization.
Key Changes:
- Converted VFIO `open_device` and `set_keep_alive` functions from synchronous to async to enable proper async delays instead of blocking thread sleeps
- Added retry logic with 250ms backoff to `set_keep_alive` to handle ENODEV race conditions, matching existing behavior in `open_device`
- Re-enabled the `many_nvme_devices_servicing_heavy` test that was previously ignored due to reliability issues
- Added debug/info logging at critical points in NVMe manager initialization and worker lifecycle
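
To illustrate the retry-with-backoff pattern the key changes describe, here is a minimal sketch using only the standard library. It is not the code from this PR: `retry_on_enodev`, its parameters, and the tiny `block_on` executor are hypothetical names introduced for illustration, and the delay is passed in as a closure because the actual change relies on `pal_async` timers, whose API is not shown here.

```rust
use std::future::Future;
use std::io;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::Duration;

/// Linux errno value for "No such device".
const ENODEV: i32 = 19;

/// Hypothetical helper: retry `op` while it fails with ENODEV, awaiting
/// `sleep(backoff)` between attempts. Any other result is returned at once,
/// and the last ENODEV error is returned when attempts run out.
async fn retry_on_enodev<T, Op, Sleep, Fut>(
    mut op: Op,
    mut sleep: Sleep,
    max_attempts: u32,
    backoff: Duration,
) -> io::Result<T>
where
    Op: FnMut() -> io::Result<T>,
    Sleep: FnMut(Duration) -> Fut,
    Fut: Future<Output = ()>,
{
    let mut attempt = 1;
    loop {
        match op() {
            Err(e) if e.raw_os_error() == Some(ENODEV) && attempt < max_attempts => {
                // The device may not be ready yet: await the delay rather
                // than blocking the thread, then try again.
                sleep(backoff).await;
                attempt += 1;
            }
            other => return other,
        }
    }
}

/// Waker that does nothing; enough for the busy-polling executor below.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor so the sketch runs without an external
/// async runtime. Illustration only; a real caller would use its own executor.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

A caller would pass the fallible operation as `op`, a timer-backed future factory as `sleep`, and `Duration::from_millis(250)` as `backoff` to mirror the delay mentioned above. Injecting the sleep also lets tests substitute `std::future::ready(())` so they do not actually wait.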
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| vmm_tests/vmm_tests/tests/tests/x86_64/openhcl_linux_direct.rs | Re-enables previously ignored multi-device NVMe servicing test |
| vm/devices/user_driver/vfio_sys/src/lib.rs | Converts open_device and set_keep_alive to async with retry logic for ENODEV handling; improves logging with PCI ID context |
| vm/devices/user_driver/vfio_sys/Cargo.toml | Adds pal_async dependency required for async timer functionality |
| vm/devices/user_driver/src/vfio.rs | Updates callers to use async open_device and set_keep_alive functions |
| openhcl/underhill_core/src/worker.rs | Adds debug logging after NVMe manager initialization |
| openhcl/underhill_core/src/nvme_manager/manager.rs | Adds info logging for manager startup parameters and debug logging before main loop |
| Cargo.lock | Reflects new pal_async dependency in vfio_sys package |