Fix rocprofiler-sdk async copy timeout on queue destruction #1591
Open
bwelton wants to merge 7 commits into develop from bewelton/rocprof-ci-testing
Add async_copy_sync() call to QueueController::destroy_queue() to ensure all pending async memory copy operations complete before destroying the queue. This prevents the 30-second timeout that occurs when callbacks are waiting on a destroyed queue. The fix ensures callbacks are delivered before queue teardown by syncing all async copies globally before individual queue destruction.
52f6332 to d563548
When registration finalization has already started, the early return path in async_copy_handler did not decrement the active signals counter before deleting the copy data. This caused async_copy_sync() to time out waiting for callbacks that would never fire. Add fetch_sub(1) before deletion to properly maintain the signal counter.
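The counter fix described above can be illustrated with a minimal sketch. It is not the rocprofiler-sdk code: `copy_data`, `register_copy`, `handle_copy`, and the global atomics are hypothetical stand-ins, and the only point shown is that every exit path of the handler, including the early return taken during finalization, decrements the counter exactly once.

```cpp
#include <atomic>
#include <memory>

// Hypothetical stand-in for the per-copy bookkeeping; not the SDK's type.
struct copy_data
{
    int id = 0;
};

std::atomic<int>  active_signals{0};  // handlers registered but not yet finished
std::atomic<bool> finalizing{false};  // set once finalization has begun
std::atomic<int>  delivered{0};       // callbacks actually delivered

// Registration side: count the handler before it can possibly run.
copy_data* register_copy(int id)
{
    active_signals.fetch_add(1);
    return new copy_data{id};
}

// Handler side: returns true if the callback was delivered.
bool handle_copy(copy_data* raw)
{
    std::unique_ptr<copy_data> data{raw};  // frees the copy data on every path
    if(finalizing.load())
    {
        // Early return: no callback is delivered, but the slot must still be
        // released, otherwise a waiter on active_signals never sees zero.
        active_signals.fetch_sub(1);
        return false;
    }
    delivered.fetch_add(1);
    active_signals.fetch_sub(1);
    return true;
}
```

Without the `fetch_sub(1)` in the early-return branch, a waiter that polls `active_signals` until it reaches zero would only stop when its timeout expires.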
Move async_copy_sync() call after queue->sync() to ensure queue operations complete before waiting for async copy callbacks. This prevents race conditions where callbacks may be delivered after the queue sync but before the async copy sync completes.
This commit implements a more granular synchronization mechanism for async memory copy operations to prevent premature signal destruction while async handlers are still executing.

Changes:
- Add pending_signal_registry to track signals with active async handlers
- Register each signal before installing its async handler callback
- Unregister signals when handlers complete
- Intercept hsa_signal_destroy to wait for pending handlers
- Remove global async_copy_sync call from queue destruction

The new approach eliminates the race condition where signals could be destroyed before their completion callbacks were delivered, which was causing timeout issues in CI tests.

Implementation details:
- pending_signal_registry uses Synchronized<> for thread-safe access
- Each async_copy_data includes a completion flag for wait synchronization
- hsa_signal_destroy interception added in hsa_api_impl::functor
- Signal registration happens BEFORE hsa_amd_signal_async_handler_fn to avoid race where handler completes before registration
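A rough sketch of the registry idea follows. It assumes a plain `std::mutex` and `std::condition_variable` in place of the `Synchronized<>` wrapper named in the commit, uses a bare 64-bit handle instead of `hsa_signal_t`, and the class and method names (`pending_registry`, `add`, `remove`, `wait_for`, `is_pending`) are hypothetical.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <unordered_set>

// Hypothetical registry of signal handles that still have a handler pending.
class pending_registry
{
public:
    // Call BEFORE installing the async handler, so a handler that completes
    // immediately cannot run ahead of its own registration.
    void add(std::uint64_t handle)
    {
        std::lock_guard<std::mutex> lk{m_mutex};
        m_pending.insert(handle);
    }

    // Call at the end of the handler.
    void remove(std::uint64_t handle)
    {
        {
            std::lock_guard<std::mutex> lk{m_mutex};
            m_pending.erase(handle);
        }
        m_cv.notify_all();
    }

    // Call from the intercepted destroy path: blocks until no handler is
    // pending for this handle, and returns at once if none was registered.
    void wait_for(std::uint64_t handle)
    {
        std::unique_lock<std::mutex> lk{m_mutex};
        m_cv.wait(lk, [&] { return m_pending.count(handle) == 0; });
    }

    bool is_pending(std::uint64_t handle)
    {
        std::lock_guard<std::mutex> lk{m_mutex};
        return m_pending.count(handle) != 0;
    }

private:
    std::mutex                        m_mutex;
    std::condition_variable           m_cv;
    std::unordered_set<std::uint64_t> m_pending;
};
```

Waiting per handle, rather than syncing every outstanding copy globally, means destroying one signal only blocks on that signal's own handler.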
bwelton commented Oct 31, 2025
    }

    // Check if signal has pending handler and wait if needed
    void wait_for_signal(hsa_signal_t signal)
const
…race

Store HSA function pointers in async_copy_data structure at registration time instead of looking them up from tables during handler execution. This prevents null pointer crashes when handlers are invoked during shutdown after static table objects have been destroyed.

Changes:
- Add 4 function pointer fields to async_copy_data
- Capture function pointers at handler registration time
- Use captured pointers in async_copy_handler instead of table lookups
- Add defensive null check in active_signals::fetch_sub

This eliminates the TOCTOU race condition and ensures handlers can execute safely even during shutdown sequence.
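The capture-at-registration pattern can be sketched as below. None of these names come from the SDK: `api_table`, `copy_record`, `make_record`, `run_handler`, and `fake_load` are hypothetical, and a single function pointer stands in for the four mentioned in the commit.

```cpp
#include <cstdint>

// Hypothetical dispatch table whose storage may be torn down at shutdown.
struct api_table
{
    std::int64_t (*load_value)(std::uint64_t) = nullptr;
};

// Hypothetical per-copy record that carries its own copy of the pointer.
struct copy_record
{
    std::uint64_t signal = 0;
    std::int64_t (*load_value)(std::uint64_t) = nullptr;  // captured at registration
};

// Stand-in for a real API entry point.
std::int64_t fake_load(std::uint64_t handle)
{
    return static_cast<std::int64_t>(handle) + 1;
}

// Registration: read the table once, while it is known to be alive.
copy_record make_record(const api_table& table, std::uint64_t signal)
{
    copy_record rec;
    rec.signal     = signal;
    rec.load_value = table.load_value;
    return rec;
}

// Handler: uses only the captured pointer and never touches the table again.
bool run_handler(const copy_record& rec, std::int64_t& out)
{
    if(rec.load_value == nullptr) return false;  // defensive check
    out = rec.load_value(rec.signal);
    return true;
}
```

Because the handler reads only its own record, the state of the table at the time the handler runs no longer matters, which is what removes the check-then-use window.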
…tirement

Call async_copy_sync() at the start of async_copy_fini() before setting the finalization flag. This ensures async copy handlers complete normally and properly retire their correlation IDs through the standard code path.

Changes:
- Move async_copy_sync() call to beginning of async_copy_fini()
- Set finalization flag after async copies complete
- Reduce grace period from 100ms to 10ms (handlers already completed)

This fixes the kernel-tracing-validate correlation ID mismatch where 3 correlation IDs were not being retired (145958 vs 145961 expected).
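The ordering in this commit, drain first and only then set the finalization flag, can be sketched as follows. This is a simplified model under stated assumptions: `in_flight`, `fini_started`, `retired`, `handler`, `sync_copies`, and `fini` are hypothetical names, and the one-second timeout is an arbitrary value chosen for the sketch.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int>  in_flight{0};         // handlers that have not finished yet
std::atomic<bool> fini_started{false};  // finalization flag
std::atomic<int>  retired{0};           // correlation IDs retired on the normal path

// Hypothetical handler: retires its ID only if finalization has not begun.
void handler()
{
    if(!fini_started.load()) retired.fetch_add(1);
    in_flight.fetch_sub(1);
}

// Wait until no handler is in flight, or give up after the timeout.
bool sync_copies(std::chrono::milliseconds timeout)
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while(in_flight.load() > 0)
    {
        if(std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::yield();
    }
    return true;
}

// Finalization: drain first, then set the flag, so every handler that was
// already in flight takes the normal path and retires its ID.
bool fini()
{
    const bool drained = sync_copies(std::chrono::milliseconds{1000});
    fini_started.store(true);
    return drained;
}
```

If the flag were set before the drain, any handler still in flight would take the early path and skip retirement, which is the kind of mismatch the commit message reports.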
Problem
rocprofiler-sdk buffered-api-tracing tests were timing out after 30 seconds waiting for async memory copy callbacks. The root cause was that hipDeviceReset() destroys queues via reference counting, which removes queues from active processing before callbacks are delivered.

Root Cause
When hipDeviceReset() is called:
- QueueController::destroy_queue() is invoked for each queue
- async_copy_fini() times out waiting for callbacks (30 seconds)

Solution
Add an async_copy_sync() call to QueueController::destroy_queue() to ensure all pending async memory copy operations complete before destroying any queue.

Changes
- Add #include "lib/rocprofiler-sdk/hsa/async_copy.hpp" to queue_controller.cpp
- Add async_copy_sync() call before queue destruction in QueueController::destroy_queue()

This minimal 2-line change ensures callbacks are delivered before queue teardown, preventing the 30-second timeout.