Skip to content

Conversation

@bwelton
Copy link
Contributor

@bwelton bwelton commented Oct 29, 2025

Problem

rocprofiler-sdk buffered-api-tracing tests were timing out after 30 seconds waiting for async memory copy callbacks. The root cause was that hipDeviceReset() destroys queues via reference counting, which removes queues from active processing before callbacks are delivered.

Root Cause

When hipDeviceReset() is called:

  1. QueueController::destroy_queue() is invoked for each queue
  2. Queues are removed from active processing
  3. Async memory copy callbacks may still be pending
  4. Callbacks never fire because queue is no longer active
  5. async_copy_fini() times out waiting for callbacks (30 seconds)

Solution

Add async_copy_sync() call to QueueController::destroy_queue() to ensure all pending async memory copy operations complete before destroying any queue.

Changes

  • Added #include "lib/rocprofiler-sdk/hsa/async_copy.hpp" to queue_controller.cpp
  • Added async_copy_sync() call before queue destruction in QueueController::destroy_queue()

This minimal 2-line change ensures callbacks are delivered before queue teardown, preventing the 30-second timeout.

Copilot AI review requested due to automatic review settings October 29, 2025 04:21

This comment was marked as outdated.

Add async_copy_sync() call to QueueController::destroy_queue() to ensure
all pending async memory copy operations complete before destroying the
queue. This prevents the 30-second timeout that occurs when callbacks
are waiting on a destroyed queue.

The fix ensures callbacks are delivered before queue teardown by syncing
all async copies globally before individual queue destruction.
@bwelton bwelton force-pushed the bewelton/rocprof-ci-testing branch from 52f6332 to d563548 Compare October 30, 2025 21:35
@bwelton bwelton changed the title Add debug instrumentation to rocprofiler-sdk CI test failures Fix rocprofiler-sdk async copy timeout on queue destruction Oct 30, 2025
When registration finalization has already started, the early return
path in async_copy_handler was not decrementing the active signals
counter before deleting the copy data. This causes async_copy_sync()
to timeout waiting for callbacks that will never fire.

Add fetch_sub(1) before deletion to properly maintain the signal counter.
Move async_copy_sync() call after queue->sync() to ensure queue operations
complete before waiting for async copy callbacks. This prevents race conditions
where callbacks may be delivered after the queue sync but before the async
copy sync completes.
This commit implements a more granular synchronization mechanism for
async memory copy operations to prevent premature signal destruction
while async handlers are still executing.

Changes:
- Add pending_signal_registry to track signals with active async handlers
- Register each signal before installing its async handler callback
- Unregister signals when handlers complete
- Intercept hsa_signal_destroy to wait for pending handlers
- Remove global async_copy_sync call from queue destruction

The new approach eliminates the race condition where signals could be
destroyed before their completion callbacks were delivered, which was
causing timeout issues in CI tests.

Implementation details:
- pending_signal_registry uses Synchronized<> for thread-safe access
- Each async_copy_data includes a completion flag for wait synchronization
- hsa_signal_destroy interception added in hsa_api_impl::functor
- Signal registration happens BEFORE hsa_amd_signal_async_handler_fn
  to avoid race where handler completes before registration
}

// Check if signal has pending handler and wait if needed
void wait_for_signal(hsa_signal_t signal)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

…race

Store HSA function pointers in async_copy_data structure at registration
time instead of looking them up from tables during handler execution.
This prevents null pointer crashes when handlers are invoked during
shutdown after static table objects have been destroyed.

Changes:
- Add 4 function pointer fields to async_copy_data
- Capture function pointers at handler registration time
- Use captured pointers in async_copy_handler instead of table lookups
- Add defensive null check in active_signals::fetch_sub

This eliminates the TOCTOU race condition and ensures handlers can
execute safely even during shutdown sequence.
…tirement

Call async_copy_sync() at the start of async_copy_fini() before setting
the finalization flag. This ensures async copy handlers complete normally
and properly retire their correlation IDs through the standard code path.

Changes:
- Move async_copy_sync() call to beginning of async_copy_fini()
- Set finalization flag after async copies complete
- Reduce grace period from 100ms to 10ms (handlers already completed)

This fixes the kernel-tracing-validate correlation ID mismatch where
3 correlation IDs were not being retired (145958 vs 145961 expected).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants