michal-shalev commented Nov 3, 2025

What?

Remove redundant UCP_EP_FLAG_REMOTE_CONNECTED check from ucp_device_mem_list_create().

Why?

The check is redundant with the more accurate ucp_wireup_ep_test() performed later in ucp_device_mem_list_create_handle(). Additionally, the endpoint flag check is not appropriate for cuda_ipc where lanes may be ready without the full endpoint being marked as remote connected.

How?

The per-lane ucp_wireup_ep_test() check at line 382 is sufficient to detect when lanes are not ready, making the early endpoint flag check unnecessary.
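
The reasoning above can be illustrated with a small, self-contained sketch. It does not use UCX: `SketchEndpoint`, `SketchLane`, `check_by_flag()` and `check_by_lanes()` are hypothetical stand-ins for the endpoint-wide flag and the per-lane check, and the real data structures are certainly more involved. It models the cuda_ipc-like case where every lane is usable although the endpoint-wide flag is not yet set.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not UCX types.
enum class LaneStatus { Ok, NotConnected };

struct SketchLane {
    bool wireup_pending; // models a lane that the per-lane check reports as not ready
};

struct SketchEndpoint {
    bool remote_connected;         // models the endpoint-wide "remote connected" flag
    std::vector<SketchLane> lanes; // models the lanes used by the memory list
};

// Old behaviour (sketch): reject early based on the endpoint-wide flag.
inline LaneStatus check_by_flag(const SketchEndpoint& ep)
{
    return ep.remote_connected ? LaneStatus::Ok : LaneStatus::NotConnected;
}

// New behaviour (sketch): only the per-lane check decides readiness.
inline LaneStatus check_by_lanes(const SketchEndpoint& ep)
{
    for (const SketchLane& lane : ep.lanes) {
        if (lane.wireup_pending) {
            return LaneStatus::NotConnected;
        }
    }
    return LaneStatus::Ok;
}

inline void demo_lane_check()
{
    // Flag not set, but both lanes are ready: the flag check would reject
    // this endpoint, while the per-lane check accepts it.
    SketchEndpoint ipc_like{false, {{false}, {false}}};
    assert(check_by_flag(ipc_like) == LaneStatus::NotConnected);
    assert(check_by_lanes(ipc_like) == LaneStatus::Ok);
}
```

In this model the per-lane check still returns "not connected" whenever any lane is pending, which is why the early flag check adds nothing beyond it.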

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of transient "not connected" states so memory operations proceed through validation and retry rather than failing immediately, reducing spurious connectivity errors and improving robustness during connection setup.
  • Tests
    • Updated tests to use a retry/progress approach for connection-dependent operations to better reflect real-world transient connectivity behavior.

coderabbitai bot commented Nov 4, 2025

Walkthrough

Removed an early connectivity guard in ucp_device_mem_list_create; moved parameter validation earlier. Added retry/progress loops where callers previously asserted on UCS_ERR_NOT_CONNECTED, specifically in the CUDA perf tool source and a unit test, to retry creation until it succeeds or returns an error other than UCS_ERR_NOT_CONNECTED.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core: early-guard removal (src/ucp/core/ucp_device.c) | Removed early UCS_ERR_NOT_CONNECTED guard from ucp_device_mem_list_create, so the function proceeds to parameter validation and deeper calls without an early connectivity return. |
| CUDA: retry loop on UCS_ERR_NOT_CONNECTED (src/tools/perf/cuda/ucp_cuda_kernel.cu) | Replaced single-call ucp_device_mem_list_create with a retry loop that calls worker progress on UCS_ERR_NOT_CONNECTED and retries until success or a different error; throws runtime_error if final status != UCS_OK. |
| Tests: retry in mem_list construction (test/gtest/ucp/test_ucp_device.cc) | Removed separate pre-connection/send-receive wait; changed mem_list construction to retry ucp_device_mem_list_create while calling sender.progress() and receiver.progress() on UCS_ERR_NOT_CONNECTED, until a status other than UCS_ERR_NOT_CONNECTED is returned. |
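
The caller-side retry pattern summarized above can be sketched without UCX as follows. `RetryStatus` and `create_with_retry()` are hypothetical names introduced for illustration; in the PR the status type is ucs_status_t, the create call is ucp_device_mem_list_create, and progress is ucp_worker_progress or the test entities' progress().

```cpp
#include <cassert>
#include <functional>

// Stand-in status codes (assumption); the real code uses ucs_status_t.
enum class RetryStatus { Ok, NotConnected, InvalidParam };

// Calls `create` until it returns anything other than NotConnected,
// calling `progress` after each NotConnected result.
inline RetryStatus create_with_retry(const std::function<RetryStatus()>& create,
                                     const std::function<void()>& progress)
{
    RetryStatus status;
    do {
        status = create();
        if (status == RetryStatus::NotConnected) {
            progress(); // let the connection setup advance before retrying
        }
    } while (status == RetryStatus::NotConnected);
    return status;
}

inline void demo_retry()
{
    int progressed = 0;
    // Hypothetical create call that succeeds after three progress calls.
    auto create = [&] {
        return progressed >= 3 ? RetryStatus::Ok : RetryStatus::NotConnected;
    };
    auto progress = [&] { ++progressed; };

    assert(create_with_retry(create, progress) == RetryStatus::Ok);
    assert(progressed == 3);
}
```

Any status other than NotConnected ends the loop immediately, so real errors are returned to the caller rather than retried. The loop has no upper bound on iterations; a caller concerned about busy-looping might add a timeout, which this sketch deliberately leaves out.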

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant ucp_device_mem_list_create
    participant param_validation
    participant deeper_calls

    rect rgb(250, 240, 245)
    Note over Caller,ucp_device_mem_list_create: Before (old flow)
    Caller->>ucp_device_mem_list_create: request
    ucp_device_mem_list_create->>ucp_device_mem_list_create: early check: connected?
    ucp_device_mem_list_create-->>Caller: UCS_ERR_NOT_CONNECTED (early return)
    end

    rect rgb(240, 250, 240)
    Note over Caller,deeper_calls: After (new flow)
    Caller->>ucp_device_mem_list_create: request
    ucp_device_mem_list_create->>param_validation: validate parameters
    param_validation-->>ucp_device_mem_list_create: OK
    ucp_device_mem_list_create->>deeper_calls: call deeper ops
    deeper_calls-->>ucp_device_mem_list_create: UCS_ERR_NOT_CONNECTED / OK / other
    ucp_device_mem_list_create-->>Caller: Result
    end
sequenceDiagram
    participant Caller (CUDA / Test)
    participant RetryLoop
    participant WorkerProgress
    participant ucp_device_mem_list_create

    Note over Caller,RetryLoop: New caller-side retry behavior
    Caller->>RetryLoop: attempt create
    RetryLoop->>ucp_device_mem_list_create: create
    alt returns UCS_OK
        ucp_device_mem_list_create-->>RetryLoop: UCS_OK
        RetryLoop-->>Caller: success
    else returns UCS_ERR_NOT_CONNECTED
        ucp_device_mem_list_create-->>RetryLoop: UCS_ERR_NOT_CONNECTED
        RetryLoop->>WorkerProgress: call progress (sender/receiver or worker)
        WorkerProgress-->>RetryLoop: progressed
        RetryLoop->>ucp_device_mem_list_create: retry create
    else returns other error
        ucp_device_mem_list_create-->>RetryLoop: error
        RetryLoop-->>Caller: fail / throw
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • Review src/ucp/core/ucp_device.c to ensure deferred connectivity checks don't invalidate parameter validation assumptions.
  • Inspect retry logic in src/tools/perf/cuda/ucp_cuda_kernel.cu for potential busy-looping and correct progress invocation/context.
  • Verify test changes in test/gtest/ucp/test_ucp_device.cc correctly simulate transient not-connected states and do not mask real failures.

Poem

I’m a rabbit in code, nibbling guards away,
Validation first, then deeper play.
Retry hops in loops, progress makes it right,
From early stops to patient flight. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: removing a redundant UCP_EP_FLAG_REMOTE_CONNECTED check from the ucp_device_mem_list_create function, which is the primary modification across all three files. |

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/tools/perf/cuda/ucp_cuda_kernel.cu (1)

155-166: Retry logic correctly handles transient NOT_CONNECTED states.

The do-while loop appropriately handles UCS_ERR_NOT_CONNECTED by progressing the worker and retrying until the connection is established or a different error occurs.

Consider including the status string in the error message for better diagnostics:

 if (status != UCS_OK) {
-    throw std::runtime_error("Failed to create memory list");
+    throw std::runtime_error(std::string("Failed to create memory list: ") + 
+                             ucs_status_string(status));
 }
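
A self-contained sketch of the suggested diagnostic follows. `MockStatus` and `mock_status_string()` are hypothetical stand-ins for ucs_status_t and ucs_status_string(), and the message strings are placeholders rather than the strings UCX actually returns.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in status codes (assumption); the real code uses ucs_status_t.
enum class MockStatus { Ok, NotConnected, InvalidParam };

// Hypothetical stand-in for ucs_status_string(); the texts are placeholders.
inline const char* mock_status_string(MockStatus status)
{
    switch (status) {
    case MockStatus::Ok:           return "Success";
    case MockStatus::NotConnected: return "Not connected";
    case MockStatus::InvalidParam: return "Invalid parameter";
    }
    return "Unknown";
}

// Throws with the status text appended, so the failure reason is visible
// in the exception message instead of a generic "failed" string.
inline void throw_if_failed(MockStatus status)
{
    if (status != MockStatus::Ok) {
        throw std::runtime_error(std::string("Failed to create memory list: ") +
                                 mock_status_string(status));
    }
}

inline void demo_error_message()
{
    try {
        throw_if_failed(MockStatus::InvalidParam);
        assert(false); // not reached: the call above throws
    } catch (const std::runtime_error& e) {
        assert(std::string(e.what()) ==
               "Failed to create memory list: Invalid parameter");
    }
}
```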
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00dcd40 and 1052c73.

📒 Files selected for processing (2)
  • src/tools/perf/cuda/ucp_cuda_kernel.cu (1 hunks)
  • test/gtest/ucp/test_ucp_device.cc (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
test/gtest/ucp/test_ucp_device.cc (1)
src/ucp/core/ucp_device.c (1)
  • ucp_device_mem_list_create (537-607)
src/tools/perf/cuda/ucp_cuda_kernel.cu (2)
src/ucp/core/ucp_device.c (1)
  • ucp_device_mem_list_create (537-607)
src/ucp/core/ucp_worker.c (1)
  • ucp_worker_progress (3060-3080)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: UCX PR (Static_check Static checks)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (1)
test/gtest/ucp/test_ucp_device.cc (1)

147-157: Retry logic correctly handles transient NOT_CONNECTED states in test context.

The do-while loop appropriately handles UCS_ERR_NOT_CONNECTED by progressing both sender and receiver entities before retrying. The early break optimization and the descriptive comment make the intent clear.

Comment on lines +154 to +155
sender.progress();
receiver.progress();

can use ucp_test::progress()

Comment on lines +148 to +156
ucs_status_t status;
do {
    status = ucp_device_mem_list_create(sender.ep(), &params, &m_mem_list_h);
    if (status != UCS_ERR_NOT_CONNECTED) {
        break;
    }
    sender.progress();
    receiver.progress();
} while (status == UCS_ERR_NOT_CONNECTED);

Suggested change
-ucs_status_t status;
-do {
-    status = ucp_device_mem_list_create(sender.ep(), &params, &m_mem_list_h);
-    if (status != UCS_ERR_NOT_CONNECTED) {
-        break;
-    }
-    sender.progress();
-    receiver.progress();
-} while (status == UCS_ERR_NOT_CONNECTED);
+ucs_status_t status;
+do {
+    progress();
+    status = ucp_device_mem_list_create(sender.ep(), &params, &m_mem_list_h);
+} while (status == UCS_ERR_NOT_CONNECTED);
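
To compare the two loop shapes in this suggestion, here is a small sketch with a hypothetical `CountingEndpoint` whose create() succeeds once progress() has been called a given number of times. None of these names come from UCX or the test suite; they only count calls.

```cpp
#include <cassert>

enum class LoopStatus { Ok, NotConnected };

// Hypothetical endpoint: create() succeeds once progress() has run `needed` times.
struct CountingEndpoint {
    int needed;
    int progress_calls = 0;
    int create_calls   = 0;

    explicit CountingEndpoint(int n) : needed(n) {}

    void progress() { ++progress_calls; }

    LoopStatus create()
    {
        ++create_calls;
        return (progress_calls >= needed) ? LoopStatus::Ok
                                          : LoopStatus::NotConnected;
    }
};

// Original shape: try first, progress only after a NotConnected result.
inline LoopStatus retry_create_first(CountingEndpoint& ep)
{
    LoopStatus status;
    do {
        status = ep.create();
        if (status != LoopStatus::NotConnected) {
            break;
        }
        ep.progress();
    } while (status == LoopStatus::NotConnected);
    return status;
}

// Suggested shape: progress unconditionally, then try.
inline LoopStatus retry_progress_first(CountingEndpoint& ep)
{
    LoopStatus status;
    do {
        ep.progress();
        status = ep.create();
    } while (status == LoopStatus::NotConnected);
    return status;
}

inline void demo_loop_shapes()
{
    // Endpoint that is ready immediately: the shorter loop still
    // progresses once, the original loop does not progress at all.
    CountingEndpoint a(0), b(0);
    assert(retry_create_first(a) == LoopStatus::Ok && a.progress_calls == 0);
    assert(retry_progress_first(b) == LoopStatus::Ok && b.progress_calls == 1);
}
```

Both shapes end with the same status. In this model the shorter loop costs one extra progress call when the endpoint is already ready and saves one create call when it is not.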

Comment on lines +155 to +162
ucs_status_t status;
do {
    status = ucp_device_mem_list_create(perf.ucp.ep, &params,
                                        &m_params.mem_list);
    if (status == UCS_ERR_NOT_CONNECTED) {
        ucp_worker_progress(perf.ucp.worker);
    }
} while (status == UCS_ERR_NOT_CONNECTED);

Suggested change
-ucs_status_t status;
-do {
-    status = ucp_device_mem_list_create(perf.ucp.ep, &params,
-                                        &m_params.mem_list);
-    if (status == UCS_ERR_NOT_CONNECTED) {
-        ucp_worker_progress(perf.ucp.worker);
-    }
-} while (status == UCS_ERR_NOT_CONNECTED);
+ucs_status_t status;
+do {
+    ucp_worker_progress(perf.ucp.worker);
+    status = ucp_device_mem_list_create(perf.ucp.ep, &params,
+                                        &m_params.mem_list);
+} while (status == UCS_ERR_NOT_CONNECTED);
