-
Notifications
You must be signed in to change notification settings - Fork 491
UCP/DEVICE: Remove redundant UCP_EP_FLAG_REMOTE_CONNECTED check #10985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
UCP/DEVICE: Remove redundant UCP_EP_FLAG_REMOTE_CONNECTED check #10985
Conversation
WalkthroughRemoved an early connectivity guard in Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant ucp_device_mem_list_create
participant param_validation
participant deeper_calls
rect rgb(250, 240, 245)
Note over Caller,ucp_device_mem_list_create: Before (old flow)
Caller->>ucp_device_mem_list_create: request
ucp_device_mem_list_create->>ucp_device_mem_list_create: early check: connected?
ucp_device_mem_list_create-->>Caller: UCS_ERR_NOT_CONNECTED (early return)
end
rect rgb(240, 250, 240)
Note over Caller,deeper_calls: After (new flow)
Caller->>ucp_device_mem_list_create: request
ucp_device_mem_list_create->>param_validation: validate parameters
param_validation-->>ucp_device_mem_list_create: OK
ucp_device_mem_list_create->>deeper_calls: call deeper ops
deeper_calls-->>ucp_device_mem_list_create: UCS_ERR_NOT_CONNECTED / OK / other
ucp_device_mem_list_create-->>Caller: Result
end
sequenceDiagram
participant Caller (CUDA / Test)
participant RetryLoop
participant WorkerProgress
participant ucp_device_mem_list_create
Note over Caller,RetryLoop: New caller-side retry behavior
Caller->>RetryLoop: attempt create
RetryLoop->>ucp_device_mem_list_create: create
alt returns UCS_OK
ucp_device_mem_list_create-->>RetryLoop: UCS_OK
RetryLoop-->>Caller: success
else returns UCS_ERR_NOT_CONNECTED
ucp_device_mem_list_create-->>RetryLoop: UCS_ERR_NOT_CONNECTED
RetryLoop->>WorkerProgress: call progress (sender/receiver or worker)
WorkerProgress-->>RetryLoop: progressed
RetryLoop->>ucp_device_mem_list_create: retry create
else returns other error
ucp_device_mem_list_create-->>RetryLoop: error
RetryLoop-->>Caller: fail / throw
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20–30 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/tools/perf/cuda/ucp_cuda_kernel.cu (1)
155-166: Retry logic correctly handles transient NOT_CONNECTED states.The do-while loop appropriately handles
UCS_ERR_NOT_CONNECTEDby progressing the worker and retrying until the connection is established or a different error occurs.Consider including the status string in the error message for better diagnostics:
if (status != UCS_OK) { - throw std::runtime_error("Failed to create memory list"); + throw std::runtime_error(std::string("Failed to create memory list: ") + + ucs_status_string(status)); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/tools/perf/cuda/ucp_cuda_kernel.cu(1 hunks)test/gtest/ucp/test_ucp_device.cc(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
test/gtest/ucp/test_ucp_device.cc (1)
src/ucp/core/ucp_device.c (1)
ucp_device_mem_list_create(537-607)
src/tools/perf/cuda/ucp_cuda_kernel.cu (2)
src/ucp/core/ucp_device.c (1)
ucp_device_mem_list_create(537-607)src/ucp/core/ucp_worker.c (1)
ucp_worker_progress(3060-3080)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: UCX PR (Static_check Static checks)
- GitHub Check: UCX PR (Codestyle ctags check)
- GitHub Check: UCX PR (Codestyle AUTHORS file update check)
- GitHub Check: UCX PR (Codestyle codespell check)
- GitHub Check: UCX PR (Codestyle format code)
- GitHub Check: UCX PR (Codestyle commit title)
- GitHub Check: UCX release DRP (Prepare CheckRelease)
- GitHub Check: UCX release (Prepare CheckRelease)
- GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (1)
test/gtest/ucp/test_ucp_device.cc (1)
147-157: Retry logic correctly handles transient NOT_CONNECTED states in test context.The do-while loop appropriately handles
UCS_ERR_NOT_CONNECTEDby progressing both sender and receiver entities before retrying. The early break optimization and the descriptive comment make the intent clear.
| sender.progress(); | ||
| receiver.progress(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can use ucp_test::progress()
| ucs_status_t status; | ||
| do { | ||
| status = ucp_device_mem_list_create(sender.ep(), ¶ms, &m_mem_list_h); | ||
| if (status != UCS_ERR_NOT_CONNECTED) { | ||
| break; | ||
| } | ||
| sender.progress(); | ||
| receiver.progress(); | ||
| } while (status == UCS_ERR_NOT_CONNECTED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ucs_status_t status; | |
| do { | |
| status = ucp_device_mem_list_create(sender.ep(), ¶ms, &m_mem_list_h); | |
| if (status != UCS_ERR_NOT_CONNECTED) { | |
| break; | |
| } | |
| sender.progress(); | |
| receiver.progress(); | |
| } while (status == UCS_ERR_NOT_CONNECTED); | |
| ucs_status_t status; | |
| do { | |
| progress(); | |
| status = ucp_device_mem_list_create(sender.ep(), ¶ms, &m_mem_list_h); | |
| } while (status == UCS_ERR_NOT_CONNECTED); |
| ucs_status_t status; | ||
| do { | ||
| status = ucp_device_mem_list_create(perf.ucp.ep, ¶ms, | ||
| &m_params.mem_list); | ||
| if (status == UCS_ERR_NOT_CONNECTED) { | ||
| ucp_worker_progress(perf.ucp.worker); | ||
| } | ||
| } while (status == UCS_ERR_NOT_CONNECTED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ucs_status_t status; | |
| do { | |
| status = ucp_device_mem_list_create(perf.ucp.ep, ¶ms, | |
| &m_params.mem_list); | |
| if (status == UCS_ERR_NOT_CONNECTED) { | |
| ucp_worker_progress(perf.ucp.worker); | |
| } | |
| } while (status == UCS_ERR_NOT_CONNECTED); | |
| ucs_status_t status; | |
| do { | |
| ucp_worker_progress(perf.ucp.worker); | |
| status = ucp_device_mem_list_create(perf.ucp.ep, ¶ms, | |
| &m_params.mem_list); | |
| } while (status == UCS_ERR_NOT_CONNECTED); |
What?
Remove redundant
UCP_EP_FLAG_REMOTE_CONNECTEDcheck fromucp_device_mem_list_create().Why?
The check is redundant with the more accurate
ucp_wireup_ep_test()performed later inucp_device_mem_list_create_handle(). Additionally, the endpoint flag check is not appropriate for cuda_ipc where lanes may be ready without the full endpoint being marked as remote connected.How?
The per-lane
ucp_wireup_ep_test()check at line 382 is sufficient to detect when lanes are not ready, making the early endpoint flag check unnecessary.Summary by CodeRabbit