-
Notifications
You must be signed in to change notification settings - Fork 143
Moving MG functions into unified API + raft::device_resources_snmg as device resource type for MG functions
#454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Moving MG functions into unified API + raft::device_resources_snmg as device resource type for MG functions
#454
Conversation
cpp/include/cuvs/neighbors/mg.hpp
Outdated
| * @return the constructed IVF-Flat MG index | ||
| */ | ||
| auto build(const raft::device_resources& handle, | ||
| auto build(const raft::device_resources_snmg& clique, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the main goals with having the "snmg" resources object match the single-gpu object is that we wanted to be able to remove this additional MG. We should now be able to accept device_resources and then check to see if a nccl clique has been set on it (which would imply that it's a multi-gpu resources object and not a single-gpu object). The whole goal with doing this was to consolidate code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NCCL clique is not set as a resource anymore, but we should still be able to implement the dispatching by checking the dynamic type of the device_resources. The real question then is, do we truly want dispatching on both the regular API (cuvs::neighbors::build) and the mg namespace (cuvs::neighbors::mg::build)? It kind of make sense that a user providing a device_resources_snmg instance to the regular API (cuvs::neighbors::build) would want things to be deployed on multiple GPUs. However, the reverse is not necessarily true. A user who explicitly chose the mg namespace, but did not provide the adequate device_resources_snmg would fallback to single GPU, potentially unintentionally. Is this what we want?
I propose that we implement the dispatching mechanism solely on the regular API (cuvs::neighbors::build) in a dedicated follow-up PR? This also allows the MG doc to explicitly avert users that they should use an adequate device_resources_snmg to use the MG API. What do you think?
|
Not sure why CI is still failing. The RAFT PR was merged quite awhile ago but I'm still seeing errors like this: |
|
This fell back to old 24.12 RAFT packages for unclear reasons: |
Oh fooey! Thanks for noticing that @bdice, I just realized the darn target branch was never updated! |
6ae1c70 to
b0c7ab9
Compare
raft::device_resources_snmg as device resource type for MG functions
|
/ok to test |
@cjnolet, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test |
@cjnolet, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test da48376 |
|
/merge |
|
/ok to test da48 |
|
/ok to test da48376 |
@cjnolet, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
/ok to test d2e2be9 |
|
/merge |
92ced86 to
1e1d97e
Compare
|
/ok to test |
@viclafargue, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test |
@achirkin, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test 1e1d97e |
|
/ok to test 1e1d97e |
Co-authored-by: Artem M. Chirkin <[email protected]>
|
/ok to test 34d67ee |
|
/merge |
raft::device_resources_snmgas device resource type for MG functionsraft::comms::build_comms_nccl_onlysolving [FEA] Remove dependency ofbuild_comms_nccl_onlyon UCP raft#2465raft::device_resources_snmgis introduced in this PR : rapidsai/raft#2549