[core][gpu objects] Integrate single-controller collective APIs with GPU objects #53720
Conversation
def gpu_object_manager(self) -> "ray._private.gpu_object_manager.GPUObjectManager":
    if self._gpu_object_manager is None:
        from ray._private.gpu_object_manager import GPUObjectManager
        self._gpu_object_manager = GPUObjectManager()
why's this made to be lazy?
Ah, this is to avoid pulling in dependencies needed by GPUObjectManager that aren't usually required by Ray (currently torch).
Got it. It would be nice if we came up with a more structured way to quarantine soft dependencies so we don't need lazy imports for first-party code. I'll play around with it at some point.
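For context, a minimal sketch of what a more structured soft-dependency helper might look like; the `soft_import` helper and the `WorkerSketch` class below are hypothetical illustrations, not Ray's actual internals.

```python
# Hypothetical sketch: a small helper that centralizes optional ("soft")
# dependency imports so first-party code doesn't need ad-hoc lazy imports.
import importlib
from typing import Any, Optional


def soft_import(module_name: str, feature: str) -> Any:
    """Import an optional dependency, raising a clear error if it's missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'. "
            "Please install it to use this feature."
        ) from e


class WorkerSketch:
    """Illustrative stand-in for the worker that owns the GPU object manager."""

    def __init__(self) -> None:
        self._gpu_object_manager: Optional[Any] = None

    @property
    def gpu_object_manager(self) -> Any:
        # The torch-dependent import is deferred until the feature is first used.
        if self._gpu_object_manager is None:
            soft_import("torch", "GPU objects")
            from ray._private.gpu_object_manager import GPUObjectManager

            self._gpu_object_manager = GPUObjectManager()
        return self._gpu_object_manager
```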
def test_p2p(ray_start_regular):
    # TODO(swang): Add tests for mocked NCCL that can run on CPU-only machines.
yes please!
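In the meantime, a hedged sketch of what a CPU-only variant could look like, using the GLOO backend instead of a mocked NCCL transport; the `tensor_transport` argument and the `backend` string below are assumptions based on this PR's description, not a confirmed test from the repo.

```python
# Hedged sketch: a p2p test that stays on CPU by using the GLOO backend.
# Decorator/argument names here are assumptions, not the repo's exact code.
import ray
import torch
from ray.experimental.collective import create_collective_group


@ray.remote
class CPUWorker:
    @ray.method(tensor_transport="gloo")  # assumed transport name
    def produce(self) -> torch.Tensor:
        # With GLOO, the tensor is allocated on CPU.
        return torch.tensor([1.0, 2.0, 3.0])

    def consume(self, tensor: torch.Tensor) -> float:
        return tensor.sum().item()


def test_p2p_cpu(ray_start_regular):
    sender, receiver = CPUWorker.remote(), CPUWorker.remote()
    create_collective_group([sender, receiver], backend="gloo")  # assumed backend string
    ref = sender.produce.remote()
    assert ray.get(receiver.consume.remote(ref)) == 6.0
```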
kevin85421 left a comment:
Looks great!
# Create test tensor
tensor = torch.tensor([1, 2, 3])
gpu_ref = src_actor.echo_cuda.remote(tensor)
what is echo_cuda?
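For illustration only, an `echo_cuda` method in a test like this would presumably be an actor method that moves the received tensor onto the GPU and returns it, so the result becomes a GPU object. A hypothetical sketch, not the test's actual code:

```python
# Hypothetical sketch of an echo_cuda-style actor method; not the repo's code.
import ray
import torch


@ray.remote(num_gpus=1)
class GPUActor:
    @ray.method(tensor_transport="nccl")  # assumed decorator from this PR
    def echo_cuda(self, tensor: torch.Tensor) -> torch.Tensor:
        # Copy the incoming CPU tensor onto the GPU and return it.
        return tensor.to("cuda")
```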
if not _TORCH_AVAILABLE:
    raise ImportError(
        "`tensor_transport` requires PyTorch. "
        "Please install torch with 'pip install torch' to use this feature."
If I remember correctly, `pip install torch` installs the CPU-only version of PyTorch. Maybe we should just ask users to install torch without giving a specific install command.
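For reference, an availability flag like the one referenced above typically follows a try/except import at module load time; a minimal sketch of that pattern (the helper name is hypothetical):

```python
# Minimal sketch of the availability-check pattern; helper name is hypothetical.
try:
    import torch  # noqa: F401

    _TORCH_AVAILABLE = True
except ImportError:
    _TORCH_AVAILABLE = False


def _require_torch_for_tensor_transport() -> None:
    if not _TORCH_AVAILABLE:
        raise ImportError(
            "`tensor_transport` requires PyTorch. "
            "Please install torch to use this feature."
        )
```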
Why are these changes needed?
Adds integration between the single-controller collective APIs introduced in #53319 and the GPU objects feature prototyped in #52938. Actor collectives created through `ray.experimental.collective.create_collective_group` will now be used automatically if a task declares a tensor transport other than the default OBJECT_STORE. This also adds support for allocating torch tensors on the correct device (GPU for NCCL and CPU for GLOO). See updates in test_gpu_objects.py for examples.
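A hedged end-to-end sketch of the workflow described above: create actors, register them in a collective group, and declare a non-default tensor transport on the producing method. The `tensor_transport` decorator argument and the exact backend string are assumptions based on this description; see test_gpu_objects.py for the authoritative examples.

```python
# Hedged sketch of the single-controller collective + GPU objects workflow.
# Decorator/argument names are assumptions based on the PR description.
import ray
import torch
from ray.experimental.collective import create_collective_group


@ray.remote(num_gpus=1)
class TorchActor:
    @ray.method(tensor_transport="nccl")  # non-default transport => GPU object
    def produce(self) -> torch.Tensor:
        # With NCCL, the tensor is allocated on the GPU.
        return torch.ones(4, device="cuda")

    def consume(self, tensor: torch.Tensor) -> float:
        return tensor.sum().item()


sender, receiver = TorchActor.remote(), TorchActor.remote()

# The group created here is used automatically for any task that declares a
# tensor transport other than the default OBJECT_STORE.
create_collective_group([sender, receiver], backend="nccl")

ref = sender.produce.remote()
assert ray.get(receiver.consume.remote(ref)) == 4.0
```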
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.