Conversation

@stephanie-wang (Contributor) commented Jun 10, 2025

Why are these changes needed?

Adds integration between the single-controller collective APIs introduced in #53319 and the GPU objects feature prototyped in #52938. Actor collectives created through ray.experimental.collective.create_collective_group will now be used automatically if a task declares a tensor transport other than the default OBJECT_STORE. This also adds support for allocating torch tensors on the correct device (GPU for NCCL, CPU for GLOO).

See updates in test_gpu_objects.py for examples.
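
Here is a minimal sketch of the intended usage, assuming the decorator-style `tensor_transport` declaration exercised in test_gpu_objects.py (the `Worker` actor and its methods are illustrative, and exact argument names may differ from the final API):

```python
import ray
import torch
from ray.experimental.collective import create_collective_group


@ray.remote(num_gpus=1)
class Worker:
    # Declaring a non-default tensor transport means the returned tensor is
    # kept on the worker and sent over the actor collective (NCCL here)
    # instead of the object store.
    @ray.method(tensor_transport="nccl")
    def produce(self):
        return torch.ones(4, device="cuda")

    def consume(self, tensor: torch.Tensor) -> float:
        return tensor.sum().item()


sender, receiver = Worker.remote(), Worker.remote()
# Create the actor collective once; tasks that declare a tensor transport
# other than OBJECT_STORE will use it automatically.
create_collective_group([sender, receiver], backend="nccl")

ref = sender.produce.remote()
print(ray.get(receiver.consume.remote(ref)))  # 4.0
```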

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

stephanie-wang and others added 27 commits May 23, 2025 16:32
Comment on lines 501 to 504
def gpu_object_manager(self) -> "ray._private.gpu_object_manager.GPUObjectManager":
    if self._gpu_object_manager is None:
        from ray._private.gpu_object_manager import GPUObjectManager

        self._gpu_object_manager = GPUObjectManager()
Collaborator

why's this made to be lazy?

Contributor Author

Ah, this is to avoid pulling in dependencies needed by GPUObjectManager that ray doesn't usually require (currently torch).

Collaborator

Got it. Would be nice if we came up with a more structured way to quarantine soft dependencies so we don't need lazy imports for first party code. I'll play around with it at some point.
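
For reference, one generic way to quarantine a soft dependency behind a single helper instead of scattering lazy imports (a sketch only, not Ray's actual mechanism; `optional_import` is a hypothetical helper):

```python
import importlib
from types import ModuleType
from typing import Dict

_SOFT_DEPS: Dict[str, ModuleType] = {}


def optional_import(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on first use, with a friendly error."""
    if module_name not in _SOFT_DEPS:
        try:
            _SOFT_DEPS[module_name] = importlib.import_module(module_name)
        except ImportError as e:
            raise ImportError(
                f"{feature} requires the optional dependency '{module_name}'."
            ) from e
    return _SOFT_DEPS[module_name]


# Callers resolve the dependency at use time, not at module import time:
# torch = optional_import("torch", "GPU objects")
```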



def test_p2p(ray_start_regular):
    # TODO(swang): Add tests for mocked NCCL that can run on CPU-only machines.
Collaborator

yes please!

@stephanie-wang stephanie-wang enabled auto-merge (squash) June 11, 2025 23:44
@github-actions github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) Jun 11, 2025
@github-actions github-actions bot disabled auto-merge June 12, 2025 17:39
@kevin85421 (Member) left a comment

Looks great!


# Create test tensor
tensor = torch.tensor([1, 2, 3])
gpu_ref = src_actor.echo_cuda.remote(tensor)
Member

what is echo_cuda?
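
(For context, a plausible shape for such a test helper, purely illustrative and not necessarily what test_gpu_objects.py actually defines: an actor method that moves the input tensor to the GPU and returns it, so the result is stored as a GPU object.)

```python
import ray
import torch


@ray.remote(num_gpus=1)
class GPUActor:
    # Hypothetical: echo the input tensor back from CUDA memory, so the
    # returned value is kept on the GPU and sent over NCCL.
    @ray.method(tensor_transport="nccl")
    def echo_cuda(self, tensor: torch.Tensor) -> torch.Tensor:
        return tensor.to("cuda")
```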

if not _TORCH_AVAILABLE:
    raise ImportError(
        "`tensor_transport` requires PyTorch. "
        "Please install torch with 'pip install torch' to use this feature."
    )
Member

As I recall, `pip install torch` installs the CPU-only version of PyTorch. Maybe we can just ask users to install torch without suggesting a specific install command.

@stephanie-wang stephanie-wang enabled auto-merge (squash) June 16, 2025 21:37
@stephanie-wang stephanie-wang merged commit 93acaf1 into ray-project:master Jun 16, 2025
5 of 6 checks passed
@stephanie-wang stephanie-wang deleted the gpu-object-collective-integration branch June 17, 2025 01:16
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
…GPU objects (#53720)

Signed-off-by: Stephanie wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
…GPU objects (ray-project#53720)
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
…GPU objects (#53720)
