Conversation

@stephanie-wang (Contributor) commented Jun 10, 2025

Why are these changes needed?

Adds integration between the single-controller collective APIs introduced in #53319 and the GPU objects feature prototyped in #52938. Actor collectives created through ray.experimental.collective.create_collective_group will now be used automatically if a task declares a tensor transport other than the default OBJECT_STORE. This also adds support for allocating torch tensors on the correct device (GPU for NCCL, CPU for GLOO).

See updates in test_gpu_objects.py for examples.
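
Here is a minimal sketch of the intended usage, assuming the decorator-style `tensor_transport` declaration exercised in test_gpu_objects.py (the `Worker` actor and its methods are illustrative, and exact argument names may differ from the final API):

```python
import ray
import torch
from ray.experimental.collective import create_collective_group


@ray.remote(num_gpus=1)
class Worker:
    # Declaring a non-default tensor transport means the returned tensor is
    # kept on the worker and sent over the actor collective (NCCL here)
    # instead of the object store.
    @ray.method(tensor_transport="nccl")
    def produce(self):
        return torch.ones(4, device="cuda")

    def consume(self, tensor: torch.Tensor) -> float:
        return tensor.sum().item()


sender, receiver = Worker.remote(), Worker.remote()
# Create the actor collective once; tasks that declare a tensor transport
# other than OBJECT_STORE will use it automatically.
create_collective_group([sender, receiver], backend="nccl")

ref = sender.produce.remote()
print(ray.get(receiver.consume.remote(ref)))  # 4.0
```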

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

stephanie-wang and others added 27 commits May 23, 2025 16:32
Comment on lines 501 to 504
def gpu_object_manager(self) -> "ray._private.gpu_object_manager.GPUObjectManager":
    if self._gpu_object_manager is None:
        from ray._private.gpu_object_manager import GPUObjectManager

        self._gpu_object_manager = GPUObjectManager()
Collaborator

why's this made to be lazy?

Contributor Author

Ah, this is to avoid pulling in dependencies needed by GPUObjectManager that ray doesn't usually require (currently torch).

Collaborator

Got it. Would be nice if we came up with a more structured way to quarantine soft dependencies so we don't need lazy imports for first party code. I'll play around with it at some point.
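
For reference, one generic way to quarantine a soft dependency behind a single helper instead of scattering lazy imports (a sketch only, not Ray's actual mechanism; `optional_import` is a hypothetical helper):

```python
import importlib
from types import ModuleType
from typing import Dict

_SOFT_DEPS: Dict[str, ModuleType] = {}


def optional_import(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on first use, with a friendly error."""
    if module_name not in _SOFT_DEPS:
        try:
            _SOFT_DEPS[module_name] = importlib.import_module(module_name)
        except ImportError as e:
            raise ImportError(
                f"{feature} requires the optional dependency '{module_name}'."
            ) from e
    return _SOFT_DEPS[module_name]


# Callers resolve the dependency at use time, not at module import time:
# torch = optional_import("torch", "GPU objects")
```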



def test_p2p(ray_start_regular):
    # TODO(swang): Add tests for mocked NCCL that can run on CPU-only machines.
Collaborator

yes please!

@stephanie-wang stephanie-wang enabled auto-merge (squash) June 11, 2025 23:44
@github-actions github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) Jun 11, 2025
@github-actions github-actions bot disabled auto-merge June 12, 2025 17:39
@kevin85421 (Member) left a comment

Looks great!


# Create test tensor
tensor = torch.tensor([1, 2, 3])
gpu_ref = src_actor.echo_cuda.remote(tensor)
Member

what is echo_cuda?
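
(For context, a plausible shape for such a test helper, purely illustrative and not necessarily what test_gpu_objects.py actually defines: an actor method that moves the input tensor to the GPU and returns it, so the result is stored as a GPU object.)

```python
import ray
import torch


@ray.remote(num_gpus=1)
class GPUActor:
    # Hypothetical: echo the input tensor back from CUDA memory, so the
    # returned value is kept on the GPU and sent over NCCL.
    @ray.method(tensor_transport="nccl")
    def echo_cuda(self, tensor: torch.Tensor) -> torch.Tensor:
        return tensor.to("cuda")
```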

if not _TORCH_AVAILABLE:
    raise ImportError(
        "`tensor_transport` requires PyTorch. "
        "Please install torch with 'pip install torch' to use this feature."
    )
Member

As I recall, `pip install torch` installs the CPU-only version of PyTorch. Maybe we can just ask users to install torch without suggesting a specific install command.

@stephanie-wang stephanie-wang enabled auto-merge (squash) June 16, 2025 21:37
@stephanie-wang stephanie-wang merged commit 93acaf1 into ray-project:master Jun 16, 2025
5 of 6 checks passed
@stephanie-wang stephanie-wang deleted the gpu-object-collective-integration branch June 17, 2025 01:16
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
…GPU objects (#53720)

Signed-off-by: Stephanie wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
…GPU objects (ray-project#53720)
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
…GPU objects (#53720)
