@kevin85421 kevin85421 commented May 12, 2025

Why are these changes needed?

High-level

(Figure: high-level design diagram)

This PR implements most parts of the diagram above.

  • Currently, this PR supports only Tensors, not TensorDicts.
  • This PR supports data that mixes both CPU and GPU.
  • This PR implements a very simple GC mechanism: all data in the in-actor object store is garbage-collected once it has been consumed.
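The consume-once GC policy can be sketched in a few lines. This is an illustrative stand-in, not Ray's actual in-actor object store; the class and method names below are hypothetical.

```python
# Minimal sketch of a consume-once store: an entry is garbage-collected
# the moment it is consumed. Names are hypothetical, not Ray's API.
class InActorObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, object_id, tensors):
        self._objects[object_id] = tensors

    def get(self, object_id):
        # Consuming an entry removes it immediately (the "very simple GC").
        return self._objects.pop(object_id)

store = InActorObjectStore()
store.put("obj-1", ["tensor-a", "tensor-b"])
assert store.get("obj-1") == ["tensor-a", "tensor-b"]
assert "obj-1" not in store._objects  # already collected
```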

Details

  • Step 1: Users annotate the sender’s actor method with @ray.method(tensor_transport=...). The valid, case-insensitive values for tensor_transport are nccl, gloo, and object_store (default).

    @ray.remote
    class GPUTestActor:
        @ray.method(tensor_transport="GLOO")
        def echo(self, data):
            return data
        ...
  • Step 2: Users create a communication group, such as an NCCL group, for actors that need to communicate with each other. In addition, each actor needs to register the custom serializer.

    • This step can be optimized for better UX in the future.
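Conceptually, the custom serializer's job is to strip tensor leaves out of a value and replace them with placeholders, so only metadata travels through the object store while the tensors move out-of-band. The sketch below is a self-contained toy under that assumption; `FakeTensor`, `extract_tensors`, and `restore_tensors` are hypothetical names, not Ray's API.

```python
# Toy model of the custom serializer: tensor leaves become index
# placeholders, and the real tensors are collected for out-of-band
# transfer (e.g., NCCL or GLOO). All names here are illustrative.
class FakeTensor:  # stand-in for torch.Tensor
    def __init__(self, data):
        self.data = data

def extract_tensors(value, out_of_band):
    """Replace each tensor leaf with an index into `out_of_band`."""
    if isinstance(value, FakeTensor):
        out_of_band.append(value)
        return ("__tensor__", len(out_of_band) - 1)
    if isinstance(value, list):
        return [extract_tensors(v, out_of_band) for v in value]
    return value

def restore_tensors(value, out_of_band):
    """Inverse of extract_tensors: swap placeholders back for tensors."""
    if isinstance(value, tuple) and value and value[0] == "__tensor__":
        return out_of_band[value[1]]
    if isinstance(value, list):
        return [restore_tensors(v, out_of_band) for v in value]
    return value

tensors = []
meta = extract_tensors([FakeTensor([1, 2]), "cpu-data"], tensors)
assert meta == [("__tensor__", 0), "cpu-data"]
restored = restore_tensors(meta, tensors)
assert restored[0].data == [1, 2] and restored[1] == "cpu-data"
```

Note that mixed CPU/GPU data falls out of this scheme naturally: non-tensor parts stay inline while tensor leaves are handled separately.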
  • Step 3: Pass the tensor_transport information through the stack—Python (actor.py) → Cython → C++ → Cython—when submitting a task to the sender actor.

    • If tensor_transport is not OBJECT_STORE, serialize_and_store_gpu_objects is called to extract tensors from the task output and store them in the GPUObjectManager.
  • Step 4: When the driver process resolves the dependencies of the receiver actor’s task argument, if that argument is an ObjectRef pointing to an object created by an actor method annotated with @ray.method(tensor_transport="...") (NCCL or GLOO), it submits a __ray_send__ task to the sender actor to initiate the send operation (e.g., NCCL send) and a __ray_recv__ task to the receiver actor to initiate the receive operation (e.g., NCCL recv).
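The driver-side decision in step 4 can be sketched as follows. `FakeObjectRef` and `resolve_argument` are hypothetical stand-ins for illustration; only the `__ray_send__`/`__ray_recv__` task names come from the PR itself.

```python
from dataclasses import dataclass
from enum import Enum

class TensorTransport(Enum):
    OBJECT_STORE = 0
    NCCL = 1
    GLOO = 2

@dataclass
class FakeObjectRef:  # stand-in for an ObjectRef plus its transport tag
    id: str
    tensor_transport: TensorTransport

def resolve_argument(ref, sender, receiver, submitted):
    """Toy dependency resolution for one receiver-task argument."""
    if ref.tensor_transport is TensorTransport.OBJECT_STORE:
        submitted.append(("fetch", ref.id))  # normal object-store path
        return
    # Out-of-band transport: pair a send task on the producer actor with
    # a recv task on the consumer actor, keyed by the same object ID.
    submitted.append((sender, "__ray_send__", ref.id))
    submitted.append((receiver, "__ray_recv__", ref.id))

tasks = []
resolve_argument(FakeObjectRef("obj-1", TensorTransport.GLOO),
                 "sender_actor", "receiver_actor", tasks)
assert tasks == [("sender_actor", "__ray_send__", "obj-1"),
                 ("receiver_actor", "__ray_recv__", "obj-1")]
```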

  • Step 5: Pass the object ID through the stack—C++ (driver) → C++ (receiver actor) → Cython → Python (def deserialize).

    • In step 4, the data’s tensors are transferred via out-of-band NCCL or GLOO communication, then stored in the receiver actor’s in-actor object store using the object ID as the key.
    • In def deserialize, use the object ID to retrieve tensors from the in-actor object store, add them to the serialization context, and then deserialize to obtain the argument.
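A toy version of this receiver-side path, assuming placeholder-style metadata; `deserialize` here is a simplified hypothetical stand-in, not Ray's actual def deserialize.

```python
# Sketch of the receiver side: the tensors for this object ID already
# arrived out-of-band (step 4) and sit in the in-actor store. Popping them
# both retrieves them and garbage-collects the entry; they are then used
# to rebuild the argument from its placeholder metadata.
def deserialize(object_id, metadata, in_actor_store):
    context = {"tensors": in_actor_store.pop(object_id)}  # retrieve + GC
    return [context["tensors"][v[1]]
            if isinstance(v, tuple) and v and v[0] == "__tensor__" else v
            for v in metadata]

store = {"obj-1": ["gpu-tensor"]}
arg = deserialize("obj-1", [("__tensor__", 0), "plain"], store)
assert arg == ["gpu-tensor", "plain"]
assert "obj-1" not in store  # consumed entries are garbage-collected
```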

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

stephanie-wang and others added 14 commits May 1, 2025 07:45
Signed-off-by: Stephanie wang <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>

@stephanie-wang stephanie-wang left a comment


Looking good so far!

kevin85421 added 15 commits May 16, 2025 05:59
Signed-off-by: Kai-Hsun Chen <[email protected]>
@kevin85421

Hi @edoakes, I’ve updated the type from a string to a Protobuf enum. Would you have a chance to take another look? I’ll be adding some tests today.

@kevin85421

I added a test to cover both non-trivial public functions in GPUObjectManager.

2884f4f

@kevin85421

I decided not to rebase to fix the sign-off issue because:

(screenshot omitted)

@stephanie-wang stephanie-wang merged commit 2ff7298 into ray-project:master May 29, 2025
4 of 5 checks passed
kevin85421 added a commit to kevin85421/ray that referenced this pull request Jun 6, 2025
jjyao pushed a commit that referenced this pull request Jun 6, 2025
@kevin85421 kevin85421 added the rdt Ray Direct Transport label Jun 15, 2025
stephanie-wang added a commit that referenced this pull request Jun 16, 2025
…GPU objects (#53720)

Adds integration between the single-controller collective APIs
introduced in #53319 and the GPU objects feature prototyped in #52938.
Actor collectives created through
`ray.experimental.collective.create_collective_group` will now be
automatically used if a task declares a tensor transport other than the
default OBJECT_STORE. This also adds support for allocating the torch
tensors on the correct device (GPU for NCCL and CPU for GLOO).

See updates in test_gpu_objects.py for examples.
---------

Signed-off-by: Stephanie wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
weiquanlee pushed a commit to antgroup/ant-ray that referenced this pull request Aug 5, 2025
@@ -0,0 +1,153 @@
import sys

comment for linking

