@kevin85421 kevin85421 commented May 12, 2025

Why are these changes needed?

High-level

(Figure: high-level design diagram)

This PR implements most parts of the diagram above.

  • Currently, this PR supports only Tensors, not TensorDicts.
  • This PR supports data that mixes both CPU and GPU.
  • This PR implements a very simple GC mechanism: all data in the in-actor object store is garbage-collected once it has been consumed.
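The consume-once GC policy can be sketched in a few lines. This is an illustrative stand-in, not Ray's actual in-actor object store; the class and method names below are hypothetical.

```python
# Minimal sketch of a consume-once store: an entry is garbage-collected
# the moment it is consumed. Names are hypothetical, not Ray's API.
class InActorObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, object_id, tensors):
        self._objects[object_id] = tensors

    def get(self, object_id):
        # Consuming an entry removes it immediately (the "very simple GC").
        return self._objects.pop(object_id)

store = InActorObjectStore()
store.put("obj-1", ["tensor-a", "tensor-b"])
assert store.get("obj-1") == ["tensor-a", "tensor-b"]
assert "obj-1" not in store._objects  # already collected
```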

Details

  • Step 1: Users annotate the sender’s actor method with @ray.method(tensor_transport=...). The valid, case-insensitive values for tensor_transport are nccl, gloo, and object_store (default).

    @ray.remote
    class GPUTestActor:
        @ray.method(tensor_transport="GLOO")
        def echo(self, data):
            return data
        ...
  • Step 2: Users create a communication group, such as an NCCL group, for actors that need to communicate with each other. In addition, each actor needs to register the custom serializer.

    • This step can be optimized for better UX in the future.
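Conceptually, the custom serializer's job is to strip tensor leaves out of a value and replace them with placeholders, so only metadata travels through the object store while the tensors move out-of-band. The sketch below is a self-contained toy under that assumption; `FakeTensor`, `extract_tensors`, and `restore_tensors` are hypothetical names, not Ray's API.

```python
# Toy model of the custom serializer: tensor leaves become index
# placeholders, and the real tensors are collected for out-of-band
# transfer (e.g., NCCL or GLOO). All names here are illustrative.
class FakeTensor:  # stand-in for torch.Tensor
    def __init__(self, data):
        self.data = data

def extract_tensors(value, out_of_band):
    """Replace each tensor leaf with an index into `out_of_band`."""
    if isinstance(value, FakeTensor):
        out_of_band.append(value)
        return ("__tensor__", len(out_of_band) - 1)
    if isinstance(value, list):
        return [extract_tensors(v, out_of_band) for v in value]
    return value

def restore_tensors(value, out_of_band):
    """Inverse of extract_tensors: swap placeholders back for tensors."""
    if isinstance(value, tuple) and value and value[0] == "__tensor__":
        return out_of_band[value[1]]
    if isinstance(value, list):
        return [restore_tensors(v, out_of_band) for v in value]
    return value

tensors = []
meta = extract_tensors([FakeTensor([1, 2]), "cpu-data"], tensors)
assert meta == [("__tensor__", 0), "cpu-data"]
restored = restore_tensors(meta, tensors)
assert restored[0].data == [1, 2] and restored[1] == "cpu-data"
```

Note that mixed CPU/GPU data falls out of this scheme naturally: non-tensor parts stay inline while tensor leaves are handled separately.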
  • Step 3: Pass the tensor_transport information through the stack—Python (actor.py) → Cython → C++ → Cython—when submitting a task to the sender actor.

    • If tensor_transport is not OBJECT_STORE, serialize_and_store_gpu_objects is called to extract tensors from the task output and store them in the GPUObjectManager.
  • Step 4: When the driver process resolves the dependencies of the receiver actor’s task argument, if that argument is an ObjectRef pointing to an object created by an actor method annotated with @ray.method(tensor_transport="...") (NCCL or GLOO), it submits a __ray_send__ task to the sender actor to initiate the send operation (e.g., NCCL send) and a __ray_recv__ task to the receiver actor to initiate the receive operation (e.g., NCCL recv).
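The driver-side decision in step 4 can be sketched as follows. `FakeObjectRef` and `resolve_argument` are hypothetical stand-ins for illustration; only the `__ray_send__`/`__ray_recv__` task names come from the PR itself.

```python
from dataclasses import dataclass
from enum import Enum

class TensorTransport(Enum):
    OBJECT_STORE = 0
    NCCL = 1
    GLOO = 2

@dataclass
class FakeObjectRef:  # stand-in for an ObjectRef plus its transport tag
    id: str
    tensor_transport: TensorTransport

def resolve_argument(ref, sender, receiver, submitted):
    """Toy dependency resolution for one receiver-task argument."""
    if ref.tensor_transport is TensorTransport.OBJECT_STORE:
        submitted.append(("fetch", ref.id))  # normal object-store path
        return
    # Out-of-band transport: pair a send task on the producer actor with
    # a recv task on the consumer actor, keyed by the same object ID.
    submitted.append((sender, "__ray_send__", ref.id))
    submitted.append((receiver, "__ray_recv__", ref.id))

tasks = []
resolve_argument(FakeObjectRef("obj-1", TensorTransport.GLOO),
                 "sender_actor", "receiver_actor", tasks)
assert tasks == [("sender_actor", "__ray_send__", "obj-1"),
                 ("receiver_actor", "__ray_recv__", "obj-1")]
```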

  • Step 5: Pass the object ID through the stack—C++ (driver) → C++ (receiver actor) → Cython → Python (def deserialize).

    • In step 4, the data’s tensors are transferred via out-of-band NCCL or GLOO communication, then stored in the receiver actor’s in-actor object store using the object ID as the key.
    • In def deserialize, use the object ID to retrieve tensors from the in-actor object store, add them to the serialization context, and then deserialize to obtain the argument.
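A toy version of this receiver-side path, assuming placeholder-style metadata; `deserialize` here is a simplified hypothetical stand-in, not Ray's actual def deserialize.

```python
# Sketch of the receiver side: the tensors for this object ID already
# arrived out-of-band (step 4) and sit in the in-actor store. Popping them
# both retrieves them and garbage-collects the entry; they are then used
# to rebuild the argument from its placeholder metadata.
def deserialize(object_id, metadata, in_actor_store):
    context = {"tensors": in_actor_store.pop(object_id)}  # retrieve + GC
    return [context["tensors"][v[1]]
            if isinstance(v, tuple) and v and v[0] == "__tensor__" else v
            for v in metadata]

store = {"obj-1": ["gpu-tensor"]}
arg = deserialize("obj-1", [("__tensor__", 0), "plain"], store)
assert arg == ["gpu-tensor", "plain"]
assert "obj-1" not in store  # consumed entries are garbage-collected
```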

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

stephanie-wang and others added 14 commits May 1, 2025 07:45
Signed-off-by: Stephanie wang <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>

@stephanie-wang stephanie-wang left a comment


Looking good so far!

kevin85421 added 15 commits May 16, 2025 05:59
Signed-off-by: Kai-Hsun Chen <[email protected]>
@kevin85421

Hi @edoakes, I’ve updated the type from a string to a Protobuf enum. Would you have a chance to take another look? I’ll be adding some tests today.

@kevin85421

I added a test to cover both non-trivial public functions in GPUObjectManager.

2884f4f

@kevin85421

I decided not to rebase to fix the sign-off issue because:

(screenshot omitted)

@stephanie-wang stephanie-wang merged commit 2ff7298 into ray-project:master May 29, 2025
4 of 5 checks passed
kevin85421 added a commit to kevin85421/ray that referenced this pull request Jun 6, 2025
jjyao pushed a commit that referenced this pull request Jun 6, 2025
@kevin85421 kevin85421 added the rdt Ray Direct Transport label Jun 15, 2025
stephanie-wang added a commit that referenced this pull request Jun 16, 2025
…GPU objects (#53720)

Adds integration between the single-controller collective APIs
introduced in #53319 and the GPU objects feature prototyped in #52938.
Actor collectives created through
`ray.experimental.collective.create_collective_group` will now be
automatically used if a task declares a tensor transport other than the
default OBJECT_STORE. This also adds support for allocating the torch
tensors on the correct device (GPU for NCCL and CPU for GLOO).

See updates in test_gpu_objects.py for examples.
---------

Signed-off-by: Stephanie wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
weiquanlee pushed a commit to antgroup/ant-ray that referenced this pull request Aug 5, 2025
@@ -0,0 +1,153 @@
import sys

comment for linking

