[core][compiled graphs] Supporting allreduce on list of input nodes #51047
Conversation
Signed-off-by: Puyuan Yao <[email protected]>
@anyadontfly could you rebase? There is a merge conflict.
Hi Stephanie, the PR is ready to run more tests. Thanks! @stephanie-wang
for i in range(len(input_node_list)):
    output_node = ClassMethodNode(
        f"return_idx_{i}",
        (collective_output_node, i),
        dict(),
        dict(),
        {
            BIND_INDEX_KEY: collective_output_node._get_bind_index(),
            IS_CLASS_METHOD_OUTPUT_KEY: True,
            PARENT_CLASS_NODE_KEY: actor_handle,
        },
    )
    output_nodes.append(output_node)
Actually I don't quite understand this logic. I thought we should always return the same type of node, just sometimes it will be a list and sometimes it will be a nested list. Is this method now returning either a list of ClassMethodNodes or a list of CollectiveOutputNodes? If so, please update it so that we only return one type of node.
Also, please update the type signature accordingly.
I used the same logic as a ClassMethodNode with multiple return nodes, so that we can execute the NCCL operation once while returning multiple nodes.
If no nested list is passed in (normal allreduce), _bind will return a CollectiveNode, which is equivalent to a ClassMethodNode with _is_class_method_output=False.
If a nested list is passed in (bucketed allreduce), _bind will return ClassMethodNodes with _is_class_method_output=True and set _class_method_output to the original CollectiveNode, so that the NCCL operation is executed only once at runtime instead of once per object.
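The dispatch described above can be sketched as a small standalone example. This is a simplified illustration, not the PR's actual code: `OutputSelector` stands in for a `ClassMethodNode` with `_is_class_method_output=True`, and the real classes in `python/ray/dag/` carry far more state.

```python
# Simplified sketch of the _bind dispatch: one input yields the
# CollectiveNode itself; a list of N inputs yields N lightweight wrapper
# nodes that each index into the single shared collective output, so the
# NCCL op runs once at runtime regardless of N.
class CollectiveNode:
    def __init__(self, inputs):
        self.inputs = inputs


class OutputSelector:
    """Stands in for ClassMethodNode with _is_class_method_output=True."""

    def __init__(self, collective, idx):
        self.collective = collective  # shared CollectiveNode, executed once
        self.idx = idx                # which result this wrapper returns


def bind(input_node_list):
    collective = CollectiveNode(input_node_list)
    if len(input_node_list) == 1:
        # Normal allreduce: return the collective node directly.
        return collective
    # Bucketed allreduce: one selector per input, all sharing the collective.
    return [OutputSelector(collective, i) for i in range(len(input_node_list))]
```

All returned selectors reference the same `CollectiveNode` instance, which is what guarantees a single execution of the collective.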
Can we modify this so that we always use the same logic for len(input_node_list) == 1 or N? Conceptually, it would be better if we always think of a CollectiveNode as taking in a nested list, and the only difference here should be whether we return a flat or nested list.
I changed the returned node for an N-element input_node_list to CollectiveNode.
I'm trying to inherit ClassMethodNode's implementation for num_returns == 1 or N.
For allreducing 1 object, we can perform the allreduce and return the result from the CollectiveNode directly. But for allreducing N objects, we still want to perform the allreduce only once on the CollectiveNode, so we need intermediate nodes to distribute the results to the different downstream nodes.
If I used the same logic for 1 object as for N objects, I could return an intermediate node from .bind as in the N-object case, but I think that would result in a redundant node.
collective operation. The output tensors have the same length and order
as the input node list of the actor of this operation.
This comment is confusing. There is no input node list in this function?
Yes, I think so. The inputs and outputs are tensors in execute.
python/ray/dag/collective_node.py (outdated)
    recv_buf = torch.empty_like(t)
    communicator.allreduce(t, recv_buf, self._op.reduceOp)
else:
    recv_buf = tuple(torch.empty_like(t) for t in send_buf)
Why do we allocate a separate torch tensor for each input instead of one flat tensor?
I changed the implementation here so that recv_buf points into flat_buf, avoiding the unnecessary per-tensor memory allocation.
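The flat-buffer idea can be illustrated with a small numpy sketch. This is an illustration of the technique, not code from the PR: numpy stands in for torch, and the function name `make_recv_views` is invented for this example (it assumes all inputs share one dtype, matching the dtype check added later in this PR).

```python
import numpy as np


# Allocate one flat receive buffer and hand out per-input views into it,
# instead of allocating a separate buffer per input tensor. Writing through
# any view writes into the single shared allocation.
def make_recv_views(send_buf):
    total = sum(t.size for t in send_buf)
    flat_buf = np.empty(total, dtype=send_buf[0].dtype)  # single allocation
    views, offset = [], 0
    for t in send_buf:
        views.append(flat_buf[offset:offset + t.size].reshape(t.shape))
        offset += t.size
    return flat_buf, views
```

Because each view aliases a slice of `flat_buf`, the collective can fill the flat buffer once and every downstream consumer sees its own correctly shaped result with no extra copies.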
Thanks, this looks good! Please just address the comment about dtypes.
@@ -212,21 +211,25 @@ def execute(
        recv_buf = torch.empty_like(t)
        communicator.allreduce(t, recv_buf, self._op.reduceOp)
    else:
        if not all(t.dtype == send_buf[0].dtype for t in send_buf):
You can support this case by using torch.view. But it's fine to do it in a follow-up PR.
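The `torch.view` approach the reviewer suggests (reinterpreting each tensor's storage as raw bytes so tensors of different dtypes can share one flat buffer) looks roughly like this in a numpy stand-in. This is a sketch of the idea for a possible follow-up, not code from the PR; `pack_as_bytes` and `unpack_bytes` are invented names.

```python
import numpy as np


# Reinterpret each array's storage as raw bytes so arrays of different
# dtypes can be packed into, and recovered from, one uint8 buffer.
def pack_as_bytes(arrays):
    byte_views = [a.view(np.uint8).ravel() for a in arrays]
    return np.concatenate(byte_views)  # one flat byte buffer


def unpack_bytes(flat, arrays):
    out, offset = [], 0
    for a in arrays:
        n = a.nbytes
        out.append(flat[offset:offset + n].view(a.dtype).reshape(a.shape))
        offset += n
    return out
```

With this trick the mixed-dtype check could be relaxed: the collective would operate on the single byte buffer, and each output would be recovered by viewing its byte slice with the original dtype and shape.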
got it, thanks!
Head branch was pushed to by a user without write access
Why are these changes needed?
Currently, only one DAG node per actor can be bound to allreduce. When users want to schedule the output tensors of several DAG nodes into one allreduce call, they have to collect the tensors into a tuple and return that tuple from a single DAG node. For example, to launch allreduce on the results of comp1 and comp2, users need an additional function get_results to gather the results of comp1 and comp2.
With this PR, users can simply put the results of comp1 and comp2 in a list and launch allreduce on the list of outputs, which no longer requires the intermediate function get_results.
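A hedged sketch of the intended usage. The actor definitions and the `allreduce` import path are assumptions based on Ray's compiled graphs collective API, not copied from this PR, and this will not run outside a Ray cluster with NCCL-capable GPUs:

```python
# Sketch only: assumes two GPU actors `worker1` and `worker2`, each with
# methods comp1/comp2 returning torch tensors, and the experimental
# collective allreduce binding from Ray compiled graphs.
import ray
from ray.dag import InputNode
from ray.experimental.collective import allreduce

with InputNode() as inp:
    t1 = worker1.comp1.bind(inp)
    t2 = worker1.comp2.bind(inp)
    t3 = worker2.comp1.bind(inp)
    t4 = worker2.comp2.bind(inp)
    # Before this PR: each actor needed an extra get_results method to
    # gather its tensors into a single node before binding to allreduce.
    # With this PR: bind a list of nodes per actor directly, and the
    # bucketed allreduce runs once per actor pair of buckets.
    reduced = allreduce.bind([[t1, t2], [t3, t4]])
```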
.Related issue number
Meta-issue: #47983
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I have added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.