[core][compiled graphs] Supporting allreduce on list of input nodes #51047
Conversation
Signed-off-by: Puyuan Yao <[email protected]>
@anyadontfly could you rebase? There is a merge conflict.
Hi Stephanie, the PR is ready to run more tests. Thanks! @stephanie-wang
for i in range(len(input_node_list)):
    output_node = ClassMethodNode(
        f"return_idx_{i}",
        (collective_output_node, i),
        dict(),
        dict(),
        {
            BIND_INDEX_KEY: collective_output_node._get_bind_index(),
            IS_CLASS_METHOD_OUTPUT_KEY: True,
            PARENT_CLASS_NODE_KEY: actor_handle,
        },
    )
    output_nodes.append(output_node)
Actually I don't quite understand this logic. I thought we should always return the same type of node, just sometimes it will be a list and sometimes it will be a nested list. Is this method now returning either a list of ClassMethodNodes or a list of CollectiveOutputNodes? If so, please update it so that we only return one type of node.
Also, please update the type signature accordingly.
I used the same logic as a ClassMethodNode with multiple return nodes, so that we can execute the NCCL operation once while returning multiple nodes.
If no nested list is passed in (normal allreduce), _bind will return a CollectiveNode, which is equivalent to a ClassMethodNode with _is_class_method_output=False.
If a nested list is passed in (bucketed allreduce), _bind will return ClassMethodNodes with _is_class_method_output=True and set _class_method_output to the original CollectiveNode, so that the NCCL operation is executed only once at runtime instead of once per object.
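The dispatch described above can be sketched as a small standalone example. This is a simplified illustration, not the PR's actual code: `OutputSelector` stands in for a `ClassMethodNode` with `_is_class_method_output=True`, and the real classes in `python/ray/dag/` carry far more state.

```python
# Simplified sketch of the _bind dispatch: one input yields the
# CollectiveNode itself; a list of N inputs yields N lightweight wrapper
# nodes that each index into the single shared collective output, so the
# NCCL op runs once at runtime regardless of N.
class CollectiveNode:
    def __init__(self, inputs):
        self.inputs = inputs


class OutputSelector:
    """Stands in for ClassMethodNode with _is_class_method_output=True."""

    def __init__(self, collective, idx):
        self.collective = collective  # shared CollectiveNode, executed once
        self.idx = idx                # which result this wrapper returns


def bind(input_node_list):
    collective = CollectiveNode(input_node_list)
    if len(input_node_list) == 1:
        # Normal allreduce: return the collective node directly.
        return collective
    # Bucketed allreduce: one selector per input, all sharing the collective.
    return [OutputSelector(collective, i) for i in range(len(input_node_list))]
```

All returned selectors reference the same `CollectiveNode` instance, which is what guarantees a single execution of the collective.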
Can we modify this so that we always use the same logic for len(input_node_list) == 1 or N? Conceptually, it would be better if we always think of a CollectiveNode as taking in a nested list, and the only difference here should be whether we return a flat or nested list.
I changed the returned node for an N-element input_node_list to CollectiveNode.
I'm trying to inherit ClassMethodNode's implementation for num_returns == 1 or N.
For allreducing 1 object, we can perform the allreduce and return the result from the CollectiveNode directly. But for allreducing N objects, we still want to perform the allreduce only once on the CollectiveNode, so we need intermediate nodes to distribute the results to the different downstream nodes.
If I used the same logic for 1 object as for N objects, I could return an intermediate node from .bind as in the N-object case, but I think that would result in a redundant node.
collective operation. The output tensors have the same length and order
as the input node list of the actor of this operation.
This comment is confusing. There is no input node list in this function?
Yes, I think so. The inputs and outputs are tensors in execute.
python/ray/dag/collective_node.py (outdated)
    recv_buf = torch.empty_like(t)
    communicator.allreduce(t, recv_buf, self._op.reduceOp)
else:
    recv_buf = tuple(torch.empty_like(t) for t in send_buf)
Why do we allocate a separate torch tensor for each input instead of one flat tensor?
I changed the implementation here so that recv_buf points into flat_buf, avoiding the unnecessary per-tensor memory allocation.
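The flat-buffer idea can be illustrated with a small numpy sketch. This is an illustration of the technique, not code from the PR: numpy stands in for torch, and the function name `make_recv_views` is invented for this example (it assumes all inputs share one dtype, matching the dtype check added later in this PR).

```python
import numpy as np


# Allocate one flat receive buffer and hand out per-input views into it,
# instead of allocating a separate buffer per input tensor. Writing through
# any view writes into the single shared allocation.
def make_recv_views(send_buf):
    total = sum(t.size for t in send_buf)
    flat_buf = np.empty(total, dtype=send_buf[0].dtype)  # single allocation
    views, offset = [], 0
    for t in send_buf:
        views.append(flat_buf[offset:offset + t.size].reshape(t.shape))
        offset += t.size
    return flat_buf, views
```

Because each view aliases a slice of `flat_buf`, the collective can fill the flat buffer once and every downstream consumer sees its own correctly shaped result with no extra copies.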
Thanks, this looks good! Please just address the comment about dtypes.
@@ -212,21 +211,25 @@ def execute(
        recv_buf = torch.empty_like(t)
        communicator.allreduce(t, recv_buf, self._op.reduceOp)
    else:
        if not all(t.dtype == send_buf[0].dtype for t in send_buf):
You can support this case by using torch.view. But it's fine to do it in a follow-up PR.
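The `torch.view` approach the reviewer suggests (reinterpreting each tensor's storage as raw bytes so tensors of different dtypes can share one flat buffer) looks roughly like this in a numpy stand-in. This is a sketch of the idea for a possible follow-up, not code from the PR; `pack_as_bytes` and `unpack_bytes` are invented names.

```python
import numpy as np


# Reinterpret each array's storage as raw bytes so arrays of different
# dtypes can be packed into, and recovered from, one uint8 buffer.
def pack_as_bytes(arrays):
    byte_views = [a.view(np.uint8).ravel() for a in arrays]
    return np.concatenate(byte_views)  # one flat byte buffer


def unpack_bytes(flat, arrays):
    out, offset = [], 0
    for a in arrays:
        n = a.nbytes
        out.append(flat[offset:offset + n].view(a.dtype).reshape(a.shape))
        offset += n
    return out
```

With this trick the mixed-dtype check could be relaxed: the collective would operate on the single byte buffer, and each output would be recovered by viewing its byte slice with the original dtype and shape.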
got it, thanks!
Head branch was pushed to by a user without write access
Why are these changes needed?
Currently, only one DAG node per actor can be bound to allreduce. When users want to schedule the output tensors of several DAG nodes into one allreduce call, they have to collect the tensors into a tuple and return that tuple from a single DAG node. For example, to launch allreduce on the results of comp1 and comp2, users need an additional function get_results to gather the results of comp1 and comp2.
With this PR, users can simply put the results of comp1 and comp2 in a list and launch allreduce on the list of outputs, which no longer requires the intermediate function get_results.
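A hedged sketch of the intended usage. The actor definitions and the `allreduce` import path are assumptions based on Ray's compiled graphs collective API, not copied from this PR, and this will not run outside a Ray cluster with NCCL-capable GPUs:

```python
# Sketch only: assumes two GPU actors `worker1` and `worker2`, each with
# methods comp1/comp2 returning torch tensors, and the experimental
# collective allreduce binding from Ray compiled graphs.
import ray
from ray.dag import InputNode
from ray.experimental.collective import allreduce

with InputNode() as inp:
    t1 = worker1.comp1.bind(inp)
    t2 = worker1.comp2.bind(inp)
    t3 = worker2.comp1.bind(inp)
    t4 = worker2.comp2.bind(inp)
    # Before this PR: each actor needed an extra get_results method to
    # gather its tensors into a single node before binding to allreduce.
    # With this PR: bind a list of nodes per actor directly, and the
    # bucketed allreduce runs once per actor pair of buckets.
    reduced = allreduce.bind([[t1, t2], [t3, t4]])
```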
.Related issue number
Meta-issue: #47983
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I have added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.