[simplefsdp] add manual bucketing pass #165487

ruisizhang123 · 2025-10-14T22:56:06Z

As titled, this PR adds a manual bucketing pass to SimpleFSDP. Users pass the FQNs of the modules they want to bucket together via module_bucket_plans. Then, _manual_bucket_collectives gets the nodes of the subgraph corresponding to each bucket_module and buckets the bucketable (FSDP-style) AG/RS together. _manual_reorder_graph reorders them for overlapping.
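For illustration, a bucket plan for a llama3-style model might look like the sketch below. The FQNs (`tok_embeddings`, `layers.N`, `norm`, `output`) and the helper `build_bucket_plans` are assumptions made for this example, not names taken from the PR. The only detail grounded in the diff is the shape of the plan: a list whose entries are either one FQN string or a list of FQN strings.

```python
from typing import Union


def build_bucket_plans(num_layers: int) -> list[Union[list[str], str]]:
    """Hypothetical helper: one bucket per transformer block, plus a merged tail bucket."""
    plans: list[Union[list[str], str]] = ["tok_embeddings"]
    # A plain string entry means "bucket the collectives of this one module".
    plans += [f"layers.{i}" for i in range(num_layers)]
    # A list entry groups several modules into a single bucket.
    plans.append(["norm", "output"])
    return plans


module_bucket_plans = build_bucket_plans(num_layers=2)
print(module_bucket_plans)
```

Grouping the final norm and output projection into one bucket is just one possible choice; the right grouping depends on parameter sizes and on how much compute is available to hide each collective.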

For detailed performance, see this torchtitan PR: pytorch/torchtitan#1881.

There are a few todo items listed in the torchtitan PR. Let's start with this PR, which implements FSDP+TP+llama3 manual bucketing. I will fix/add the rest in follow-up PRs.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

pytorch-bot · 2025-10-14T22:56:10Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165487


❌ 3 New Failures

As of commit 7bf56fc with merge base 87d17e9:

NEW FAILURES - The following jobs have failed:

inductor / unit-test / inductor-test / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu) (gh)
test/distributed/test_aten_comm_compute_reordering.py::TestManualOverlapBucketing::test_manual_reordering_bucketing_pass
Lint / lintrunner-noclang-partial / linux-job (gh)
>>> Lint for torch/_inductor/fx_passes/graph_view.py:
Lint / lintrunner-pyrefly-partial / linux-job (gh)
>>> Lint for torch/_inductor/fx_passes/overlap_manual_scheduling.py:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang · 2025-10-29T03:35:36Z

Can't land this without some tests. Would it be better for this to live in torchtitan? I can't tell... but since simple fsdp is in torchtitan and not in core, maybe it should live there!

ezyang · 2025-10-29T03:37:47Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+
+def make_graph_view(graph: fx.Graph) -> Container:
+    """
+    Code from: https://github.com/meta-pytorch/autoparallel/pull/158


If we're putting this in core, let's do it for real. Dedicated module, docs, tests. cc @fmassa @eellison. I haven't used this API yet so I don't have a real world opinion on it.

Also have Claude generate some simple unit tests for it

ezyang · 2025-10-29T03:39:01Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+    The manual overlapping consists of two steps:
+    Step 1: bucket all-gather/reduce-scatter in each module in module_bucket_plans
+    Step 2: reorder all-gather to overlap with last module_bucket &
+        reorder reduce-scatter to overlap with next module_bucket


The reordering here is just the SimpleFSDP strategy that you used to have in Inductor IR, right?
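The all-gather half of Step 2 can be pictured with a toy schedule. This is a minimal sketch over strings, not the FX pass from this PR: the op names (`ag_i`, `wait_ag_i`, `compute_i`) and the function `prefetch_all_gathers` are hypothetical. It only shows the ordering idea that the all-gather for bucket i+1 is issued before the compute of bucket i, so the two can overlap. Reduce-scatter would be handled symmetrically in the backward pass.

```python
def prefetch_all_gathers(num_buckets: int) -> list[str]:
    """Toy forward schedule where each bucket's all-gather is issued one bucket early."""
    schedule: list[str] = []
    for i in range(num_buckets):
        if i == 0:
            # Nothing precedes the first bucket, so its all-gather is exposed.
            schedule.append("ag_0")
        schedule.append(f"wait_ag_{i}")
        if i + 1 < num_buckets:
            # Issue the next bucket's all-gather before this bucket's compute.
            schedule.append(f"ag_{i + 1}")
        schedule.append(f"compute_{i}")
    return schedule


print(prefetch_all_gathers(3))
```

In this toy ordering every `wait_ag_i` still comes before `compute_i`, so data dependencies are preserved while communication for the next bucket runs alongside the current bucket's compute.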

ruisizhang123 · 2025-10-29T16:20:09Z

> Can't land this without some tests. Would it be better for this to live in torchtitan? I can't tell... but since simple fsdp is in torchtitan and not in core, maybe it should live there!

This is the start of a manual optimization pass that is independent of simplefsdp, which I think should live in core. Currently, it uses a similar reordering strategy, but I have a list of todo items to make it general here: pytorch/torchtitan#1881. I will add them in follow-up PRs.

I will add tests to it to make sure it can live in core :)

ezyang · 2025-10-30T04:03:16Z

Well, @eellison has the really generic one, so I am not sure how much I want to push on this outside of simple fsdp generated all gathers ;)

ruisizhang123 · 2025-10-30T05:06:19Z

> Well, @eellison has the really generic one, so I am not sure how much I want to push on this outside of simple fsdp generated all gathers ;)

hmmm a better way to put it would be: you cannot guarantee that the automated overlapping will squeeze out the best perf out of the box. In some cases, users may want to control how things are bucketed & reordered, either for better overlapping (they found some parts are not overlapped and want to overlap them manually) or for bit-wise loss equivalence (the fsdp2 bucketing thing we are comparing right now).

In both cases, having a controllable overlapping pass would be helpful.

ezyang · 2025-10-30T19:10:27Z

Bucketing, absolutely. Overlapping, I think you potentially may need something even more manual than this.

ruisizhang123 · 2025-10-31T01:53:22Z

> Bucketing, absolutely. Overlapping, I think you potentially may need something even more manual than this.

yes, we need a more explicit overlapping API for sure. On the other hand, fsdp2's fully_shard API also does something similar: users specify which modules they want to bucket, and the overlapping happens under the hood.

I believe this fx-level overlapping would give us more freedom to move things. This PR is just a starting point.

ezyang · 2025-11-03T04:09:49Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+)
+from torch._inductor.fx_passes.overlap_preserving_bucketer import (
+    bucket_key,
+    OverlapPreservingBucketer,


@eellison to comment on inheritance here; note that we have some internal code also inheriting from this class in a similar way too.

ruisizhang123 · 2025-11-04T07:28:02Z

ehhhh just realized it's hard to add an end-to-end manual bucketing & overlapping test in pytorch since simplefsdp is in torchtitan 😅

Another reason to consider moving simplefsdp to pytorch when more features like this are coming in lolll. I guess there is no way other than simplefsdp to do [define a model & manual bucketing FQNs -> FSDP sharding -> trace full graph -> do bucketing & overlapping]

wconstab · 2025-11-06T21:42:00Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+    return name
+
+
+def _find_key_nodes(nodes: list[fx.Node]) -> tuple[list[fx.Node], list[fx.Node]]:


i'd appreciate a docstring at least, or a renamed function, that makes it more intuitive what this does. (what is a "key" node?)

wconstab · 2025-11-06T21:44:25Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+    return root, outputs
+
+
+def _make_subgraph(nodes: list[fx.Node]) -> fx.Graph:


similar comment as above: intuitively a function named "make_subgraph" i'd expect some input arg that helps identify what part of the graph to include in the subgraph. In this case it sounds like it makes the 'subgraph containing key nodes', which, is still opaque to me.

wconstab · 2025-11-06T21:52:08Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+    Get subgraph by path(s).
+    Args:
+        graph_view (object): Root graph view object.
+        paths (str or list of str): Path(s) to subgraph.


any requirements here? overlapping paths ok? disjoint paths ok?

there is no requirement. This function collects all nodes that belong to the given path(s), which covers the case where we want to put nodes from multiple paths in one bucket. A disjointness assertion should happen here to ensure that two buckets have no overlap. Will add an assertion in the _obtain_nodes_in_subgraph func.
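The behavior described in this reply can be sketched as follows. This is an illustration under assumptions, not the code in the PR: nodes are plain strings, `node_to_fqn` is a hypothetical flat mapping from node name to owning module FQN, and `nodes_for_paths` / `assert_disjoint` are made-up helper names. Paths inside one bucket may overlap freely, while two different buckets may not share a node.

```python
from typing import Union


def nodes_for_paths(node_to_fqn: dict[str, str], paths: Union[str, list[str]]) -> set[str]:
    """Collect every node whose module FQN equals, or is nested under, any given path."""
    if isinstance(paths, str):
        paths = [paths]
    return {
        node
        for node, fqn in node_to_fqn.items()
        # The "." suffix keeps "layers.1" from also matching "layers.10".
        if any(fqn == p or fqn.startswith(p + ".") for p in paths)
    }


def assert_disjoint(buckets: list[set[str]]) -> None:
    """Raise if any node appears in more than one bucket."""
    seen: set[str] = set()
    for bucket in buckets:
        overlap = seen & bucket
        assert not overlap, f"nodes assigned to multiple buckets: {sorted(overlap)}"
        seen |= bucket


node_to_fqn = {
    "ag_wq": "layers.0.attention",
    "ag_w1": "layers.0.feed_forward",
    "ag_wq_1": "layers.1.attention",
    "ag_wq_10": "layers.10.attention",
}
bucket_a = nodes_for_paths(node_to_fqn, ["layers.0", "layers.0.attention"])
bucket_b = nodes_for_paths(node_to_fqn, "layers.1")
assert_disjoint([bucket_a, bucket_b])
```

Using a set for each bucket is what makes overlapping paths within a single bucket harmless: a node matched twice is stored once.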

wconstab · 2025-11-06T21:56:11Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+    return stack == ""
+
+
+def make_graph_view(graph: fx.Graph) -> Container:


it's a little odd to me that this function returns a 'Container' type, and then on the Container you can further call .graph_view() to get an fx graphmodule type. I might not be understanding the design yet, but i wonder if it would be cleaner to have

- Container class renamed to GraphView class
- functions like get_subgraph_by_path moved to class methods of GraphView
- make_graph_view moves to GraphView.__init__
- GraphView.graph_view() still doesn't make sense to me

We are using the Container class to keep a hierarchical mapping between nodes & their module names. I agree it would be more intuitive to rename Container to GraphView, will do this. We can also rename GraphView.graph_view() to something like GraphView.obtain_subgraph()?

I realized GraphView.graph_view() is not actually used in our bucketing & reordering pass to get subgraphs. Instead, I decided to use the get_subgraph_by_path func, which is friendlier to FQNs passed in as strings from torchtitan. Thus, I removed the legacy functions, including _make_subgraph, _find_key_nodes and graph_view, in the new code.

ok removing those functions will address a lot of my questions.
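To make the "hierarchical mapping between nodes & their module names" concrete, here is a minimal sketch of such a tree with path-based lookup. It is not the GraphView class from this PR: `ToyGraphView`, `insert`, and `collect` are hypothetical names, and nodes are plain strings rather than `fx.Node` objects.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGraphView:
    """Hypothetical tree keyed by module-name components (not the PR's GraphView)."""

    name: str = ""
    nodes: list[str] = field(default_factory=list)
    children: dict[str, "ToyGraphView"] = field(default_factory=dict)

    def insert(self, fqn: str, node: str) -> None:
        """Attach a node at the tree position named by its module FQN."""
        view = self
        for part in fqn.split(".") if fqn else []:
            view = view.children.setdefault(part, ToyGraphView(part))
        view.nodes.append(node)

    def collect(self, path: str) -> list[str]:
        """Return all nodes at or below the given path, or [] if the path is unknown."""
        view = self
        for part in path.split(".") if path else []:
            if part not in view.children:
                return []
            view = view.children[part]
        return view._all_nodes()

    def _all_nodes(self) -> list[str]:
        out = list(self.nodes)
        for child in self.children.values():
            out.extend(child._all_nodes())
        return out


view = ToyGraphView()
view.insert("layers.0.attention", "ag_wq")
view.insert("layers.0.feed_forward", "ag_w1")
view.insert("layers.1.attention", "ag_wq_1")
print(view.collect("layers.0"))
```

Looking up by dotted string is the property that matters for the torchtitan use case mentioned above, where bucket plans arrive as FQN strings.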

eellison · 2025-11-07T22:39:19Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+    return stack == ""
+
+
+def make_graph_view(graph: fx.Graph) -> GraphView:


Would it make sense to put this in a different file?

eellison · 2025-11-07T22:39:54Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+
+        return seen_target_op == 1
+
+    def _bucket_group(self, coll_nodes: list[fx.Node]) -> None:


reason you had to define this, just curious ?

It is a bit different than _apply_bucket FWIW

I wanted to add more info to bucketed nodes to help reordering: (1) tag the newly bucketed AG/RS's metadata, which helps identify FSDP-related comms in reordering; (2) keep track of each newly bucketed wait and its mapping to the bucketed AG/RS as self.wait_to_node_map, which helps add dependencies in reordering.

eellison · 2025-11-07T22:40:27Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+        return node.data
+
+
+class ManualOverlapPreservingBucketer(OverlapPreservingBucketer):


What parts of the existing class are you using, just for my understanding?

for OverlapPreservingBucketer, I'm only using __init__ func to get self.collective_info and self.node_ancestors. The manual bucketing logic is much simpler -- you basically put things in a bucket and call bucket helper functions.

eellison · 2025-11-07T22:41:48Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+    Scheduler that manual buckets and reorders collective nodes based on module_bucket_plans
+    """
+
+    def __init__(self, gm: fx.GraphModule, module_bucket_plans: list[list[str] | str]):


Same question here ? I guess this is heapq, mostly? anything else?

yes, mostly helper variables (e.g., collective_info) and heapq to help reordering & bucketing.

ezyang · 2025-11-11T04:07:34Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+
+
+def _get_module_stack(node: fx.Node) -> list[tuple[str, type[Any]]]:
+    if node.meta.get("nn_module_stack", "") == "":


err why not just test if it's None?
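The suggestion could look roughly like the sketch below. It uses a plain dict in place of `fx.Node.meta` so it runs without torch, and it assumes that `nn_module_stack`, when present, is a dict whose values are `(module path, module type)` pairs. The function name `get_module_stack` and the empty-list fallback are illustrative choices, not the code that landed.

```python
from typing import Any


def get_module_stack(meta: dict[str, Any]) -> list[tuple[str, type]]:
    """Return the module stack entries, or an empty list when none were recorded."""
    stack = meta.get("nn_module_stack")
    # Testing for None avoids comparing a dict against an "" sentinel.
    if stack is None:
        return []
    return list(stack.values())


class FakeLinear:
    """Stand-in for an nn.Module subclass, used only for this example."""


print(get_module_stack({}))
print(get_module_stack({"nn_module_stack": {"key0": ("layers.0", FakeLinear)}}))
```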

ezyang · 2025-11-11T04:12:06Z

torch/_inductor/fx_passes/overlap_manual_scheduling.py

+        .replace("_modules['", "")
+        .replace("['", ".")
+        .replace("']", "")
+    )


This is kind of terrible; also not sure if there's a preexisting utility for this

I can rewrite it with a regex that extracts MODULE from inside ['MODULE']. yeah, this one is a bit fragile.
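A regex-based version might look like the sketch below. The input format is an assumption: the example string `L['self']._modules['layers']._modules['0']` is a guess at what the chained `.replace` calls were normalizing, and `normalize_module_path` is a hypothetical name. Dropping a leading `self` is likewise an illustrative choice.

```python
import re

# Matches ['NAME'] and captures NAME.
_BRACKETED = re.compile(r"\['([^']+)'\]")


def normalize_module_path(raw: str) -> str:
    """Turn a bracketed module access string into a dotted FQN."""
    parts = _BRACKETED.findall(raw)
    # Assumption: a leading 'self' refers to the root module and is not part of the FQN.
    if parts and parts[0] == "self":
        parts = parts[1:]
    return ".".join(parts)


print(normalize_module_path("L['self']._modules['layers']._modules['0']"))
```

Compared with chained string replacement, this only looks at the bracketed names, so it does not depend on the exact text between them.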

ezyang

this has been stuck in review hell for a while, stamping to unblock, it's really low risk to put in

ruisizhang123 marked this pull request as draft October 14, 2025 22:56

pytorch-bot bot added ciflow/inductor module: inductor labels Oct 14, 2025

ruisizhang123 added the topic: not user facing topic category label Oct 14, 2025

ruisizhang123 mentioned this pull request Oct 15, 2025

[SimpleFSDP] add manual bucketing pass pytorch/torchtitan#1881

Open

4 tasks

ruisizhang123 force-pushed the ruisi/manual_bucket_pass branch 4 times, most recently from dfb1e41 to d94f913 Compare October 17, 2025 16:03

ruisizhang123 force-pushed the ruisi/manual_bucket_pass branch 4 times, most recently from 083bd96 to 028bf2b Compare October 28, 2025 06:57

ruisizhang123 changed the title ~~[WIP] add manual bucketing pass~~ [simplefsdp] add manual bucketing pass Oct 28, 2025

ruisizhang123 marked this pull request as ready for review October 28, 2025 07:05

ruisizhang123 requested review from IvanKobzarev, eellison and ezyang October 28, 2025 07:05

ruisizhang123 force-pushed the ruisi/manual_bucket_pass branch from 028bf2b to a4d1ed8 Compare October 28, 2025 15:42

ruisizhang123 requested a review from wconstab October 28, 2025 17:09

ruisizhang123 force-pushed the ruisi/manual_bucket_pass branch from a4d1ed8 to f31507d Compare October 28, 2025 17:45

ezyang reviewed Oct 29, 2025

ezyang reviewed Oct 29, 2025

ezyang reviewed Nov 3, 2025

ruisizhang123 force-pushed the ruisi/manual_bucket_pass branch from f31507d to 3eaadd0 Compare November 4, 2025 07:24

pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Nov 4, 2025

ruisizhang123 requested review from ezyang and fmassa November 4, 2025 07:25

wconstab reviewed Nov 4, 2025

ruisizhang123 force-pushed the ruisi/manual_bucket_pass branch 2 times, most recently from 47b1982 to c472083 Compare November 5, 2025 03:59

ruisizhang123 requested a review from wconstab November 6, 2025 05:57

wconstab reviewed Nov 6, 2025

ruisizhang123 force-pushed the ruisi/manual_bucket_pass branch 2 times, most recently from 193315c to ad08041 Compare November 7, 2025 02:05

eellison reviewed Nov 7, 2025

ezyang reviewed Nov 11, 2025

ezyang approved these changes Nov 11, 2025

ruisizhang123 force-pushed the ruisi/manual_bucket_pass branch from ad08041 to 8900629 Compare November 11, 2025 06:49

add manual bucketing

7bf56fc

ruisizhang123 force-pushed the ruisi/manual_bucket_pass branch from 8900629 to 7bf56fc Compare November 11, 2025 06:51
