Refactor PP splitting #1416
Conversation
@H-Huang should the config be in terms of FQNs of the split points instead of having to specify the whole model's module FQNs?

@tushar00jain Right now we don't use any tracing-based pipeline splitting for the model (see torchtitan/torchtitan/models/llama3/model/model.py, lines 337 to 352 in beb29a1). We can revisit this, but I believe when it was first tested there were some issues tracing the full model.
@H-Huang sure, btw should we also have a script to generate this list for llama models for debugging? Since the list can be large, maybe put it in a file and have the torchtitan config read from a specified file?

/gemini review

@gemini-code-assist review
torchtitan/config_manager.py
but currently the split points must be specified manually.
"""

module_names_per_model_chunk: list[list[str]] = field(default_factory=list)
Should this go under a different section since it'll also be reused by FT? If both PP and FT are enabled, do we use the same chunking for both, or do we want to allow specifying it differently for each?
It's probably good to just keep the chunking for PP and FT separate. We'll need to figure out how to combine them eventually, where we split each pipeline chunk into diloco fragments.
tianyu-l left a comment
I think this PR is mixing two things: a PP refactor and adding a new FT capability. While they share some utilities, the complexity is very high, so I suggest we split this into at least two PRs.
For PP -- it's not only a refactor, but it also trims some functionality we added recently. I suggest we be careful about the consequences and also do some investigation on Megatron PP so that we don't miss opportunities.
For FT, I feel the complexity is worth new training scripts and abstractions, instead of piggybacking on existing ones.
module_names_per_model_chunk: list[list[str]] = field(default_factory=list)
"""
Specify a list of lists containing the FQNs (Fully Qualified Names) of modules for each model chunk.
Can the FQNs be given at arbitrary granularity? E.g., say the split happens inside a TransformerBlock.
Yes, that's what makes it more flexible
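For instance, a chunking at sub-block granularity could look like the following (the sub-module names are illustrative, assuming a llama-style block with attention and feed_forward children; this is a sketch, not a config taken from the PR):

```python
# Illustrative only: a split that lands inside layers.1, so one chunk ends with
# that block's attention and the next chunk starts with its feed-forward.
module_names_per_model_chunk = [
    ["tok_embeddings", "layers.0", "layers.1.attention"],
    ["layers.1.feed_forward", "layers.2", "norm", "output"],
]
```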
torchtitan/distributed/pipeline.py
return stage_v_pairs[pp_rank]


def generate_module_names_per_stage(
It seems you are removing the num_layers_per_stage option (originally from internal training), which we worked very hard to make correct. Is there a strong reason it won't be needed? In general I see people use as many stages as possible to reduce bubble size, which I don't see us encouraging here -- we are almost encouraging the opposite.
Also, it's a good time to do some due diligence on PP feature parity with Megatron.
This is some PP config people use https://github.com/yanring/Megatron-MoE-ModelZoo/blob/main/model_configs/benchmarking/DeepSeek-V3.yaml#L136
Could you help me understand this and make sure our PP can support it?
I did miss that num_layers_per_stage was no longer supported. I added support for this. Users can choose 3 options:
- use layers per stage (and we determine the num stages)
- use stage FQNs
- by default we determine the stage FQNs (this requires no extra configs from the user). So this refactor doesn't change the UX for the user.
> Also, it's a good time to do some due diligence on PP feature parity with Megatron.

This is a good flag. I looked at what they're doing and it seems like they have a custom string format that they parse to determine their splitting: | denotes the split points, with different chars representing modules. https://github.com/NVIDIA/Megatron-LM/blob/9327e1978d29b574d8db7a3d43121a2a94986b3f/docs/source/api-guide/pipeline_parallel_layout.md?plain=1#L5. Functionally, we are doing something similar; our module FQNs are more verbose, but they are also more general and can handle more models/architectures. In our code, since the FQN generation is programmatic, users could also build off that (that is what TorchFT is going to do for their model chunking in streaming diloco).
torchtitan/train.py
ft = job_config.fault_tolerance

if ft.enable:
    if self.train_spec.ft_fragment_fn:
        self.ft_model_parts = self.train_spec.ft_fragment_fn(model, job_config, model_args)
    else:
        self.ft_model_parts = [model]
It has come to a point where I don't think we should add more ft code into train.py. E.g. it seems here the code doesn't work with PP.
Maybe it's time to consider adding an async_train.py?
@tianyu-l Right, they don't work together right now; that'll require some additional changes, though we should support that eventually. Most of the code will be copy-pasted though, with all the code blocks gated by if ft.enabled? If we do that it'll be hard to maintain both files -- we'll have to make sure all changes go to both. Maybe it's better to split off pipeline parallel into a separate training file, since training for PP actually looks different. There'll still be some duplication though.
torchtitan/protocols/train_spec.py
build_dataloader_fn: DataLoaderBuilder
build_tokenizer_fn: TokenizerBuilder | None
build_loss_fn: LossFunctionBuilder
ft_fragment_fn: FTFragmentFunction | None = None
TrainSpec contains the most fundamental components in model training. Every model needs to define these components in order to claim "model support" in torchtitan.
It seems to me FT is bringing a new training paradigm (semi-sync / async training) which significantly deviates from traditional synchronous training.
I suggest we reconsider and take an approach which best reflects the complexity and difference, maybe with a new training script & abstractions (e.g. having SemiSyncTrainSpec inherit TrainSpec).
Also, if the stuff is still in the experimental phase, you could consider putting it in the experiments folder instead of being blocked for too long.
Created a separate FaultTolerantTrainSpec
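For context, a rough sketch of the shape such a spec could take (the builder fields here are placeholders, not the actual definitions in torchtitan/protocols/train_spec.py):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TrainSpec:
    # Placeholder builders standing in for the real fields (dataloader, tokenizer, loss, ...).
    build_dataloader_fn: Callable
    build_loss_fn: Callable

@dataclass
class FaultTolerantTrainSpec(TrainSpec):
    # Splits the model into fragments for semi-sync / streaming diloco training;
    # None means the whole model is treated as a single fragment.
    ft_fragment_fn: Optional[Callable] = None
```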
@H-Huang can you remove my commit from your fork (not sure how it made it there) and force-push your branch? That'll hopefully remove my commit from this PR.

@tushar00jain oh I see. Got it. No worries!
Example:
    generate_llm_fqn_per_model_part(2, 3, input_weight=2, output_weight=2)
    treats embeddings as 2 layers and norm+output as 2 layers for distribution
Did I misunderstand? Input weight is 1, but this says 2
In the example, input_weight=2, so the input embeddings are treated as equal to 2 transformer layers when distributing modules across stages.
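To make the weighting concrete, here is a simplified sketch of the idea (not the actual generate_llm_fqn_per_model_part implementation; how any remainder is distributed is an assumption):

```python
def split_llm_fqns(num_stages: int, num_layers: int,
                   input_weight: int = 1, output_weight: int = 1) -> list[list[str]]:
    """Distribute module FQNs across stages, counting embeddings as
    `input_weight` layers and norm+output as `output_weight` layers."""
    effective = num_layers + input_weight + output_weight
    base, extra = divmod(effective, num_stages)  # stages differ by at most one effective layer

    chunks, layer = [], 0
    for stage in range(num_stages):
        budget = base + (1 if stage < extra else 0)
        names = []
        if stage == 0:
            names.append("tok_embeddings")
            budget -= input_weight
        if stage == num_stages - 1:
            budget -= output_weight
        while budget > 0 and layer < num_layers:
            names.append(f"layers.{layer}")
            layer += 1
            budget -= 1
        if stage == num_stages - 1:
            names += ["norm", "output"]
        chunks.append(names)
    return chunks

print(split_llm_fqns(2, 3, input_weight=2, output_weight=2))
# [['tok_embeddings', 'layers.0', 'layers.1'], ['layers.2', 'norm', 'output']]
```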
This API creates pipeline stages based on specified module names for each stage.
Args:
    whole_model: The complete model to be split
In prep for upstreaming, we should document the assumptions on how the model module should be written such that it's compatible with this utility:
- the forward function should tolerate deleted layers
- init_weights should also
Do we require layers to be a ModuleDict?
That's a good point! Will add. The function supports any ModuleDict, ModuleList, or regular module. ModuleDict and ModuleList both just add a "." to the FQN, which we parse out. Though your question made me realize we don't support nested ModuleDicts/ModuleLists, so I will add a comment for that too.
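A minimal sketch of that FQN handling (illustrative only, not the actual pipeline_module_split code; it handles just the nn.ModuleDict case, while the real utility also supports ModuleList and regular modules as noted above):

```python
import torch.nn as nn

def prune_to_stage(model: nn.Module, stage_fqns: list[str]) -> nn.Module:
    """Keep only the modules named in stage_fqns, deleting everything else."""
    keep: dict[str, set[str]] = {}
    for fqn in stage_fqns:
        # "layers.3" -> ("layers", "3"); "norm" -> ("norm", "")
        parent, _, child = fqn.partition(".")
        keep.setdefault(parent, set()).add(child)

    for name, module in list(model.named_children()):
        if name not in keep:
            # Drop the whole submodule; forward/init_weights must tolerate the missing piece.
            setattr(model, name, None)
        elif isinstance(module, nn.ModuleDict):
            # Keep only the named entries; nested ModuleDicts/ModuleLists are not handled here.
            for child_name in list(module.keys()):
                if child_name not in keep[name]:
                    del module[child_name]
    return model
```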
tianyu-l left a comment
In general looks good, but there are still some loose ends. Please address.
Also, let's re-enable all the PP tests in https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py that were disabled earlier.
# Calculate number of virtual stages needed (using ceiling division)
# This allows for unequal distribution where stages can differ by at most 1 layer
num_virtual_stages = math.ceil(num_layers / layers_per_stage)
I found this to be less satisfying than my previous approach: https://github.com/pytorch/torchtitan/pull/1416/files#diff-774b5ff5ea1f19065fae05cb4a9d3823becfe1b98422e0a3db5354853ce2b4cfL70-L71
Let's say we have num_layers = 6 and layers_per_stage = 2, then num_virtual_stages would be 3.
But if the PP degree is 4, there will be an exception below.
https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py#L285
CI doesn't seem to be capturing this failure; maybe we should modify it to 8 GPUs with layers_per_stage = 2.
However, my intention was to have input_weight = output_weight = 1 and have the total effective layers be 8, which is evenly divisible by layers_per_stage = 2, to put things on 4 PP ranks.
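For reference, the arithmetic in question as a quick sketch (just restating the numbers above):

```python
import math

num_layers, layers_per_stage = 6, 2

# Counting only transformer layers, as the current code does:
print(math.ceil(num_layers / layers_per_stage))  # 3 virtual stages -> not divisible across 4 PP ranks

# Counting "effective" layers with input_weight = output_weight = 1:
input_weight = output_weight = 1
effective_layers = num_layers + input_weight + output_weight
print(math.ceil(effective_layers / layers_per_stage))  # 4 virtual stages -> 1 stage per PP rank
```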
To fix this issue, I can see two approaches:
- put this logic back into the auto-split function generate_llm_fqn_per_model_part, with layers_per_stage being passed to that function
- expose input_weights and output_weights as configs, so that you have access to them here. If they don't sound generic enough, we can take the internal approach mentioned earlier -- less_layer_first_pp_stage / less_layer_last_pp_stage
I opted to expose input_weights and output_weights as configs and include them in my validation
> https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py#L285. CI seems not capturing this failure, maybe should modify to 8 GPU with layers_per_stage = 2.
Since this is Interleaved 1F1B for 4 ranks, there will be 8 stages, so 8 layers / 8 stages == 1 layer per stage.
tianyu-l left a comment
almost good, one last concern
for _ in range(effective_layers_for_stage):
    if current_layer < num_layers:
        stage_modules.append(f"layers.{current_layer}")
        current_layer += 1
I didn't fully get this comment #1416 (comment):
> yeah on second thought, because we do some validation outside of this method i can remove this check
There are two levels of checks.
- feasibility check
At the very least we should assert the existence of >= 1 layer in each PP stage?
Think of the following imbalanced setup:
num_stages = 200
num_layers = 0
input_weights = 100
output_weights = 100
You'll get
layers_per_stage = 1
extra_layers = 0
Then we would end up with PP stage 0 having the input, PP stage 99 having the output, and the other stages being empty.
- balance check
The reason we introduce input_weights and output_weights is to balance the first & last stage, while keeping the other stages as balanced as possible.
Given that you want to allow for "slight imbalance" -- stages in the front can have x effective layers, stages in the back can have y effective layers, where y+1 >= x >= y -- we should check input_weights <= x and output_weights <= y (see the sketch below).
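A rough sketch of what those two checks could look like (illustrative only, not the exact validation added in the PR):

```python
def validate_split(num_stages: int, num_layers: int,
                   input_weight: int, output_weight: int) -> None:
    effective = num_layers + input_weight + output_weight
    base, extra = divmod(effective, num_stages)

    # Feasibility: every stage needs at least one effective layer's worth of budget.
    assert base >= 1, f"{num_stages} stages is too many for {effective} effective layers"

    # Balance: front stages get x = base + 1 effective layers (if there is a remainder),
    # back stages get y = base; the first/last stage must fit their weighted modules.
    x = base + (1 if extra else 0)
    y = base
    assert input_weight <= x, "input_weight exceeds the first stage's budget"
    assert output_weight <= y, "output_weight exceeds the last stage's budget"

# The degenerate setup above (200 stages, 0 layers, weights of 100) passes the raw
# budget math (base == 1) but fails the balance check:
# validate_split(200, 0, 100, 100)  # AssertionError: input_weight exceeds the first stage's budget
```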
Yeah, I meant to say that there is validation when users pass in job_config.parallelism.pipeline_parallel_layers_per_stage to ensure that it makes sense for the model / schedule. Once we get into this method we don't do as much validation, but I think the edge cases you mentioned are fair to point out. I added conditions for these.
tianyu-l left a comment
LGTM, thanks for the refactor!
This refactors the PP splitting logic to consolidate around setting FQNs for each model chunk. For example:

```
[
    ['tok_embeddings', 'layers.0'],  # stage0
    ['layers.1', 'layers.2'],        # stage1
    ['layers.3', 'layers.4'],        # stage2
    ...                              # and so on
]
```

This is better because it can generally be applied to all models, and the code can be re-used for cases that don't explicitly require pipelined execution (for example, streaming diloco needs to communicate model chunks).

Changes:
- Refactor deepseekv3 and llama to share the same pipeline util functions
- Add module_names_per_model_chunk config, deprecate pipeline_parallel_split_points

TODO (follow-up PRs):
- `pipeline_module_split` will be upstreamed to PyTorch as a `torch.distributed.pipelining` utility since it contains no model-specific code.
- Additional changes are needed to get this to work for torchft streaming diloco, including updating the training loop to not execute if the pipeline schedule isn't set and making sure the pipelining_fn returns the correct model chunks.

cc @tushar00jain