Conversation

@H-Huang (Member) commented Jul 18, 2025

This refactors the PP splitting logic to consolidate around setting FQNs for each model chunk. For example:

```
[
  ['tok_embeddings', 'layers.0'], # stage0
  ['layers.1', 'layers.2'],       # stage1
  ['layers.3', 'layers.4'],       # stage2
  ...                             # and so on
]
```

This approach is better because it applies generally to all models, and the code can be reused for cases that don't explicitly require pipelined execution (for example, streaming DiLoCo needs to communicate model chunks).
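
To make the idea concrete, here is a minimal sketch of the chunking concept, assuming the model exposes its layers as a top-level ModuleDict/ModuleList (as torchtitan's llama does). `split_by_fqns` is a hypothetical name for illustration, not the actual torchtitan utility:

```python
# Minimal sketch only: build model chunks from per-chunk FQN lists by pruning
# everything that is not listed. Not the torchtitan implementation.
import copy

import torch.nn as nn


def split_by_fqns(model: nn.Module, fqns_per_chunk: list[list[str]]) -> list[nn.Module]:
    """Return one deep-copied chunk per FQN list, pruning everything not listed."""
    chunks = []
    for keep in fqns_per_chunk:
        chunk = copy.deepcopy(model)
        for name, child in list(chunk.named_children()):
            if name in keep:
                continue  # keep the whole submodule, e.g. "tok_embeddings"
            if isinstance(child, (nn.ModuleDict, nn.ModuleList)):
                # Keep only the entries listed as "<name>.<key>", e.g. "layers.0".
                wanted = {fqn.split(".", 1)[1] for fqn in keep if fqn.startswith(name + ".")}
                setattr(chunk, name, nn.ModuleDict(
                    {k: m for k, m in child.named_children() if k in wanted}
                ))
            else:
                # Not part of this chunk; the chunk's forward() must tolerate None here.
                setattr(chunk, name, None)
        chunks.append(chunk)
    return chunks
```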

Changes:

  • Refactor deepseekv3 and llama to share the same pipeline util functions
  • Add module_names_per_model_chunk config, deprecate pipeline_parallel_split_points

TODO (follow up PRs):

  • pipeline_module_split will be upstreamed to PyTorch as a torch.distributed.pipelining utility since it contains no model specific code.
  • Additional changes are needed to get this to work for torchft streaming diloco, including updating the training loop to not execute if the pipeline schedule isn't set and making sure the pipelining_fn returns the correct model chunks.

cc @tushar00jain

@meta-cla bot added the CLA Signed label on Jul 18, 2025
@tushar00jain (Contributor)

@H-Huang should the config be in terms of FQNs of the split points, instead of having to specify the whole model's module FQNs?

@H-Huang (Member Author) commented Jul 21, 2025

@tushar00jain Right now the model doesn't use any torch.distributed.pipelining.SplitPoint in its definition. The model is manually split by removing layers and adapting the forward function:

```python
self.tok_embeddings = nn.Embedding(model_args.vocab_size, model_args.dim)
# TODO persistent should be set to false, since this buffer can be recomputed.
# however, we set it to true for 2 reasons. (1) due to pytorch/pytorch#123411,
# compile or pipeline-tracer will not correctly handle non-persistent buffers,
# so we need to fix that. (2) if we initialize pipeline-parallel models from
# a seed checkpoint rather than calling init_weights, we need freqs_cis to be
# initialized by the checkpoint, or we need to add a separate initializer for
# just the non-persistent buffers that is called after loading checkpoints.
self.register_buffer("freqs_cis", self._precompute_freqs_cis(), persistent=True)
self.layers = torch.nn.ModuleDict()
for layer_id in range(model_args.n_layers):
    self.layers[str(layer_id)] = TransformerBlock(layer_id, model_args)
self.norm = nn.RMSNorm(model_args.dim, eps=model_args.norm_eps)
self.output = nn.Linear(model_args.dim, model_args.vocab_size, bias=False)
```

If we were to use `pipeline` to split the model we can revisit this, but I believe when it was first tested there were some issues tracing the full model.
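
For readers of this thread, a hedged sketch of what "manually split" means in practice: after pruning, the stage's forward path simply skips the modules it no longer owns. This is illustrative only, not torchtitan's actual Transformer.forward, and `chunk_forward` is a hypothetical helper:

```python
# Illustrative only: a forward path written to tolerate pruned/None modules,
# in the spirit of the manually split model above (not its real code).
import torch
import torch.nn as nn


def chunk_forward(chunk: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Run one pipeline stage, skipping any module that was pruned from it."""
    h = x
    if getattr(chunk, "tok_embeddings", None) is not None:  # only stage 0 has this
        h = chunk.tok_embeddings(h)
    for layer in chunk.layers.values():  # only this stage's layers remain
        h = layer(h)
    if getattr(chunk, "norm", None) is not None:  # only the last stage has these
        h = chunk.norm(h)
    if getattr(chunk, "output", None) is not None:
        h = chunk.output(h)
    return h
```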

@tushar00jain (Contributor)

@H-Huang sure. BTW, should we also have a script to generate this list for llama models, for debugging? Since the list can be large, maybe put it in a file and have the torchtitan config read from a specified file?

@weifengpy (Contributor)

/gemini review

@weifengpy (Contributor)

@gemini-code-assist review

```python
    but currently the split points must be specified manually.
    """

    module_names_per_model_chunk: list[list[str]] = field(default_factory=list)
```
Contributor:

Should this go under a different section, since it'll also be reused by FT? If both PP and FT are enabled, we use the same chunking for both, right? Or do we want to allow specifying it differently for each?

Contributor:

It's probably good to keep the chunking for PP and FT separate. We'll need to figure out how to combine them eventually, where we split each pipeline chunk into DiLoCo fragments.

@tushar00jain force-pushed the pp_config branch 2 times, most recently from a11e9f5 to c4d11b7 (July 23, 2025 05:52)
@tianyu-l (Contributor) left a comment:

I think this PR is mixing two things: the PP refactor and adding new FT capability. While they share some utilities, the combined complexity is very high, so I suggest we split this into at least two PRs.

For PP -- it's not only a refactor, it also trims some functionality we added recently. I suggest we be careful about the consequences and also do some investigation on Megatron PP so that we don't miss opportunities.

For FT, I feel the complexity is worth new training scripts and abstractions, instead of piggybacking on existing ones.


```python
    module_names_per_model_chunk: list[list[str]] = field(default_factory=list)
    """
    Specify a list of lists containing the FQNs (Fully Qualified Names) of modules for each model chunk.
```
Contributor:

Can the FQNs be given at arbitrary granularity? E.g., say the split happens inside a TransformerBlock.

Member Author:

Yes, that's what makes it more flexible.

```python
    return stage_v_pairs[pp_rank]


def generate_module_names_per_stage(
```
Contributor:

It seems you are removing the option num_layers_per_stage (originally from internal training), which we worked very hard to make correct. Is there a strong reason it won't be needed? In general I see people use as many stages as possible to reduce bubble size, which I don't see us encouraging -- we are almost encouraging the opposite here.

Also, it's a good time to do some due diligence on PP feature parity with Megatron.
This is a PP config people use: https://github.com/yanring/Megatron-MoE-ModelZoo/blob/main/model_configs/benchmarking/DeepSeek-V3.yaml#L136
Could you help understand it and make sure our PP could support it?

Member Author:

I did miss that num_layers_per_stage was no longer supported. I added support for this. Users can choose from 3 options:

  1. use layers per stage (and we determine the number of stages)
  2. use stage FQNs
  3. by default, we determine the stage FQNs (this requires no extra configs from the user), so this refactor doesn't change the UX for the user.

> Also, it's a good time to do some due diligence on PP feature parity with Megatron.

This is a good flag. I looked at what they're doing, and it seems they have a custom string format that they parse to determine their splitting: | denotes the split points, with different chars representing modules (https://github.com/NVIDIA/Megatron-LM/blob/9327e1978d29b574d8db7a3d43121a2a94986b3f/docs/source/api-guide/pipeline_parallel_layout.md?plain=1#L5). Functionally, we are doing something similar; our module FQNs are more verbose, but they are also more general and can handle more models/architectures. In our code, since the FQN generation is programmatic, users can also build off it (that is what TorchFT is going to do for its model chunking in streaming DiLoCo).

Comment on lines 260 to 268
```python
ft = job_config.fault_tolerance

if ft.enable:
    if self.train_spec.ft_fragment_fn:
        self.ft_model_parts = self.train_spec.ft_fragment_fn(model, job_config, model_args)
    else:
        self.ft_model_parts = [model]
```


Contributor:

It has come to a point where I don't think we should add more FT code into train.py. E.g., it seems the code here doesn't work with PP.

Maybe it's time to consider adding an async_train.py?

Contributor:

@tianyu-l Right, they don't work together right now; that'll require some additional changes, though we should support it eventually. If we add an async_train.py, most of the code will be copy-pasted, along with all the `if ft.enabled` code blocks. That will make both files hard to maintain, since we'll have to make sure all changes go to both. Maybe it's better to split pipeline parallel off into a separate training file, since training with PP actually looks different -- though that would still involve some duplication.

```python
    build_dataloader_fn: DataLoaderBuilder
    build_tokenizer_fn: TokenizerBuilder | None
    build_loss_fn: LossFunctionBuilder
    ft_fragment_fn: FTFragmentFunction | None = None
```
Contributor:

TrainSpec contains the most fundamental components in model training. Every model needs to define these components in order to claim "model support" in torchtitan.

It seems to me FT is bringing in a new training paradigm (semi-sync / async training) which significantly deviates from traditional synchronous training.

I suggest we reconsider and take an approach which best reflects the complexity and difference, maybe with a new training script & abstractions (e.g. having SemiSyncTrainSpec inherit TrainSpec).

Also, if this is still in an experimental phase, you could consider putting it in the experiments folder instead of being blocked for too long.

Contributor:

Created a separate FaultTolerantTrainSpec.
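
A rough sketch of the shape such a split could take (illustrative only; the builder types are simplified to Callable and the real torchtitan definitions differ):

```python
# Illustrative sketch, not the actual torchtitan code: the FT-specific hook
# lives on a subclass so the base TrainSpec stays purely synchronous.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TrainSpec:
    build_dataloader_fn: Callable
    build_tokenizer_fn: Optional[Callable]
    build_loss_fn: Callable


@dataclass
class FaultTolerantTrainSpec(TrainSpec):
    # Splits a model into fragments for streaming DiLoCo; optional so models
    # without FT support can keep registering a plain spec.
    ft_fragment_fn: Optional[Callable] = None
```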

@tushar00jain (Contributor)

@tianyu-l This is my bad. These are supposed to be 2 different PRs, as there are 2 commits in this PR. One was created by me to work on top of @H-Huang's diff to add FT functionality. Not sure why GitHub added my commits to @H-Huang's PR, which is on his fork 🤷

@tushar00jain (Contributor)

@H-Huang can you remove my commit from your fork (not sure how it made it there) and force-push your branch? That'll hopefully remove my commit from this PR.

@tianyu-l (Contributor)

@tushar00jain Oh I see. Got it. No worries!

@H-Huang force-pushed the pp_config branch 2 times, most recently from 0f8749a to 1d1e4c6 (July 23, 2025 18:08)
@H-Huang requested a review from tianyu-l (July 23, 2025 18:14)
```python
    Example:
        generate_llm_fqn_per_model_part(2, 3, input_weight=2, output_weight=2)
        treats embeddings as 2 layers and norm+output as 2 layers for distribution
```
Contributor:

Did I misunderstand? Input weight is 1, but this says 2.

Member Author:

In the example input_weight=2, so the input embeddings are treated as equal to 2 transformer layers.
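
For readers following along, a hedged sketch of the weighting being described. This is not the actual generate_llm_fqn_per_model_part; the name fqns_per_stage, the rounding, and the exact placement of layers are assumptions:

```python
# Illustrative sketch of weighted FQN distribution across stages; not
# torchtitan's generate_llm_fqn_per_model_part.
def fqns_per_stage(num_stages: int, num_layers: int,
                   input_weight: int = 1, output_weight: int = 1) -> list[list[str]]:
    # Embeddings count as `input_weight` layers, norm+output as `output_weight`.
    effective = num_layers + input_weight + output_weight
    base, extra = divmod(effective, num_stages)  # stages differ by at most one

    stages, layer = [], 0
    for stage in range(num_stages):
        budget = base + (1 if stage < extra else 0)
        mods: list[str] = []
        if stage == 0:
            mods.append("tok_embeddings")
            budget -= input_weight
        if stage == num_stages - 1:
            budget -= output_weight
        while budget > 0 and layer < num_layers:
            mods.append(f"layers.{layer}")
            layer += 1
            budget -= 1
        if stage == num_stages - 1:
            mods += ["norm", "output"]
        stages.append(mods)
    return stages


# The docstring example: 2 stages, 3 layers, both weights = 2 (7 effective layers)
# -> [['tok_embeddings', 'layers.0', 'layers.1'], ['layers.2', 'norm', 'output']]
print(fqns_per_stage(2, 3, input_weight=2, output_weight=2))
```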

```python
    This API creates pipeline stages based on specified module names for each stage.
    Args:
        whole_model: The complete model to be split
```
Contributor:

In prep for upstreaming, we should document the assumptions on how the module should be written so that it's compatible with this utility:

  • the forward function should tolerate deleted layers

  • init_weights should as well

Do we require layers to be a ModuleDict?

Member Author:

That's a good point! Will add. The function supports any ModuleDict, ModuleList, or regular module. ModuleDict and ModuleList both just add a "." to the FQN, which we parse out. Though your question made me realize we don't support nested ModuleDicts/Lists, so I will add a comment for that too.
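
A tiny, hedged illustration of the FQN shapes being discussed, purely to show why a single level of "." is easy to parse and nesting is not (this is not code from the PR):

```python
import torch.nn as nn

# Top-level containers produce single-dot FQNs like "layers.0"; a nested
# container produces "groups.a.0", which is why nested ModuleDicts/Lists need
# extra handling the current utility (per the comment above) doesn't do yet.
model = nn.Module()
model.layers = nn.ModuleDict({"0": nn.Linear(4, 4), "1": nn.Linear(4, 4)})
model.blocks = nn.ModuleList([nn.Linear(4, 4)])
model.groups = nn.ModuleDict({"a": nn.ModuleDict({"0": nn.Linear(4, 4)})})

for fqn, _ in model.named_modules():
    if fqn:
        print(fqn)  # layers, layers.0, layers.1, blocks, blocks.0, groups, groups.a, groups.a.0
```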

@tianyu-l (Contributor) left a comment:

In general looks good, but there are still some loose ends. Please address.

Also, let's re-enable all the PP tests in https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py that were disabled earlier.


```python
    # Calculate number of virtual stages needed (using ceiling division)
    # This allows for unequal distribution where stages can differ by at most 1 layer
    num_virtual_stages = math.ceil(num_layers / layers_per_stage)
```
Contributor:

I found this to be less satisfying than my previous approach: https://github.com/pytorch/torchtitan/pull/1416/files#diff-774b5ff5ea1f19065fae05cb4a9d3823becfe1b98422e0a3db5354853ce2b4cfL70-L71

Let's say we have num_layers = 6 and layers_per_stage = 2; then num_virtual_stages would be 3. But if the PP degree is 4, there will be an exception below.

https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py#L285
CI doesn't seem to capture this failure; maybe it should be modified to 8 GPUs with layers_per_stage = 2.

However, my intention was to have input_weight = output_weight = 1 and have the total effective layers be 8, which is evenly divisible by layers_per_stage = 2 to put things on 4 PP ranks.

To fix this issue, I can see two approaches (see the arithmetic sketch after this list):

  1. put this logic back into the auto-split function generate_llm_fqn_per_model_part, with layers_per_stage being passed to that function
  2. expose input_weights and output_weights as configs, so that you have access to them here. If they don't sound generic enough, we can take the internal approach mentioned earlier -- less_layer_first_pp_stage / less_layer_last_pp_stage
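
A small worked check of the numbers above (arithmetic only; variable names are assumptions):

```python
import math

# Reviewer's failing case: 6 transformer layers, 2 layers per stage, PP degree 4.
num_layers, layers_per_stage, pp_degree = 6, 2, 4
num_virtual_stages = math.ceil(num_layers / layers_per_stage)   # 3
print(num_virtual_stages % pp_degree == 0)                      # False -> raises later

# Intended accounting: count input and output as 1 effective layer each, so the
# total is 8, which divides evenly into 4 stages of 2 (or 8 stages of 1 for an
# interleaved schedule with 2 stages per rank).
input_weight = output_weight = 1
effective_layers = num_layers + input_weight + output_weight     # 8
print(effective_layers // layers_per_stage)                      # 4 stages
```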

Member Author:

I opted to expose input_weights and output_weights as configs and include them in my validation.

> https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py#L285 -- CI doesn't seem to capture this failure; maybe it should be modified to 8 GPUs with layers_per_stage = 2.

Since this is Interleaved 1F1B for 4 ranks, there will be 8 stages, so 8 layers / 8 stages == 1 layer per stage.

@tianyu-l (Contributor) left a comment:

Almost good, one last concern.

Comment on lines +207 to +230
```python
        for _ in range(effective_layers_for_stage):
            if current_layer < num_layers:
                stage_modules.append(f"layers.{current_layer}")
                current_layer += 1
```
Contributor:

I didn't fully get this comment: #1416 (comment)

> yeah on second thought, because we do some validation outside of this method i can remove this check

There are two levels of check.

1. Feasibility check

   At the very least we should assert the existence of >= 1 layer in each PP stage. Think of the following imbalanced setup:

   num_stages = 200
   num_layers = 0
   input_weights = 100
   output_weights = 100

   You'll get

   layers_per_stage = 1
   extra_layers = 0

   and we would end up with PP stage 0 having the input, PP stage 99 having the output, and all the other stages being empty.

2. Balance check

   The reason we introduce input_weights / output_weights is to balance the first & last stage, while keeping the other stages as balanced as possible. Given that you want to allow for "slight imbalance" -- stages in the front can have x effective layers, stages in the back can have y effective layers, where y + 1 >= x >= y -- we should check input_weights <= x and output_weights <= y.
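
A hedged sketch of what these two checks could look like (names and assert messages are assumptions; the conditions actually added are in the diff linked in the reply below):

```python
# Illustrative only: the two levels of validation described above, written as
# assertions. Variable names mirror the discussion, not the actual PR code.
def validate_stage_split(num_stages: int, num_layers: int,
                         input_weight: int, output_weight: int) -> None:
    effective_layers = num_layers + input_weight + output_weight

    # Feasibility: every stage must receive at least one effective layer.
    assert effective_layers >= num_stages, (
        f"{num_stages} stages but only {effective_layers} effective layers; "
        "some stages would be empty"
    )

    # Balance: front stages get x effective layers, back stages get y, with
    # y + 1 >= x >= y; the input/output weights must fit in their stage budget.
    y, extra = divmod(effective_layers, num_stages)
    x = y + (1 if extra else 0)
    assert input_weight <= x, "input modules don't fit in the first stage's budget"
    assert output_weight <= y, "output modules don't fit in the last stage's budget"
```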

@H-Huang (Member Author) commented Jul 30, 2025

Yeah, I meant to say that there is validation when users pass in job_config.parallelism.pipeline_parallel_layers_per_stage to ensure it makes sense for the model / schedule. Once we get into this method we don't do as much validation, but I think the edge cases you mentioned are fair to point out. I added conditions for these:

https://github.com/pytorch/torchtitan/pull/1416/files#diff-774b5ff5ea1f19065fae05cb4a9d3823becfe1b98422e0a3db5354853ce2b4cfR168-R186

@tianyu-l (Contributor) left a comment:

LGTM, thanks for the refactor!

@H-Huang merged commit 3c84ce0 into pytorch:main on Jul 30, 2025 (8 checks passed)
bentherien pushed a commit to bentherien/torchtitan_ that referenced this pull request on Aug 5, 2025.
joellidin pushed two commits to one-covenant/torchtitan that referenced this pull request on Aug 8, 2025.