Refactor PP splitting #1416
Conversation
@H-Huang should the config be in terms of FQNs of the split points instead of having to specify the whole model's module FQNs?

@tushar00jain Right now we don't use any tracing-based pipeline splitting for the model (see torchtitan/torchtitan/models/llama3/model/model.py, lines 337 to 352 in beb29a1). We can revisit this, but I believe when it was first tested there were some issues tracing the full model.
@H-Huang sure, btw should we also have a script to generate this list for llama models for debugging? Since the list can be large, maybe put it in a file and have the torchtitan config read from a specified file?

/gemini review

@gemini-code-assist review
torchtitan/config_manager.py
but currently the split points must be specified manually.
"""

module_names_per_model_chunk: list[list[str]] = field(default_factory=list)
Should this go under a different section since it'll also be reused by FT? If both PP and FT are enabled, do we use the same chunking for both, or do we want to allow specifying it differently for each?
It's probably good to just keep the chunking for PP and FT separate. We'll need to figure out how to combine them eventually, where we split each pipeline chunk into diloco fragments.
tianyu-l left a comment
I think this PR is mixing two things: a PP refactor and adding a new FT capability. While they share some utilities, the complexity is very high, so I suggest we split this into at least two PRs.
For PP -- it's not only a refactor, but it also trims some functionality we added recently. I suggest we be careful about the consequences and also do some investigation on Megatron PP so that we don't miss opportunities.
For FT, I feel the complexity is worth new training scripts and abstractions, instead of piggybacking on existing ones.
module_names_per_model_chunk: list[list[str]] = field(default_factory=list)
"""
Specify a list of lists containing the FQNs (Fully Qualified Names) of modules for each model chunk.
Can the FQNs be given at arbitrary granularity? E.g., say the split happens inside a TransformerBlock.
Yes, that's what makes it more flexible
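For instance, a chunking at sub-block granularity could look like the following (the sub-module names are illustrative, assuming a llama-style block with attention and feed_forward children; this is a sketch, not a config taken from the PR):

```python
# Illustrative only: a split that lands inside layers.1, so one chunk ends with
# that block's attention and the next chunk starts with its feed-forward.
module_names_per_model_chunk = [
    ["tok_embeddings", "layers.0", "layers.1.attention"],
    ["layers.1.feed_forward", "layers.2", "norm", "output"],
]
```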
torchtitan/distributed/pipeline.py
return stage_v_pairs[pp_rank]


def generate_module_names_per_stage(
It seems you are removing the num_layers_per_stage option (originally from internal training), which we worked very hard to make correct. Is there a strong reason it won't be needed? In general I see people use as many stages as possible to reduce bubble size, which I don't see us encouraging here -- we are almost encouraging the opposite.
Also, it's a good time to do some due diligence on PP feature parity with Megatron.
This is some PP config people use https://github.com/yanring/Megatron-MoE-ModelZoo/blob/main/model_configs/benchmarking/DeepSeek-V3.yaml#L136
Could you help me understand this and make sure our PP can support it?
I did miss that num_layers_per_stage was no longer supported. I added support for this. Users can choose 3 options:
- use layers per stage (and we determine the num stages)
- use stage FQNs
- by default we determine the stage FQNs (this requires no extra configs from the user). So this refactor doesn't change the UX for the user.
> Also, it's a good time to do some due diligence on PP feature parity with Megatron.

This is a good flag. I looked at what they're doing and it seems like they have a custom string format that they parse to determine their splitting: | denotes the split points, with different chars representing modules. https://github.com/NVIDIA/Megatron-LM/blob/9327e1978d29b574d8db7a3d43121a2a94986b3f/docs/source/api-guide/pipeline_parallel_layout.md?plain=1#L5. Functionally, we are doing something similar; our module FQNs are more verbose, but they are also more general and can handle more models/architectures. In our code, since the FQN generation is programmatic, users could also build off that (that is what TorchFT is going to do for their model chunking in streaming diloco).
torchtitan/train.py
ft = job_config.fault_tolerance

if ft.enable:
    if self.train_spec.ft_fragment_fn:
        self.ft_model_parts = self.train_spec.ft_fragment_fn(model, job_config, model_args)
    else:
        self.ft_model_parts = [model]
It has come to a point where I don't think we should add more ft code into train.py. E.g. it seems here the code doesn't work with PP.
Maybe it's time to consider adding an async_train.py?
@tianyu-l Right, they don't work together right now; that'll require some additional changes, though we should support that eventually. Most of the code will be copy-pasted though, with all the code blocks gated by if ft.enabled? If we do that it'll be hard to maintain both files -- we'll have to make sure all changes go to both. Maybe it's better to split off pipeline parallel into a separate training file, since training for PP actually looks different. There'll still be some duplication though.
torchtitan/protocols/train_spec.py
build_dataloader_fn: DataLoaderBuilder
build_tokenizer_fn: TokenizerBuilder | None
build_loss_fn: LossFunctionBuilder
ft_fragment_fn: FTFragmentFunction | None = None
TrainSpec contains the most fundamental components in model training. Every model needs to define these components in order to claim "model support" in torchtitan.
It seems to me FT is bringing a new training paradigm (semi-sync / async training) which significantly deviates from traditional synchronous training.
I suggest we reconsider and take an approach which best reflects the complexity and difference, maybe with a new training script & abstractions (e.g. having SemiSyncTrainSpec inherit TrainSpec).
Also, if the stuff is still in the experimental phase, you could consider putting it in the experiments folder instead of being blocked for too long.
Created a separate FaultTolerantTrainSpec
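For context, a rough sketch of the shape such a spec could take (the builder fields here are placeholders, not the actual definitions in torchtitan/protocols/train_spec.py):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TrainSpec:
    # Placeholder builders standing in for the real fields (dataloader, tokenizer, loss, ...).
    build_dataloader_fn: Callable
    build_loss_fn: Callable

@dataclass
class FaultTolerantTrainSpec(TrainSpec):
    # Splits the model into fragments for semi-sync / streaming diloco training;
    # None means the whole model is treated as a single fragment.
    ft_fragment_fn: Optional[Callable] = None
```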
@H-Huang can you remove my commit from your fork (not sure how it made it there) and force-push your branch? That'll hopefully remove my commit from this PR.

@tushar00jain oh I see. Got it. No worries!
Example:
    generate_llm_fqn_per_model_part(2, 3, input_weight=2, output_weight=2)
    treats embeddings as 2 layers and norm+output as 2 layers for distribution
Did I misunderstand? Input weight is 1, but this says 2
In the example, input_weight=2, so the input embeddings are treated as equal to 2 transformer layers when distributing modules across stages.
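To make the weighting concrete, here is a simplified sketch of the idea (not the actual generate_llm_fqn_per_model_part implementation; how any remainder is distributed is an assumption):

```python
def split_llm_fqns(num_stages: int, num_layers: int,
                   input_weight: int = 1, output_weight: int = 1) -> list[list[str]]:
    """Distribute module FQNs across stages, counting embeddings as
    `input_weight` layers and norm+output as `output_weight` layers."""
    effective = num_layers + input_weight + output_weight
    base, extra = divmod(effective, num_stages)  # stages differ by at most one effective layer

    chunks, layer = [], 0
    for stage in range(num_stages):
        budget = base + (1 if stage < extra else 0)
        names = []
        if stage == 0:
            names.append("tok_embeddings")
            budget -= input_weight
        if stage == num_stages - 1:
            budget -= output_weight
        while budget > 0 and layer < num_layers:
            names.append(f"layers.{layer}")
            layer += 1
            budget -= 1
        if stage == num_stages - 1:
            names += ["norm", "output"]
        chunks.append(names)
    return chunks

print(split_llm_fqns(2, 3, input_weight=2, output_weight=2))
# [['tok_embeddings', 'layers.0', 'layers.1'], ['layers.2', 'norm', 'output']]
```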
This API creates pipeline stages based on specified module names for each stage.
Args:
    whole_model: The complete model to be split
In prep for upstreaming, we should document the assumptions on how the model module should be written such that it's compatible with this utility:
- the forward function should tolerate deleted layers
- init_weights should also
Do we require layers to be a ModuleDict?
That's a good point! Will add. The function supports any ModuleDict, ModuleList, or regular module. ModuleDict and ModuleList both just add a "." to the FQN, which we parse out. Though your question made me realize we don't support nested ModuleDicts/ModuleLists, so I will add a comment for that too.
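A minimal sketch of that FQN handling (illustrative only, not the actual pipeline_module_split code; it handles just the nn.ModuleDict case, while the real utility also supports ModuleList and regular modules as noted above):

```python
import torch.nn as nn

def prune_to_stage(model: nn.Module, stage_fqns: list[str]) -> nn.Module:
    """Keep only the modules named in stage_fqns, deleting everything else."""
    keep: dict[str, set[str]] = {}
    for fqn in stage_fqns:
        # "layers.3" -> ("layers", "3"); "norm" -> ("norm", "")
        parent, _, child = fqn.partition(".")
        keep.setdefault(parent, set()).add(child)

    for name, module in list(model.named_children()):
        if name not in keep:
            # Drop the whole submodule; forward/init_weights must tolerate the missing piece.
            setattr(model, name, None)
        elif isinstance(module, nn.ModuleDict):
            # Keep only the named entries; nested ModuleDicts/ModuleLists are not handled here.
            for child_name in list(module.keys()):
                if child_name not in keep[name]:
                    del module[child_name]
    return model
```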
tianyu-l left a comment
In general looks good, but there are still some loose ends. Please address.
Also, let's re-enable all the PP tests in https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py that were disabled earlier.
# Calculate number of virtual stages needed (using ceiling division)
# This allows for unequal distribution where stages can differ by at most 1 layer
num_virtual_stages = math.ceil(num_layers / layers_per_stage)
I found this to be less satisfying than my previous approach: https://github.com/pytorch/torchtitan/pull/1416/files#diff-774b5ff5ea1f19065fae05cb4a9d3823becfe1b98422e0a3db5354853ce2b4cfL70-L71
Let's say we have num_layers = 6 and layers_per_stage = 2, then num_virtual_stages would be 3.
But if the PP degree is 4, there will be an exception below.
https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py#L285
CI doesn't seem to be capturing this failure; maybe we should modify it to 8 GPUs with layers_per_stage = 2.
However, my intention was to have input_weight = output_weight = 1 and have the total effective layers be 8, which is evenly divisible by layers_per_stage = 2, to put things on 4 PP ranks.
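For reference, the arithmetic in question as a quick sketch (just restating the numbers above):

```python
import math

num_layers, layers_per_stage = 6, 2

# Counting only transformer layers, as the current code does:
print(math.ceil(num_layers / layers_per_stage))  # 3 virtual stages -> not divisible across 4 PP ranks

# Counting "effective" layers with input_weight = output_weight = 1:
input_weight = output_weight = 1
effective_layers = num_layers + input_weight + output_weight
print(math.ceil(effective_layers / layers_per_stage))  # 4 virtual stages -> 1 stage per PP rank
```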
To fix this issue, I can see two approaches:
- put this logic back into the auto-split function generate_llm_fqn_per_model_part, with layers_per_stage being passed to that function
- expose input_weights and output_weights as configs, so that you have access to them here. If they don't sound generic enough, we can take the internal approach mentioned earlier -- less_layer_first_pp_stage / less_layer_last_pp_stage
I opted to expose input_weights and output_weights as configs and include them in my validation
> https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py#L285. CI seems not capturing this failure, maybe should modify to 8 GPU with layers_per_stage = 2.
Since this is Interleaved 1F1B for 4 ranks, there will be 8 stages, so 8 layers / 8 stages == 1 layer per stage.
tianyu-l left a comment
almost good, one last concern
for _ in range(effective_layers_for_stage):
    if current_layer < num_layers:
        stage_modules.append(f"layers.{current_layer}")
        current_layer += 1
I didn't fully get this comment #1416 (comment):
> yeah on second thought, because we do some validation outside of this method i can remove this check
There are two levels of checks.
- feasibility check
At the very least we should assert the existence of >= 1 layer in each PP stage?
Think of the following imbalanced setup:
num_stages = 200
num_layers = 0
input_weights = 100
output_weights = 100
You'll get
layers_per_stage = 1
extra_layers = 0
Then we would end up with PP stage 0 having the input, PP stage 99 having the output, and the other stages being empty.
- balance check
The reason we introduce input_weights and output_weights is to balance the first & last stage, while keeping the other stages as balanced as possible.
Given that you want to allow for "slight imbalance" -- stages in the front can have x effective layers, stages in the back can have y effective layers, where y+1 >= x >= y -- we should check input_weights <= x and output_weights <= y (see the sketch below).
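A rough sketch of what those two checks could look like (illustrative only, not the exact validation added in the PR):

```python
def validate_split(num_stages: int, num_layers: int,
                   input_weight: int, output_weight: int) -> None:
    effective = num_layers + input_weight + output_weight
    base, extra = divmod(effective, num_stages)

    # Feasibility: every stage needs at least one effective layer's worth of budget.
    assert base >= 1, f"{num_stages} stages is too many for {effective} effective layers"

    # Balance: front stages get x = base + 1 effective layers (if there is a remainder),
    # back stages get y = base; the first/last stage must fit their weighted modules.
    x = base + (1 if extra else 0)
    y = base
    assert input_weight <= x, "input_weight exceeds the first stage's budget"
    assert output_weight <= y, "output_weight exceeds the last stage's budget"

# The degenerate setup above (200 stages, 0 layers, weights of 100) passes the raw
# budget math (base == 1) but fails the balance check:
# validate_split(200, 0, 100, 100)  # AssertionError: input_weight exceeds the first stage's budget
```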
Yeah, I meant to say that there is validation when users pass in job_config.parallelism.pipeline_parallel_layers_per_stage to ensure that it makes sense for the model / schedule. Once we get into this method we don't do as much validation, but I think the edge cases you mentioned are fair to point out. I added conditions for these.
tianyu-l left a comment
LGTM, thanks for the refactor!
This refactors the PP splitting logic to consolidate around setting FQNs for each model chunk. For example:

```
[
    ['tok_embeddings', 'layers.0'],  # stage0
    ['layers.1', 'layers.2'],        # stage1
    ['layers.3', 'layers.4'],        # stage2
    ...                              # and so on
]
```

This is better because it can generally be applied to all models, and the code can be re-used for cases that don't explicitly require pipelined execution (for example, streaming diloco needs to communicate model chunks).

Changes:
- Refactor deepseekv3 and llama to share the same pipeline util functions
- Add module_names_per_model_chunk config, deprecate pipeline_parallel_split_points

TODO (follow-up PRs):
- `pipeline_module_split` will be upstreamed to PyTorch as a `torch.distributed.pipelining` utility since it contains no model-specific code.
- Additional changes are needed to get this to work for torchft streaming diloco, including updating the training loop to not execute if the pipeline schedule isn't set and making sure the pipelining_fn returns the correct model chunks.

cc @tushar00jain