Update stable branch with rocm_dev #92

sudhu2k · 2025-09-10T14:38:03Z

Motivation

Update the stable branch to pick up the latest features from the rocm_dev branch.

Test Plan

Unit tests will be run before the PR is merged.

Test Result

N/A

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Deepak Narayanan <[email protected]> Co-authored-by: Oliver Koenig <[email protected]> Co-authored-by: James Shen <[email protected]> Co-authored-by: Kirthi Shankar Sivamani <[email protected]> Co-authored-by: Keshav Santhanam <[email protected]> Co-authored-by: jasonwan <[email protected]>


lmcafee-nvidia and others added 30 commits November 13, 2024 22:06

ADLR/megatron-lm!2240 - Rename optimizer's model_parallel_group -> grad_stats_parallel_group.

26b8b64

Merge branch 'lmcafee/distopt-doc-oct24' into 'main'

ae9c141

Rename optimizer's model_parallel_group -> grad_stats_parallel_group. See merge request ADLR/megatron-lm!2240

Merge branch 'boxiangw/fsdp2' into 'main'

4c4215f

Add support for PyTorch FSDP-2 See merge request ADLR/megatron-lm!2150

ADLR/megatron-lm!2345 - Update simple_text_generation_controller.py

229e225

Merge branch 'shanmugamr-main-patch-24278' into 'main'

8e22e5b

Update simple_text_generation_controller.py See merge request ADLR/megatron-lm!2345

ADLR/megatron-lm!2273 - Updating all T5 attention masks (encoder, decoder, encoder-decoder) to be compatible with all 3 TE backends

c1728c1

Co-authored-by: Huy Vu2 <[email protected]> Co-authored-by: root <[email protected]>

Merge branch 'huvu/update_t5_attentionmasktype' into 'main'

2163865

Updating all T5 attention masks (encoder, decoder, encoder-decoder) to be compatible with all 3 TE backends See merge request ADLR/megatron-lm!2273

ADLR/megatron-lm!2279 - Add hierarchical cp comm group

645c329

Co-authored-by: root <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: root <[email protected]>

Merge branch 'add_hierarchical_cp_comm_group' into 'main'

2bdc60c

Add hierarchical cp comm group See merge request ADLR/megatron-lm!2279

ADLR/megatron-lm!2351 - Add missing arg to save_checkpoint call

8b72751

Merge branch 'jbarker-main-patch-72619' into 'main'

63b8520

Add missing arg to save_checkpoint call See merge request ADLR/megatron-lm!2351

ADLR/megatron-lm!2306 - NVLM example scripts

4131b07

Merge branch 'trintamaki/nvlm-example-scripts' into 'main'

ce507ee

NVLM example scripts See merge request ADLR/megatron-lm!2306

ADLR/megatron-lm!2348 - ci: Re-enable llava tests

9e9d4f5

Merge branch 'ko3n1g/ci/re-enable-mm-tests' into 'main'

6c88bfc

ci: Re-enable llava tests See merge request ADLR/megatron-lm!2348

ADLR/megatron-lm!2357 - ci: Retry download assets

06c67b4

Merge branch 'ko3n1g/ci/retry-download' into 'main'

5438d15

ci: Retry download assets See merge request ADLR/megatron-lm!2357

ADLR/megatron-lm!2260 - Support etp==tp when epp==0 and enforce torch ckpt-format when epp>1

57ed924

Co-authored-by: Jon Barker <[email protected]>

Merge branch 'jbarker/etp_equals_tp' into 'main'

0f389f2

Support etp==tp when epp==0 and enforce torch ckpt-format when epp>1 See merge request ADLR/megatron-lm!2260

ADLR/megatron-lm!2347 - QKNorm to work with TENorm

62e2e33

Co-authored-by: Shanmugam Ramasamy <[email protected]>

Merge branch 'qknorm' into 'main'

68e11fb

QKNorm to work with TENorm See merge request ADLR/megatron-lm!2347

ADLR/megatron-lm!2015 - Support RMSNorm when TE and Apex are not installed

693ae86

Merge branch 'torch-rms-norm' into 'main'

c4c9057

Support RMSNorm when TE and Apex are not installed See merge request ADLR/megatron-lm!2015

ADLR/megatron-lm!2343 - Clarifications for batch x pipeline parallel logic

2e975f0

Merge branch 'helenn-fix-batch-pipeline-logic' into 'main'

2138248

Clarifications for batch x pipeline parallel logic See merge request ADLR/megatron-lm!2343

ADLR/megatron-lm!2293 - Add attention bias arg in MCore transformer for TE cuDNN FusedAttention

cd1d30b

Co-authored-by: yaoyu-33 <[email protected]>

Merge branch 'yuya/add_attn_bias' into 'main'

6033e95

Add attention bias arg in MCore transformer for TE cuDNN FusedAttention See merge request ADLR/megatron-lm!2293

ADLR/megatron-lm!2360 - chore: Add mypy optionally

4f5aa6d

Merge branch 'ko3n1g/chore/add-mypy' into 'main'

f214627

chore: Add mypy optionally See merge request ADLR/megatron-lm!2360

mpashkovskii and others added 24 commits May 6, 2025 09:05

Merge branch 'ROCm:rocm_dev' into fix/moe-serving

283a579

Merge branch 'ROCm:rocm_dev' into feat/mixtral-lora

fcfa202

feat: add support of gzipped input datasets (#69)

a320613

Merge remote-tracking branch 'origin/rocm_dev' into pin_te

64b25c1

Merge pull request #75 from RuibinCheung/fix_te_113_not_compatible_on_rocm

b4c56f9

[Fix] LayerNorm and RMSNorm not compatible on TE 1.13

installing TE with -v

c42c8fb

Merge branch 'ROCm:rocm_dev' into feat/mixtral-lora

269e1b9

more optimized clone

7bc21e2

Revert "[Fix] LayerNorm and RMSNorm not compatible on TE 1.13"

824acaf

Merge pull request #79 from ROCm/revert-75-fix_te_113_not_compatible_on_rocm

6a27e3e

Revert "[Fix] LayerNorm and RMSNorm not compatible on TE 1.13"

Merge pull request #74 from mpashkovskii/tests/te-norm

10b7bc9

tests: add TENorm constructor tests

explicit handle of submodules

e37101b

Merge pull request #70 from mpashkovskii/fix/moe-serving

2bfccb4

fix: support MoE models serving with EP

Merge pull request #78 from ROCm/pin_te

f612bdf

pinned TE and added pytest arg to make distributed test more stable

Merge branch 'ROCm:rocm_dev' into feat/mixtral-lora

f840dea

fix: add new line at the end of the file

14948c0

tests: disable TP=2 tests because of memory leak

3a03f5d

Updated default TOTAL_ITERS count to ensure PyTorch profiler stability (#84)

38fc830

* Updated iter count to ensure pytorch profiler stability
* Included change into `train_llama2.sh`

Modified DockerFIle with fixes for Rocm6+

24a37df

Added transformers to CI dockerfile

822ddd2

Merge pull request #90 from ROCm/DockerFile_Fixes

4804918

Docker file fixes

FP8 Weight Transpose Cache ON/OFF (#86)

856c36d

* add keep_fp8_weight_transpose_cache to control memory allocation in TE
* check TE class signature
* update te min version for weight cache transpose
* updated fsdp args in examples/train_llama

Merge branch 'ROCm:rocm_dev' into feat/mixtral-lora

3c9650a

Merge pull request #53 from mpashkovskii/feat/mixtral-lora

0bd0914

feat: add LoRA adapter layer and Mixtral LoRA training

sudhu2k requested review from wenchenvincent and zstreet87 September 10, 2025 14:38

sudhu2k self-assigned this Sep 10, 2025

GeneDer and others added 2 commits September 19, 2025 11:08

drop the need to use github token

a4e861c

Signed-off-by: Gene Der Su <[email protected]>

Merge pull request #93 from ROCm/genesu/drop-the-need-for-github-token

6bdf2be

drop the need for GitHub token

sudhu2k closed this Oct 14, 2025

51 participants
