
Conversation

@BoxiangW (Contributor) commented Sep 18, 2025

Features in this PR

  • Supports local Muon
  • Supports distributed Muon
  • Tested on dense and MoE models

Requires https://github.com/NVIDIA-NeMo/Emerging-Optimizers (the Muon update it provides is sketched below).
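
For readers unfamiliar with Muon: the optimizer keeps heavy-ball momentum for each 2D weight matrix and approximately orthogonalizes it with a Newton-Schulz iteration before applying the update. Below is a minimal PyTorch sketch of that update rule; the coefficients and shape-dependent scaling follow the public Muon write-up, the names are illustrative, and this is not the Emerging-Optimizers API.

import torch

def newton_schulz(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration that drives the singular values of G
    # toward 1 (coefficients from the public Muon write-up).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)   # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:             # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    # One local Muon step for a single 2D weight matrix.
    momentum_buf.mul_(beta).add_(grad)    # heavy-ball momentum
    update = newton_schulz(momentum_buf)  # orthogonalized direction
    # shape-dependent scale, following one common Muon variant
    scale = max(1.0, param.size(0) / param.size(1)) ** 0.5
    param.data.add_(update, alpha=-lr * scale)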

Note: This PR includes an example script that uses a custom model (Llama 3 8B with 8 experts); change the model structure if needed.

To use the dist_muon or muon optimizer, run:

# on a compute node
cd <workspace>
# clone the optimizer library and the PR branch of Megatron-LM
git clone https://github.com/NVIDIA-NeMo/Emerging-Optimizers.git
git clone https://github.com/BoxiangW/Megatron-LM.git
cd Megatron-LM/
git checkout boxiangw/llm-shower-repo
cd ..

# launch the example training script
bash Megatron-LM/muon.sh

To use the example script:

  1. Add your wandb API key and entity to the script (optional).
  2. Add your tokenizer model path and data path to the script.
  3. Select an optimizer from [muon, dist_muon, adam]; it defaults to dist_muon (see the dispatch sketch after this list).
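
A hypothetical sketch of how the script's optimizer switch could be parsed on the Python side; the flag name and dispatch here are assumptions, not what muon.sh actually passes through:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--optimizer", choices=["muon", "dist_muon", "adam"],
                    default="dist_muon")  # mirrors the script's default
args = parser.parse_args()

if args.optimizer == "adam":
    print("using Megatron's standard Adam path")
else:
    # both muon and dist_muon would come from Emerging-Optimizers
    print(f"building {args.optimizer} from Emerging-Optimizers")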

Signed-off-by: Boxiang Wang <[email protected]>
@copy-pr-bot (bot) commented Sep 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

BoxiangW self-assigned this Sep 18, 2025
@BoxiangW (Contributor, Author) commented

Closing this PR in favor of the https://github.com/NVIDIA/Megatron-LM/tree/dev branch.

@valentyn1boreiko commented
Hi @BoxiangW, thank you for the implementation! Going through the code, it doesn't really support distributed Muon, does it? Also, distrib_optimizer.py is still written mostly for Adam and is not touched in this PR.
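
For context on the question: Megatron's distributed optimizer shards optimizer state across data-parallel ranks, while Muon's Newton-Schulz step operates on a whole 2D matrix at once. Below is a hedged sketch of one way a distributed Muon step could bridge that gap (gather, orthogonalize, re-shard); it is an illustration of the concept, not code from this PR, and newton_schulz refers to the sketch earlier in this thread:

import torch
import torch.distributed as dist

def dist_muon_update(local_shard, lr, group=None):
    # Assumes the momentum matrix is row-sharded evenly across the group.
    world = dist.get_world_size(group)
    rank = dist.get_rank(group)
    shards = [torch.empty_like(local_shard) for _ in range(world)]
    dist.all_gather(shards, local_shard, group=group)  # 1) reassemble
    full = newton_schulz(torch.cat(shards, dim=0))     # 2) orthogonalize
    return -lr * full.chunk(world, dim=0)[rank]        # 3) keep our slice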

