shjwudp
Contributor

@shjwudp shjwudp commented Aug 27, 2025

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Starting with M-Core v0.14, custom-fsdp was changed to megatron-fsdp, which introduced API changes. This PR adapts NeMo to those changes. The most significant one is the checkpoint format change from M-Core torch_dist to fsdp_dtensor.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Selvaraj Anandaraj and others added 3 commits August 27, 2025 20:53
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: jianbinc <[email protected]>
@shjwudp shjwudp force-pushed the jianbinc/nvfsdp_update branch from 07e4c4e to 0c17d92 Compare August 27, 2025 12:53
@gautham-kollu gautham-kollu requested a review from BoxiangW August 27, 2025 19:55
BoxiangW
BoxiangW previously approved these changes Aug 27, 2025
Collaborator

@BoxiangW BoxiangW left a comment

LGTM, thanks!

@@ -84,6 +84,49 @@ def override_recipe_configs(
    recipe = set_exp_logging_configs(
        recipe, "pre_train", "llm", "llama3", args.tensorboard, args.wandb, args.wandb_prj_name, args.wandb_job_name
    )
    # for saving checkpoints
    ckpt_path = "/lustre/fsw/coreai_devtech_all/jianbinc/playground/nemo_nvfsdp_update/NeMo/checkpoints"
Collaborator

@BoxiangW BoxiangW Aug 27, 2025

We should remove this path

Contributor Author

Yes, thanks for the review!
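
One way to drop the hardcoded path is to take the checkpoint directory from the command line or the environment. The sketch below is only an illustration: the `--ckpt_path` flag, the `CKPT_PATH` variable, and the `resolve_ckpt_path` helper are hypothetical names, not part of the script under review.

```python
import argparse
import os


def resolve_ckpt_path(argv=None):
    """Return the checkpoint directory from --ckpt_path, else $CKPT_PATH, else None."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag and environment variable, used here for illustration only.
    parser.add_argument("--ckpt_path", default=os.environ.get("CKPT_PATH"))
    # parse_known_args ignores flags that belong to the surrounding script.
    args, _unknown = parser.parse_known_args(argv)
    return args.ckpt_path
```

An explicit flag takes precedence over the environment variable, and a `None` result lets the caller skip checkpointing instead of writing to a user-specific directory.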

formt code

Signed-off-by: jianbinc <[email protected]>
@@ -365,9 +365,9 @@ def __init__(

        self._fsdp = None

        if fsdp is None and self.ddp_config and self.ddp_config.use_custom_fsdp:
Collaborator

Training scripts and tutorials that customers built using this would break. We should have comms around it.

Can we add release-note documentation for this?

Contributor Author

Hi, thanks for pointing that out. I've added use_custom_fsdp as a fallback so that users on older M-Core versions won't see breakage.

Where should I add the release note?
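
The fallback described above could look roughly like the following. This is a minimal sketch, not the code in this PR: `NewDDPConfig`, `OldDDPConfig`, `fsdp_enabled`, and `enable_fsdp` are hypothetical stand-ins, and only the two flag names come from the diff.

```python
from dataclasses import dataclass


@dataclass
class NewDDPConfig:
    """Hypothetical stand-in for a DDP config from M-Core v0.14 or later."""
    use_megatron_fsdp: bool = False


@dataclass
class OldDDPConfig:
    """Hypothetical stand-in for a DDP config from an older M-Core."""
    use_custom_fsdp: bool = False


def fsdp_enabled(ddp_config):
    """Read the new flag if the config defines it, else fall back to the legacy flag."""
    if ddp_config is None:
        return False
    if hasattr(ddp_config, "use_megatron_fsdp"):
        return bool(ddp_config.use_megatron_fsdp)
    return bool(getattr(ddp_config, "use_custom_fsdp", False))


def enable_fsdp(ddp_config):
    """Set whichever flag the config defines and return the name that was set."""
    for name in ("use_megatron_fsdp", "use_custom_fsdp"):
        if hasattr(ddp_config, name):
            setattr(ddp_config, name, True)
            return name
    raise AttributeError("config defines neither use_megatron_fsdp nor use_custom_fsdp")
```

Checking the new name first keeps current M-Core on the renamed flag, while configs that only know `use_custom_fsdp` continue to work.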

def set_use_megatron_fsdp(recipe):
    try:
        recipe.trainer.strategy.ddp.use_megatron_fsdp = True
    except:

Check notice

Code scanning / CodeQL

Except block handles 'BaseException' Note

Except block directly handles BaseException.

try:
    recipe.trainer.strategy.ddp.keep_fp8_transpose_cache = False
except:

Check notice

Code scanning / CodeQL

Except block handles 'BaseException' Note

Except block directly handles BaseException.
)
try:
    recipe.trainer.strategy.ddp.keep_fp8_transpose_cache = bool(keep_fsdp_fp8_transpose_cache)
except:

Check notice

Code scanning / CodeQL

Except block handles 'BaseException' Note

Except block directly handles BaseException.
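
The CodeQL notices above flag bare `except:` clauses, which also catch `KeyboardInterrupt` and `SystemExit`. A possible way to address them, sketched under the assumption that the only expected failure is a missing or non-settable attribute, is to catch `AttributeError` alone. The `set_ddp_option` helper is hypothetical and not part of this PR.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def set_ddp_option(recipe, name, value):
    """Set recipe.trainer.strategy.ddp.<name>; return False if that is not possible."""
    try:
        setattr(recipe.trainer.strategy.ddp, name, value)
    except AttributeError as err:
        # Raised when a link in the attribute chain is missing, or when ddp
        # is an object that cannot take new attributes (a plain string, say).
        logger.warning("Could not set ddp.%s: %s", name, err)
        return False
    return True


# Usage with a stand-in recipe object (not a real NeMo recipe).
recipe = SimpleNamespace(
    trainer=SimpleNamespace(strategy=SimpleNamespace(ddp=SimpleNamespace()))
)
set_ddp_option(recipe, "use_megatron_fsdp", True)
set_ddp_option(recipe, "keep_fp8_transpose_cache", False)
```

Any other exception type propagates to the caller, so unexpected failures stay visible instead of being silently swallowed.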

3 participants