Fsdp tutorial update #961
Open: coreyjadams wants to merge 96 commits into NVIDIA:main from coreyjadams:fsdp-tutorial-update
Conversation
…buted applications (NVIDIA#906) * Wrap DeviceMesh in quotes for typing hint, to protect older torch versions (NVIDIA#905) from compatibility issues. * Bumps torch version to >=2.4.0 to minimize support surface for distributed applications. * Adds changelog note * Merge SongUNetPosLtEmb with SongUNetPosEmb and add support for batch>1 (NVIDIA#901) * multi-gpu training supported for corrdiff optimization * enable mixed precision for val * clean codebase for opt * add amp_mode aware model architecture * add None checking for params * revise datatype casting schema * Add test cases for corrdiff optimizations Signed-off-by: Neal Pan <[email protected]> * revised from_checkpoint, update tests and CHANGELOG Signed-off-by: jialusui1102 <[email protected]> * Lint and format code properly Signed-off-by: Neal Pan <[email protected]> * add multi-gpu optimization * rebase changes and update tests and configs Signed-off-by: jialusui1102 <[email protected]> * merge ResidualLoss and refactored layer and Unet init based on PR review Signed-off-by: jialusui1102 <[email protected]> * Update layers.py with robust apex import * address incompatibility between dynamo and patching, retain same optimization perf w torch.compile Signed-off-by: jialusui1102 <[email protected]> * update tests Signed-off-by: jialusui1102 <[email protected]> * update changelog Signed-off-by: jialusui1102 <[email protected]> * initialize global_index directly on device Signed-off-by: jialusui1102 <[email protected]> * formatting Signed-off-by: jialusui1102 <[email protected]> * fix loss arguments in train.py Signed-off-by: jialusui1102 <[email protected]> * merge songunetposembd with songunetposltembd with index slicing (recompile issue persists) Signed-off-by: jialusui1102 <[email protected]> * fix small errors in songunet Signed-off-by: jialusui1102 <[email protected]> * revise positional_embedding_indexing to avoid recompile/graph break and with faster bw compared to old version Signed-off-by: jialusui1102 
<[email protected]> * update changelog Signed-off-by: jialusui1102 <[email protected]> * add back SongUNetPosLtEmbd class for better ckp loading Signed-off-by: jialusui1102 <[email protected]> * add forward in SongUnetLtPosEmbd and update train.py Signed-off-by: jialusui1102 <[email protected]> * update test for lt model Signed-off-by: jialusui1102 <[email protected]> * update comments for embedding_selector test for lt model Signed-off-by: jialusui1102 <[email protected]> * update doctest Signed-off-by: jialusui1102 <[email protected]> * Added tiny detail in corrdiff readme Signed-off-by: Charlelie Laurent <[email protected]> * minor update to arguments and docstring Signed-off-by: jialusui1102 <[email protected]> --------- Signed-off-by: Neal Pan <[email protected]> Signed-off-by: jialusui1102 <[email protected]> Signed-off-by: Charlelie Laurent <[email protected]> Co-authored-by: Alicia Sui <[email protected]> Co-authored-by: Neal Pan <[email protected]> Co-authored-by: Charlelie Laurent <[email protected]> Co-authored-by: Charlelie Laurent <[email protected]> * Update CHANGELOG.md Fix lint error --------- Signed-off-by: Neal Pan <[email protected]> Signed-off-by: jialusui1102 <[email protected]> Signed-off-by: Charlelie Laurent <[email protected]> Co-authored-by: Corey adams <[email protected]> Co-authored-by: Jialu (Alicia) Sui <[email protected]> Co-authored-by: Alicia Sui <[email protected]> Co-authored-by: Neal Pan <[email protected]> Co-authored-by: Charlelie Laurent <[email protected]> Co-authored-by: Charlelie Laurent <[email protected]>
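The first bullet above (wrapping DeviceMesh in quotes for the typing hint) refers to the standard forward-reference trick for keeping a module importable on older torch versions. A minimal stdlib sketch of the pattern; `shard_parameters` is a hypothetical function, not the PR's actual code, and the `DeviceMesh` import path is the one used by recent torch releases:

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Imported only for static type checkers; skipped at runtime, so the
    # module still loads on older torch versions that lack DeviceMesh.
    from torch.distributed.device_mesh import DeviceMesh

def shard_parameters(model: object, mesh: Optional["DeviceMesh"] = None) -> object:
    # The quoted annotation stays a string until a type checker resolves it,
    # so no runtime import of DeviceMesh is ever needed.
    return model
```

The same effect can be had with `from __future__ import annotations`, which turns every annotation in the module into a string.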
* fixing model.py to make compatible with NIM * adding freq buffer to ParameterModel * formatting --------- Co-authored-by: Rishi Ranade <[email protected]> Co-authored-by: Mohammad Amin Nabian <[email protected]>
* Make sure that gpu processing and output settings are configurable. Set sensible defaults in the example config
* make dali optional * update Changelog
…ing cuda is available. (NVIDIA#943)
* update to make it compatible for windows * update darcy fno to minimize the dependencies to make it very light-weight and hello-worldy * use pathlib * lint * updates to checkpoint loading
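The "use pathlib" and checkpoint-loading items above describe the usual portability fix for Windows-incompatible path handling. A hedged sketch of the idea, not the example's actual helper:

```python
from pathlib import Path
from typing import Optional

def latest_checkpoint(ckpt_dir) -> Optional[Path]:
    # Return the newest *.pt checkpoint in ckpt_dir, or None if there are none.
    # Path handles separators portably, so this works on Windows and Linux alike.
    ckpt_dir = Path(ckpt_dir)            # accepts str or Path
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    ckpts = sorted(ckpt_dir.glob("*.pt"), key=lambda p: p.stat().st_mtime)
    return ckpts[-1] if ckpts else None
```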
* updating readme * Adding prerequisites section * fixing ci issues * linting --------- Co-authored-by: Kaustubh Tangsali <[email protected]> Co-authored-by: Kaustubh Tangsali <[email protected]>
Fix broken ShardTensor link.
… samples (NVIDIA#949) * add requirements.txt for bloodflow and deforming plate * move diffusion example (NVIDIA#930) * move diffusion example * update broken links * add requirements for flow reconstruction
* Add datapipes docs. * Fix class names.
… curation steps (NVIDIA#953) Co-authored-by: Kaustubh Tangsali <[email protected]>
* update logging, launch, utils api docs with added descriptions and examples * update introductory tutorial for typos and added clarity
* Adding first half of torch compile tutorial. * fixes to formatting and syntax * Add second half of torch.compile tutorial. * Clean up organization of performance docs. * Minor clean up on perf table teasers * remove all but IO section * Fix typos in torch compile tutorial
* add tutorial on physics informing * add geometry stuff * fix typos * add some opening text to index.rst * add summary * typos * address feedback * address feedback * add Ram's changes --------- Co-authored-by: Peter Sharpe <[email protected]>
…ed into the docs.
* update lr_decay_rate to be configurable Signed-off-by: jialusui1102 <[email protected]> * update lr_decay_rate comment Signed-off-by: jialusui1102 <[email protected]> --------- Signed-off-by: jialusui1102 <[email protected]>
* update license header checks to only check for committed files
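Restricting license-header checks to committed files typically means listing tracked files via git rather than walking the directory tree. A minimal sketch under that assumption; the tag name is the standard SPDX convention, but the helper names are hypothetical:

```python
import subprocess

SPDX_TAG = "SPDX-License-Identifier"

def has_license_header(text: str, max_lines: int = 10) -> bool:
    # A file passes if the SPDX tag appears near the top.
    return any(SPDX_TAG in line for line in text.splitlines()[:max_lines])

def committed_files(pattern: str = "*.py") -> list:
    # `git ls-files` lists only tracked files, so untracked scratch
    # files in the working tree are never flagged by the check.
    out = subprocess.run(
        ["git", "ls-files", pattern], capture_output=True, text=True
    )
    return out.stdout.splitlines()
```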
NVIDIA#998) * Implements basic logic and safety for overridable args for the __init__ Signed-off-by: Charlelie Laurent <[email protected]> * Implements overridable arguments for diffusion models used in corrdiff. Refactored non-overridable args into properties that can be changed dynamically Signed-off-by: Charlelie Laurent <[email protected]> * Modified corrdiff train.py with refactored overridable APIs Signed-off-by: Charlelie Laurent <[email protected]> * Added tests for profile_mode properties of model wrappers Signed-off-by: Charlelie Laurent <[email protected]> * Added test for argument overrides in from_checkpoint Signed-off-by: Charlelie Laurent <[email protected]> * Updated CHANGELOG Signed-off-by: Charlelie Laurent <[email protected]> * Fixed typo Signed-off-by: Charlelie Laurent <[email protected]> * Fixed file deletion in from_checkpoint tests Signed-off-by: Charlelie Laurent <[email protected]> * Added precision regarding precedence of values saved in the state-dict. Modified tests accordingly Signed-off-by: Charlelie Laurent <[email protected]> * Added new test for from_checkpoint args override with UNet wrapper. Added some minor type hints Signed-off-by: Charlelie Laurent <[email protected]> * Disabled from_checkpoint test for UNet wrapper. To be enabled after GroupNorm is refactored Signed-off-by: Charlelie Laurent <[email protected]> --------- Signed-off-by: Charlelie Laurent <[email protected]>
… or not perfectly divisible (NVIDIA#996) * Fixing edge case with num_channels < min_channels_per_group * Raise error if groupnorm sizes don't match
* Massive refactor on domino utils.py to improve code quality * Adds missing tensorboard requirement * Fixes missing cuml requirement * Begins process of fixing inference_on_stl.py * Fixes outdated type definition * black formatting pass * Fixes import order * black formatting * Reshape accepts a shape, not a splatted iterable * Fixes lost array axis * Enhances docstrings in utils.py with examples and improved clarity; removes outdated examples. * Enhances area_weighted_shuffle_array function by adding area_factor parameter for adjustable sampling bias; updates docstring with detailed explanation and examples. * Updates docstrings in utils.py for accuracy and clarity; modifies examples in calculate_center_of_mass, standardize, nd_interpolator, pad, and pad_inp functions; adjusts k-nearest neighbors parameter in nd_interpolator for flexibility; corrects boolean checks in pad and pad_inp examples. * black format * Add test suite for domino utils module This commit introduces a new test file `test_domino_utils.py` that includes comprehensive unit tests for various functions in the domino utils module. Each test verifies the functionality of the corresponding utility function using examples from the documentation, ensuring correctness and reliability. * Refactor array_type function to handle CuPy import gracefully and optimize area_weighted_shuffle_array for consistent array handling. Remove redundant test for array_type. * Import PyVista conditionally in extract_surface_triangles function to avoid unnecessary dependency loading. * black formatting * Remove unused import
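The `area_weighted_shuffle_array` change above adds an `area_factor` parameter for adjustable sampling bias. A simplified stdlib analogue of that idea (the real function operates on arrays and has a different signature; this is only a conceptual sketch):

```python
import random

def area_weighted_shuffle(indices, areas, n_samples, area_factor=1.0, seed=0):
    # Sample indices with probability proportional to area ** area_factor.
    # area_factor = 0 gives uniform sampling; larger values bias the draw
    # toward large-area elements (e.g. big surface triangles).
    rng = random.Random(seed)
    weights = [a ** area_factor for a in areas]
    return rng.choices(indices, weights=weights, k=n_samples)
```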
…de fixes (NVIDIA#973) * clarifies I/O in domino train.py * Gives paths in config.yaml user-agnostic pathnames * Switches from relu -> gelu to allow smooth gradients * Adds initial commit for design sensitivities study * Corrects outdated type hint * Refactors parameters in signed_distance_field calls for clarity * Refactors directory handling in create_directory and get_filenames functions to use pathlib for improved readability and functionality. Updates type hints to support both str and Path types. * Deletes merge(); this function is (a) not used anywhere, (b) can be replaced simply by the built-in sum(lists), and (c) as-written will always create an error, since `newlist` is a tuple and hence does not have a .extend() method. * black formatting * Code quality improvements * Replaces 'axis' with 'dim' in torch.cat calls for correctness with PyTorch documentation in GeoProcessor, GeometryRep, NNBasisFunctions, ParameterModel, and DoMINO classes. * Adds initial changes for DoMINO sensitivity * Refactors DesignDatapipe and DoMINOInference for improved readability and performance; updates type hints and formatting, and modifies input handling for mesh data. * Refactors DesignDatapipe to directly use STL centers for geometry coordinates; updates DoMINOInference to improve memory management and adds detailed docstrings for clarity. * Enhances DesignDatapipe by updating bounding box type hints, improving random sampling, and adding detailed docstrings for initialization and item retrieval methods. * Implements Laplacian smoothing for mesh data in a new utility function; updates DoMINOInference to utilize the new smoothing function and modifies sensitivity calculations accordingly. Enhances type hints and formatting for clarity. * Adds numba to requirements for improved performance in sensitivity analysis * Adds sbatch_logs/ to .gitignore to exclude SLURM batch log files from version control. 
* Adds compute-optimized mesh_postprocessing utilities * Working `main.py` with abstracted postprocessing step * formatting * Refactors main.py to remove duplicate STL combining function and streamline input handling. Updates input file processing and enhances results storage for mesh data. * Commits configuration files for sensitivity studies * Adds requirements.txt * Adds raw and smooth drag gradient data files, and implements a plotting script for gradient checking. * Refactors import statements in main.py for consistency and clarity. Streamlines input file path construction. * Creates main_gradient_checking.py for drag gradient checking using DoMINOInference, including sensitivity analysis and output to text files. * Updates file paths in main_gradient_checking.py and plot_gradient_checking.py to save output data in a dedicated gradient_checking_results directory. Adds new raw and smooth drag gradient data files. * Adds a new aerodynamics example using DoMINO to compute design sensitivities (e.g., drag adjoint) with respect to underlying input geometry in CHANGELOG.md. * Add README.md for DoMINO sensitivity analysis pipeline, detailing usage, features, and configuration for aerodynamic design optimization. * black formatting fixes * Add SPDX license headers to plot_gradient_checking.py * Fixes markdownlint * Removes unused import * Updates license year * Fixes license year * Removes unused main block sections * Removes erroneous uv.lock commit * Removes some optimization language * Remove unnecessary cached yaml * Refactors to not require separate config (instead pulling it from DoMINO), as well as eliminating relative paths * Add warning for loading model without checkpoint in DoMINOInference * Add verbose option to DoMINOInference for memory usage logging * Refactor imports in design_datapipe.py for clarity and efficiency; remove unused imports and reorganize necessary ones. 
* Refactor DesignDatapipe to use NearestNeighbors from cuML for neighbor finding; update input handling in DoMINOInference for improved tensor management and type consistency. * Enhance DesignDatapipe to accept a device parameter for tensor management; update tensor creation in DoMINOInference for improved efficiency and consistency. * Readme cleanup * Replace GELU activation with a configurable activation function in GeoProcessor. * formatting * remove duplicate section * Makes activations configurable * formatting * add license
* Add PyG version of VortexShedding example and VortexSheddingDataset * Replace Union type hints with an alias. Add MeshNodeBlock tests. * Add distributed sampler to the example. Add MeshEdgeBlock test. Fix DGL inference script. * Fix VortexShedding PyG inference script * Add MGN DGL2PYG tests. * Update inference notebooks * Make linter happy. * Fix test. * Update req.txt. Clean up TODO * Address review feedback. * Update README * Add proper epoch loss reporting * Address review feedback. * Require DGL or PyG only when necessary
* Add correctness test for deterministic sampler * lint * drop np dep
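A correctness test for a deterministic sampler usually asserts two properties: the same seed and epoch reproduce the same order, and every index appears exactly once. A hedged sketch of such a sampler and the invariants being tested (class name hypothetical):

```python
import random

class DeterministicSampler:
    # Yields a fixed permutation of dataset indices for a given seed/epoch,
    # so every rank and every rerun sees the same order.
    def __init__(self, num_samples: int, seed: int = 0):
        self.num_samples = num_samples
        self.seed = seed

    def indices(self, epoch: int):
        rng = random.Random(self.seed + epoch)  # epoch-dependent, seed-stable
        order = list(range(self.num_samples))
        rng.shuffle(order)
        return order
```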
…#1012) * Removed unnecessary check in args overriding Signed-off-by: Charlelie Laurent <[email protected]> * Replaced exception with warning in argument overriding Signed-off-by: Charlelie Laurent <[email protected]> --------- Signed-off-by: Charlelie Laurent <[email protected]>
* Use e2grid healpixpad when possible * Drop unused imports * changelog * formatting
* address vdr comments * fix lint * fix lint --------- Co-authored-by: root <[email protected]>
* Migrate Vortex Shedding Reduced Mesh example to PyG * Update CHANGELOG
…lobal parameters input (NVIDIA#903) * changes based on updated main branch * update to model.py and end to end testing * changes to sharded parts of the code * Update README * Update inference_on_stl.py to comply with new method * minor refactor * update * Tested training * remove hardcoded stuff from inference_on_stl.py * Removed comments from model.py * Remove air_density and stream_velocity from domino_sharded * Remove comments from domino_datapipe * Removed names and make paths generic * make encode_parameters false * Update and remove comments * Update README * Update README, remove redundant text * Update model.py to remove air_density and stream_velocity * Update inference_on_stl.py to be consistent with main * Update README.md to be compliant with main * Update tests * changes based on CI * small cleaning config.yaml * Update changelog * fixing doctest issue --------- Co-authored-by: Peter Sharpe <[email protected]> Co-authored-by: Rishikesh Ranade <[email protected]> Co-authored-by: RishikeshRanade <[email protected]>
…A#1000) * make dimensions consistent for checkpointing * add use_reentrant=False to checkpoint in songunet for torch.compile support * removed use_patch_grad_acc from loss_valid_kwargs in corrdiff train.py script as the regression loss does not support it * set graph static for corrdiff training to enable checkpoint * change the checkpoint reference dimension from x to y as it is the same dimension used to name the layers * correct positional embedding in song unet * correct embedding for gridtype==test and N_grid_channels==2 * Change single dimension shape with geometric mean to use checkpointing * reformatted --------- Co-authored-by: Charlelie Laurent <[email protected]>
…g qkv and added inference optimization and fixes (NVIDIA#954) * restructured attention into separate class and fix errors in reshaping qkv Signed-off-by: jialusui1102 <[email protected]> * update CHANGELOG Signed-off-by: jialusui1102 <[email protected]> * revert earlier changes in train.py Signed-off-by: jialusui1102 <[email protected]> * add multiple inference optimization for CorrDiff Signed-off-by: jialusui1102 <[email protected]> * minor update Signed-off-by: jialusui1102 <[email protected]> * add attention ckp conversion and restructure use_fp16 logistics Signed-off-by: jialusui1102 <[email protected]> * update unit tests for fp16 Signed-off-by: jialusui1102 <[email protected]> * Minor formatting to the Attention docstring Signed-off-by: Charlelie Laurent <[email protected]> * Removed private attribute _use_fp16 initialization in UNet end EDMPrecondSuperResolution Signed-off-by: Charlelie Laurent <[email protected]> * Made overlap_count a private argument in patching and the method _get_overlap_count a private method Signed-off-by: Charlelie Laurent <[email protected]> * Added non-regression test for GridPatching2D and get_overlap_count method Signed-off-by: Charlelie Laurent <[email protected]> * Added API doc for use_fp16 method in UNet wrapper Signed-off-by: Charlelie Laurent <[email protected]> * Added docs for overlap_count argument in image_fuse Signed-off-by: Charlelie Laurent <[email protected]> * Removed utils subdirectory in tests Signed-off-by: Charlelie Laurent <[email protected]> * Fixed some pytest package confusion in utils testing Signed-off-by: Charlelie Laurent <[email protected]> * restructure get_overlap_count() as a static method and update related unit tests Signed-off-by: jialusui1102 <[email protected]> * Minor formatting in docstring for get_overlap_count Signed-off-by: Charlelie Laurent <[email protected]> * Minor detail in docstring for image_fuse Signed-off-by: Charlelie Laurent <[email protected]> * Changed exepct path for 
non-regression reference data used in test_patching Signed-off-by: Charlelie Laurent <[email protected]> * only do attn ckp conversion for UNet based models Signed-off-by: jialusui1102 <[email protected]> * add comment to move attn ckp conversion to classes later Signed-off-by: jialusui1102 <[email protected]> * Consistently set stochastic sampler precision to float32 Signed-off-by: jialusui1102 <[email protected]> * Moved attention module conversion to UNetBlcok load_state_dict method Signed-off-by: Charlelie Laurent <[email protected]> * Minor renaming in UNetBlock Signed-off-by: Charlelie Laurent <[email protected]> * Simplified warning logic for attention module's keys mapping Signed-off-by: Charlelie Laurent <[email protected]> * Updated corrdiff train and generate recipes with overridable args Signed-off-by: Charlelie Laurent <[email protected]> * Added validation to make sure amp_mode is disabled when torch.autocast is disabled Signed-off-by: Charlelie Laurent <[email protected]> * Implemented automated channels_last layout in SongUNet when using use_apex_gn Signed-off-by: Charlelie Laurent <[email protected]> * Fix CI: added attribute use_apex_gn to SongUNet Signed-off-by: Charlelie Laurent <[email protected]> * Refactored amp_mode and profile_mode properties for SongUNets and their wrappers Signed-off-by: Charlelie Laurent <[email protected]> * Added two distinct shape-specific apply_wrapper in stochastic sampler Signed-off-by: Charlelie Laurent <[email protected]> * Updated tests to be compatible with the modified amp_mode API Signed-off-by: Charlelie Laurent <[email protected]> * Fix pytorch deprecation warning for is_autocast_enabled Signed-off-by: Charlelie Laurent <[email protected]> * Implemented property factory for amp_mode and profile_mode in model wrappers + added them to StormCastUNet to pass CI tests Signed-off-by: Charlelie Laurent <[email protected]> * Updated CI tests for diffusion models Signed-off-by: Charlelie Laurent <[email protected]> 
* resolve conflicts between cpu and apex and update related CI Signed-off-by: jialusui1102 <[email protected]> * resolve recompile errors for stochastic sampler in CICD Signed-off-by: jialusui1102 <[email protected]> * Updated CHANGELOG.md Signed-off-by: Charlelie Laurent <[email protected]> * Updated CHANGELOG.md Signed-off-by: Charlelie Laurent <[email protected]> * Updated CHANGELOG.md Signed-off-by: Charlelie Laurent <[email protected]> * Some comments in SongUNets Signed-off-by: Charlelie Laurent <[email protected]> * Updated docs with amp_mode and profile_mode APIs Signed-off-by: Charlelie Laurent <[email protected]> --------- Signed-off-by: jialusui1102 <[email protected]> Signed-off-by: Charlelie Laurent <[email protected]> Co-authored-by: Charlelie Laurent <[email protected]> Co-authored-by: Charlelie Laurent <[email protected]>
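Several commits above revolve around `GridPatching2D` and `get_overlap_count`: an image is split into overlapping patches, and fusing the processed patches back requires knowing how many patches cover each pixel. A pure-Python sketch of that bookkeeping, assuming square patches and a uniform stride (the real class works on tensors and handles padding):

```python
def extract_patches(img, patch, stride):
    # Split a 2D grid (list of lists) into overlapping patch views.
    h, w = len(img), len(img[0])
    patches = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            patches.append([row[left:left + patch]
                            for row in img[top:top + patch]])
    return patches

def overlap_count(h, w, patch, stride):
    # How many patches cover each pixel: the per-pixel normalizer used
    # when averaging overlapping patch outputs back into one image.
    counts = [[0] * w for _ in range(h)]
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            for i in range(top, top + patch):
                for j in range(left, left + patch):
                    counts[i][j] += 1
    return counts
```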
* fixed grid effect * added data filter * added data filter * updated comment --------- Co-authored-by: Oliver Hennigh <[email protected]>
…IA#982) * Fix regression output shape * Only use act if fused_act is True * Avoid dtype change of buffer/param and fix softmax dtype * Added unit tests for song unet models with learnable positional embedding, lead time aware, with compile, apex_gn, etc... Signed-off-by: Charlelie Laurent <[email protected]> * Updated tests for SongUNetPosLtEmbd with AMP, Apex GN and compile Signed-off-by: Charlelie Laurent <[email protected]> * Renamed variable in SongUNetPOsEmbd Signed-off-by: Charlelie Laurent <[email protected]> * Revert bug introduced in SongUNetPosEmbd positional_embedding_selector Signed-off-by: Charlelie Laurent <[email protected]> * Reverted test script to its original state Signed-off-by: Charlelie Laurent <[email protected]> * Fixed some new CI tests Signed-off-by: Charlelie Laurent <[email protected]> * Added missing parameter in new tests Signed-off-by: Charlelie Laurent <[email protected]> * Added dtype casting in SongUNetPosEmbd forward Signed-off-by: Charlelie Laurent <[email protected]> * Fixed number of channels in new tests Signed-off-by: Charlelie Laurent <[email protected]> * Added random seed in new tests Signed-off-by: Charlelie Laurent <[email protected]> * Added more missing random seeds to new tests Signed-off-by: Charlelie Laurent <[email protected]> * Removed some random seeds added by mistake in new tests Signed-off-by: Charlelie Laurent <[email protected]> --------- Signed-off-by: Charlelie Laurent <[email protected]> Co-authored-by: Julius Berner <[email protected]> Co-authored-by: Charlelie Laurent <[email protected]> Co-authored-by: Charlelie Laurent <[email protected]>
* first commit * add README.md * add README.md * add README.md * revise for 2nd round code review * revise for 2nd round code review * CHANGELOG update for TopoDiff * code reivew for merge * code review * add command to run the model * add command to run the model * add command to run the model * add command to run the model * avoid floating material in generation * avoid floating material in generation * topodiff merge * topodiff merge * topodiff merge * topodiff merge * Topodiff merge * Topodiff merge * Topodiff merge * Topodiff merge * formatting * .formatting, name change * fix bugs, cleanup * fix pydantic --------- Co-authored-by: Mohammad Amin Nabian <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: root <[email protected]>
* adding moe * address review comments, update readme * Small bug fix for preprocessor * address review comments --------- Co-authored-by: root <[email protected]> Co-authored-by: root <[email protected]>
* fixed grid effect * uv fix * blaa * removed nemo build * added unmanaged --------- Co-authored-by: Oliver Hennigh <[email protected]>
* Refactor signed_distance_field function in sdf.py for improved clarity and performance. Update parameter types to use np.ndarray and cp.ndarray, enhance docstring with detailed descriptions and examples, and streamline array conversion logic. * Optimize memory allocation in signed_distance_field function by using wp.empty instead of wp.zeros. Update array dimensions for kernel launch and streamline return logic. * Enhance docstring in signed_distance_field function to clarify parameters and return types, including GPU acceleration details and usage of sign winding number method. Remove unnecessary blank line. * Enhance docstring in signed_distance_field function to provide clearer explanation of the 'include_hit_points' parameter, specifying its role in defining the SDF. * formatting * Fix formatting inconsistencies in docstring of signed_distance_field function in sdf.py. * Adds fix for back-compatibility with input_points arrays with incorrect shape
* Added experimental tEDMPrecondSuperRes Signed-off-by: Charlelie Laurent <[email protected]> * Some refactors in diffusion ResidualLoss to accomodate t-EDM subclass Signed-off-by: Charlelie Laurent <[email protected]> * Added experimental t-EDM loss Signed-off-by: Charlelie Laurent <[email protected]> * Added warning message when importing from physicsnemo.experimental Signed-off-by: Charlelie Laurent <[email protected]> * Some fixes in docstrings Signed-off-by: Charlelie Laurent <[email protected]> * Added student-t distribution in StackedRandomGenerator Signed-off-by: Charlelie Laurent <[email protected]> * Added t-student option in corrdiff diffusion_step Signed-off-by: Charlelie Laurent <[email protected]> * Added t-student distribution option in corrdiff generate.py Signed-off-by: Charlelie Laurent <[email protected]> * Updated warning message for student-t distribution Signed-off-by: Charlelie Laurent <[email protected]> * Corrected wrong import in experimental diffusion metrics Signed-off-by: Charlelie Laurent <[email protected]> * Added t-student distribution option in CorrDiff train.py Signed-off-by: Charlelie Laurent <[email protected]> * Minor string modif Signed-off-by: Charlelie Laurent <[email protected]> * Some minor renaming and reformating Signed-off-by: Charlelie Laurent <[email protected]> * Updated CHANGELOG.md Signed-off-by: Charlelie Laurent <[email protected]> * Added another safety check to CorrDiff generate.py Signed-off-by: Charlelie Laurent <[email protected]> * Added tests for t-EDM models, metrics and utils Signed-off-by: Charlelie Laurent <[email protected]> * Moved t-EDM tests to existing directories Signed-off-by: Charlelie Laurent <[email protected]> * Some fixes in t-edm tests Signed-off-by: Charlelie Laurent <[email protected]> * Fixed missing device in diffusion_step Signed-off-by: Charlelie Laurent <[email protected]> * Added a few missing docstrings for StackedRandomGenerator Signed-off-by: Charlelie Laurent <[email 
protected]> * Changed default value of P_mean to 0 in t-EDM loss Signed-off-by: Charlelie Laurent <[email protected]> * Made P_mean and P_std configurable in CorrDiff train.py and generate.py Signed-off-by: Charlelie Laurent <[email protected]> * Updated CHANGELOG.md to document configurable P_mean and P_std Signed-off-by: Charlelie Laurent <[email protected]> * A few fixes in CorrDiff Signed-off-by: Charlelie Laurent <[email protected]> --------- Signed-off-by: Charlelie Laurent <[email protected]>
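The t-EDM commits above add a Student-t noise distribution to `StackedRandomGenerator`. A stdlib sketch of how a Student-t draw is built from Gaussian and chi-squared primitives (the PR's implementation uses torch generators; this only illustrates the distribution):

```python
import math
import random

def sample_student_t(rng: random.Random, nu: float) -> float:
    # Student-t with nu degrees of freedom: t = Z / sqrt(V / nu), where
    # Z ~ N(0, 1) and V ~ chi-squared(nu), i.e. a Gamma(nu/2, 2) draw.
    # Tails are heavier than Gaussian for small nu and approach N(0, 1)
    # as nu grows, which is the appeal for heavy-tailed diffusion noise.
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(nu / 2.0, 2.0)
    return z / math.sqrt(v / nu)
```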
…1035) * Bumps ruff from 0.0.290 to 0.12.5. Removes black, which is superseded by ruff-format. * Refactor ruff configuration in pyproject.toml to use non-deprecated settings * Migrates pre-commit settings to repo-wide settings * Replaces black with ruff-format in Makefile and updates linting commands to use ruff-check. * Adds Ruff note to Changelog * Update CONTRIBUTING.md to reflect changes in CI checks, replacing black with ruff for formatting and linting instructions. * Avoids acronyms * Adds docs about Ruff * Markdownlint fixes * Implements Ruff safe fixes * Adds hand-written fixes for lint errors * Refactors _check_checkpoint to remove duplicate code * Addresses Ruff lint issues with tarfile.extractall(), with appropriate modifications for back-compatibility with Python < 3.12.
* add patching support for deterministic sampler * code cleanup and unit test update * use patching wrapper and fix pytest functions * change utils.generative to utils.diffusion * set default to torch.float64 * do compilation in deterministic sampler * update * Identified and fixed critical bug in stochastic_sampler and deterministic_sampler Signed-off-by: Charlelie Laurent <[email protected]> * Format CHANGELOG.md Signed-off-by: Charlelie Laurent <[email protected]> * Implements wrapper selector to fix compile issues in tests Signed-off-by: Charlelie Laurent <[email protected]> --------- Signed-off-by: Charlelie Laurent <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Charlelie Laurent <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Charlelie Laurent <[email protected]>
* resolving merge conflicts with main * fixing bugs * fixing CI errors * fixing merge conflicts in config * modifying Changelog * Update config.yaml * cpu processing in area_weighted_sampling * fixing naming issue in domino_datapipe.py * Update physicsnemo/models/domino/model.py Co-authored-by: Peter Sharpe <[email protected]> * Update physicsnemo/models/domino/model.py Co-authored-by: Peter Sharpe <[email protected]> * Update physicsnemo/models/domino/model.py Co-authored-by: Peter Sharpe <[email protected]> * Update physicsnemo/models/domino/model.py Co-authored-by: Peter Sharpe <[email protected]> * Update physicsnemo/models/domino/model.py Co-authored-by: Peter Sharpe <[email protected]> * Update examples/cfd/external_aerodynamics/domino/src/conf/config.yaml Co-authored-by: Peter Sharpe <[email protected]> * Update physicsnemo/models/domino/model.py Co-authored-by: Peter Sharpe <[email protected]> * Update examples/cfd/external_aerodynamics/domino/src/train.py Co-authored-by: Peter Sharpe <[email protected]> * fixing PR comments * addressing PR comments * fixing CI issues * fixing pytest issues in utils --------- Co-authored-by: Peter Sharpe <[email protected]>
* Add generic neighbor finding function that is suitable to use in FigConvNet, DoMINO, and mesh graph data pipes. * Fix an illegal device access when using multiple GPUs. * Performance tuning of neighbor query * Add warp-enabled radius search. Also add testing. * Update neighbor search tools to ensure we use 0 as the null index instead of -1 * Switch domino to use the new radius search function instead of ball query. This is functionally the same, though shows a performance enhancement. * Remove neighborlist function. Replaced with radius_search. * Using typing for annotations for CI * Update examples/minimal/neighbor_list/warp_neighbor_list.py Co-authored-by: Peter Sharpe <[email protected]> * Address nits and minor comments from PR review. * Relocate radius search code. * Remove old folders; goes with previous commit. * Update test import. * The CI container does not accept list[int] as an acceptable type for pytorch. * Make sure radius search is exported as a function, not a module. * Fixing formatting, since the linter appears to have changed .... * Remove cuda opcheck test temporarily --------- Co-authored-by: Peter Sharpe <[email protected]>
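The neighbor-search commits above replace ball query with a generic fixed-radius search and switch the null (padding) index from -1 to 0. A brute-force reference sketch of what such a search computes; the actual implementation is warp-accelerated, and this naive O(N·M) version is only for checking semantics:

```python
def radius_search(points, queries, radius):
    # For each 3D query point, return the indices of all points within
    # `radius`. The production code pads each result to a fixed length
    # with a null index; here we return variable-length lists instead.
    r2 = radius * radius
    results = []
    for qx, qy, qz in queries:
        hits = [
            i for i, (px, py, pz) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 + (pz - qz) ** 2 <= r2
        ]
        results.append(hits)
    return results
```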
** NOT FOR RELEASE **
This is an overhaul of the FSDP tutorial. Let's bring it in after the release goes out.
PhysicsNeMo Pull Request
Description
Checklist
Dependencies