Skip to content

Conversation

taozhiwei
Copy link
Contributor

@taozhiwei taozhiwei commented Aug 12, 2025

This change is to enable _get_default_process_group_backend_for_device to support more hardware platforms and no longer hard code for cuda


📚 Documentation preview 📚: https://pytorch-lightning--21057.org.readthedocs.build/en/21057/

@github-actions github-actions bot added the fabric lightning.fabric.Fabric label Aug 12, 2025
@Borda Borda changed the title let _get_default_process_group_backend_for_device support more hardware platforms let _get_default_process_group_backend_for_device support more hardware platforms Aug 12, 2025
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just please add a test with mock if needed :)

Copy link

codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 87%. Comparing base (5a2b678) to head (5f43a0c).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #21057   +/-   ##
=======================================
- Coverage      87%      87%   -0%     
=======================================
  Files         268      268           
  Lines       23367    23370    +3     
=======================================
+ Hits        20379    20381    +2     
- Misses       2988     2989    +1     

@taozhiwei
Copy link
Contributor Author

looks good, just please add a test with mock if needed :)

I have added the test, and it seems that the last CI failure was caused by the running environment, could you please help trigger it again? Thank you

…l _get_default_process_group_backend_for_device

Signed-off-by: taozhiwei <[email protected]>
@taozhiwei taozhiwei changed the title let _get_default_process_group_backend_for_device support more hardware platforms let _get_default_process_group_backend_for_device support more hardware platforms Aug 13, 2025
@Borda Borda changed the title let _get_default_process_group_backend_for_device support more hardware platforms let _get_default_process_group_backend_for_device support more hardware platforms Aug 13, 2025
Co-authored-by: Nicki Skafte Detlefsen <[email protected]>
@taozhiwei
Copy link
Contributor Author

It seems this CI failure is not related to this change. Can you help trigger another CI? Thank you
@SkafteNicki @bhimrazy

@taozhiwei
Copy link
Contributor Author

taozhiwei commented Aug 14, 2025

@SkafteNicki
Copy link
Collaborator

@taozhiwei yes it seems unrelated to this issue. When we get it fixed the PR will be merged.

@taozhiwei
Copy link
Contributor Author

yes it seems unrelated to this issue. When we get it fixed the PR will be merged.

Thank you very much !

@Borda Borda merged commit 119a640 into Lightning-AI:master Aug 15, 2025
117 of 120 checks passed
Borda added a commit that referenced this pull request Aug 18, 2025
@Borda
Copy link
Member

Borda commented Aug 18, 2025

needs to be reverted and added again as this breaks DDP fork...
@taozhiwei could you pls try to add it again 🦩

Borda added a commit that referenced this pull request Aug 18, 2025
…1057 (#21092)

* debug failing tests for Fabric with `ddp_fork` on PT 2.8

* Revert "let  `_get_default_process_group_backend_for_device` support more hardware platforms (#21057)"

This reverts commit 119a640.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Borda added a commit that referenced this pull request Aug 18, 2025
…ware platforms (#21057)

* support more hardware platforms and no longer hard code cuda when call _get_default_process_group_backend_for_device
* Apply suggestions from code review

---------

Signed-off-by: taozhiwei <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nicki Skafte Detlefsen <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
(cherry picked from commit 119a640)
Borda added a commit that referenced this pull request Aug 28, 2025
…ware platforms (#21057)

* support more hardware platforms and no longer hard code cuda when call _get_default_process_group_backend_for_device
* Apply suggestions from code review

---------

Signed-off-by: taozhiwei <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nicki Skafte Detlefsen <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
(cherry picked from commit 119a640)
Borda added a commit that referenced this pull request Aug 28, 2025
…1057 (#21092)

* debug failing tests for Fabric with `ddp_fork` on PT 2.8

* Revert "let  `_get_default_process_group_backend_for_device` support more hardware platforms (#21057)"

This reverts commit 119a640.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
(cherry picked from commit 9ca360b)
lantiga pushed a commit that referenced this pull request Aug 29, 2025
…ware platforms (#21057)

* support more hardware platforms and no longer hard code cuda when call _get_default_process_group_backend_for_device
* Apply suggestions from code review

---------

Signed-off-by: taozhiwei <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nicki Skafte Detlefsen <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
(cherry picked from commit 119a640)
lantiga pushed a commit that referenced this pull request Aug 29, 2025
…1057 (#21092)

* debug failing tests for Fabric with `ddp_fork` on PT 2.8

* Revert "let  `_get_default_process_group_backend_for_device` support more hardware platforms (#21057)"

This reverts commit 119a640.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
(cherry picked from commit 9ca360b)
Borda added a commit that referenced this pull request Sep 4, 2025
* let  `_get_default_process_group_backend_for_device` support more hardware platforms (#21057)

* support more hardware platforms and no longer hard code cuda when call _get_default_process_group_backend_for_device
* Apply suggestions from code review

---------

* try it
* chlog

---------

Signed-off-by: taozhiwei <[email protected]>
Co-authored-by: taozhiwei <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nicki Skafte Detlefsen <[email protected]>
Borda added a commit that referenced this pull request Sep 4, 2025
* let  `_get_default_process_group_backend_for_device` support more hardware platforms (#21057)

* support more hardware platforms and no longer hard code cuda when call _get_default_process_group_backend_for_device
* Apply suggestions from code review

---------

* try it
* chlog

---------

Signed-off-by: taozhiwei <[email protected]>
Co-authored-by: taozhiwei <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nicki Skafte Detlefsen <[email protected]>

(cherry picked from commit 3c81316)
lantiga pushed a commit that referenced this pull request Sep 5, 2025
* let  `_get_default_process_group_backend_for_device` support more hardware platforms (#21057)

* support more hardware platforms and no longer hard code cuda when call _get_default_process_group_backend_for_device
* Apply suggestions from code review

---------

* try it
* chlog

---------

Signed-off-by: taozhiwei <[email protected]>
Co-authored-by: taozhiwei <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nicki Skafte Detlefsen <[email protected]>

(cherry picked from commit 3c81316)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fabric lightning.fabric.Fabric
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants