dist_checkpointing stuck on communication with MoE models in distributed environment
Qwen3 30B MoE models get stuck on all_reduce communication in dist_checkpointing. When running with 32 GPUs, saving a checkpoint takes 22 minutes, and rank 0 takes even longer (36 minutes). If we do not wrap the save strategy with FullyParallelSaveStrategyWrapper, it still gets stuck, just on an all_gather operation instead (see the sketch of the save path below).
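For context, this is roughly how the save is wired up. A minimal sketch, assuming verl's save_dist_checkpointing wraps the default Megatron-Core sharded save strategy in FullyParallelSaveStrategyWrapper before calling dist_checkpointing.save; the parallelization group and do_cache_distribution flag shown here are assumptions, not the exact verl code:

import torch.distributed as dist

from megatron.core import dist_checkpointing
from megatron.core.dist_checkpointing.serialization import (
    get_default_save_sharded_strategy,
)
from megatron.core.dist_checkpointing.strategies.fully_parallel import (
    FullyParallelSaveStrategyWrapper,
)


def save_dist_checkpointing(sharded_state_dict, ckpt_dir):
    # Base strategy writing .distcp files with the torch_dist backend.
    base_strategy = get_default_save_sharded_strategy("torch_dist")
    # Wrapper that distributes the save work across ranks; this is the
    # FullyParallelSaveStrategyWrapper mentioned above.
    save_strategy = FullyParallelSaveStrategyWrapper(
        base_strategy,
        parallelization_group=dist.group.WORLD,  # assumption: whole-world group
        do_cache_distribution=True,              # assumption
    )
    # The hang is observed inside this call, in
    # dist_checkpointing.strategies.async_utils.sync_all_async_calls.
    dist_checkpointing.save(sharded_state_dict, ckpt_dir, save_strategy)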
[rank5]:[E627 12:26:40.073493187 ProcessGroupNCCL.cpp:629] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank5]:[E627 12:26:40.073594829 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 5] failure detected by watchdog at work sequence id: 29 PG status: last enqueued work: 29, last completed work: 28
[rank5]:[E627 12:26:40.074084476 ProcessGroupNCCL.cpp:664] Stack trace of the failed collective:
#0 all_reduce from /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2806
#1 wrapper from /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:81
#2 sync_all_async_calls from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/async_utils.py:149
#3 is_current_async_call_done from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/async_utils.py:228
#4 maybe_finalize_async_calls from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/async_utils.py:537
#5 save from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/base.py:228
#6 save from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/fully_parallel.py:95
#7 save from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/serialization.py:396
#8 save_dist_checkpointing from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/utils/megatron/dist_checkpointing.py:27
#9 save_checkpoint from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/utils/checkpoint/megatron_checkpoint_manager.py:356
#10 save_checkpoint from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/workers/megatron_workers.py:572
#11 inner from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/single_controller/base/decorator.py:540
#12 func from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/single_controller/ray/base.py:663
#13 _resume_span from /usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py:467
#14 actor_method_executor from /usr/local/lib/python3.10/dist-packages/ray/_private/function_manager.py:722
#15 main_loop from /usr/local/lib/python3.10/dist-packages/ray/_private/worker.py:892
#16 <module> from /usr/local/lib/python3.10/dist-packages/ray/_private/workers/default_worker.py:327
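The watchdog fires because the all_reduce in sync_all_async_calls exceeds the default 600 s NCCL timeout. As a stopgap while the slow MoE save is investigated, the process-group timeout can be raised. A minimal sketch, assuming the process group is initialized directly via torch.distributed (verl/Ray setups may initialize it elsewhere):

from datetime import timedelta

import torch.distributed as dist

# Raise the per-collective timeout so the NCCL watchdog does not kill the job
# while the checkpoint save is still in progress (the default is 10 minutes).
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=60),  # assumption: 60 min covers the slow MoE save
)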
The Qwen3-32B dense model works fine; saving completes in seconds.