[Bug] dist_checkpointing stuck on communication with MoE models in distributed environment #2238

Description

@ETOgaosion

The Qwen3 30B MoE model gets stuck on an all_reduce collective when saving with dist_checkpointing. Running on 32 GPUs, saving a checkpoint takes 22 minutes, and rank 0 takes even longer (36 minutes).

If we do not wrap the save strategy with FullyParallelSaveStrategyWrapper, the save still gets stuck, this time on an all_gather operation. The save path looks roughly like the sketch below.
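
For context, this is a minimal sketch of how the checkpoint is saved, assuming the standard Megatron-Core dist_checkpointing API; sharded_state_dict, checkpoint_dir, and the parallelization group are placeholders, not the exact code used here:

# Sketch of the save path (placeholders: sharded_state_dict, checkpoint_dir, group).
import torch.distributed as dist
from megatron.core import dist_checkpointing
from megatron.core.dist_checkpointing.serialization import get_default_save_sharded_strategy
from megatron.core.dist_checkpointing.strategies.fully_parallel import FullyParallelSaveStrategyWrapper

def save_sharded_checkpoint(sharded_state_dict, checkpoint_dir, use_fully_parallel=True):
    # Default backend-specific save strategy (torch_dist writer).
    save_strategy = get_default_save_sharded_strategy('torch_dist')
    if use_fully_parallel:
        # Distribute the write work across ranks; with the MoE model this path
        # hangs on the all_reduce shown in the trace below.
        save_strategy = FullyParallelSaveStrategyWrapper(
            save_strategy,
            parallelization_group=dist.group.WORLD,  # placeholder group
        )
    # Without the wrapper, the hang moves to an all_gather instead.
    dist_checkpointing.save(sharded_state_dict, checkpoint_dir, save_strategy)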

[rank5]:[E627 12:26:40.073493187 ProcessGroupNCCL.cpp:629] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank5]:[E627 12:26:40.073594829 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 5]  failure detected by watchdog at work sequence id: 29 PG status: last enqueued work: 29, last completed work: 28
[rank5]:[E627 12:26:40.074084476 ProcessGroupNCCL.cpp:664] Stack trace of the failed collective: 
#0 all_reduce from /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2806
#1 wrapper from /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:81
#2 sync_all_async_calls from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/async_utils.py:149
#3 is_current_async_call_done from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/async_utils.py:228
#4 maybe_finalize_async_calls from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/async_utils.py:537
#5 save from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/base.py:228
#6 save from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/fully_parallel.py:95
#7 save from /usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/serialization.py:396
#8 save_dist_checkpointing from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/utils/megatron/dist_checkpointing.py:27
#9 save_checkpoint from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/utils/checkpoint/megatron_checkpoint_manager.py:356
#10 save_checkpoint from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/workers/megatron_workers.py:572
#11 inner from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/single_controller/base/decorator.py:540
#12 func from /opt/tiger/ray/session_2025-06-27_11-23-52_209009_2768/runtime_resources/working_dir_files/_ray_pkg_8d362968e136b949/verl/single_controller/ray/base.py:663
#13 _resume_span from /usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py:467
#14 actor_method_executor from /usr/local/lib/python3.10/dist-packages/ray/_private/function_manager.py:722
#15 main_loop from /usr/local/lib/python3.10/dist-packages/ray/_private/worker.py:892
#16 <module> from /usr/local/lib/python3.10/dist-packages/ray/_private/workers/default_worker.py:327

The Qwen3-32B dense model works fine; saving completes in seconds.
