[NPU] feat: Support FSDP worker and vLLM Ascend #332
Merged
Conversation
Does this PR work on multiple nodes?
I am currently running tests on a single node only, and will supplement with multi-node testing results later.
celestialli reviewed Feb 22, 2025
as12138 commented Feb 24, 2025
vermouth1992 approved these changes May 23, 2025
For developers, you can follow the docs: docs/ascend/ascend.rst
This PR adds support for the Ascend NPU backend.
Co-authored-by: Chendong98 [email protected]
Co-authored-by: zheliuyu [email protected]
Co-authored-by: celestialli [email protected]
In this PR, we add the capability to determine the NPU device type, and we also add a new script for training on NPU.
These are the changed files:
1. pyproject.toml: change the version of vllm
2. requirements-npu.txt: requirements for NPU
3. verl/bert_padding.py: adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
4. verl/single_controller/ray/base.py
5. verl/third_party/vllm/vllm_spmd/dtensor_weight_loaders.py
6. verl/trainer/fsdp_sft_trainer.py
7. verl/utils/flops_counter.py
8. verl/utils/fsdp_utils.py
9. verl/workers/actor/dp_actor.py
10. verl/workers/critic/dp_critic.py
11. verl/workers/fsdp_workers.py
12. verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py
13. verl/workers/sharding_manager/fsdp_vllm.py
14. verl/utils/device.py: get the device type for different devices (see the sketch below)
15. docs/ascend/ascend.md
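A minimal sketch of the device-type detection idea behind verl/utils/device.py; the function names here are illustrative and the actual helpers in the PR may differ.

```python
import torch


def is_npu_available() -> bool:
    """Return True if torch_npu is installed and an Ascend NPU is visible."""
    try:
        import torch_npu  # noqa: F401  # importing torch_npu registers the "npu" device with torch
    except ImportError:
        return False
    return torch.npu.is_available()


def get_device_name() -> str:
    """Pick the device string used for tensor placement ("npu", "cuda", or "cpu")."""
    if is_npu_available():
        return "npu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```

With a helper like this, the rest of the FSDP worker and rollout code can stay device-agnostic and simply move tensors to `get_device_name()` instead of hard-coding "cuda".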
Here is our roadmap:
- [x] SFT
- [x] PPO
- [x] GRPO
News
[2025.03.31] Added results for SFT and GRPO. Qwen2-7B-Instruct was tested on 2*8 devices, and many batch_size-related params had to be reduced, so this result is for reference only. We will announce the reward results with the default params as soon as sleep mode is supported.
[2025.03.03] Modified the Ray adaptation method.
[2025.02.25] The PPO algorithm is supported for training on NPU with the FSDP backend.
[2025.02.23] The SFT algorithm is supported for training on NPU with the FSDP backend.
[2025.02.21] The GRPO algorithm is supported for training on NPU with the FSDP backend.
Requirements
We used this PR for testing on both Ascend NPU and GPU to ensure that the same code can run on different devices. The hardware is 8 Atlas 800T A2 devices and 8 A100 GPUs. Other software information is shown in the following table.

| Software | Version |
|:-------------|------------------------:|
| transformers | 4.47.1 |
| accelerate | 1.3.0 |
| torch_npu | 2.5.1.rc1 |
| CANN | 8.1.RC1 (Not Released) |
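As a quick sanity check of the software stack above (an illustrative snippet, not part of this PR), the following can confirm that torch_npu sees the Ascend devices:

```python
import torch
import torch_npu  # importing torch_npu makes the "npu" device available in torch

print("torch:", torch.__version__)
print("torch_npu:", torch_npu.__version__)
print("NPU available:", torch.npu.is_available())
print("NPU count:", torch.npu.device_count())
```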
About mean error

Due to differences in hardware structure, we cannot guarantee that the loss on Ascend NPU is exactly the same as the loss on GPU. In our experience, a loss difference of less than 2% is acceptable; if the difference is greater than 2%, we will try to fix it. The calculation formula is as follows.
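A plausible form of this formula, assuming it is the mean relative error of the per-step losses on NPU versus GPU (the symbols below are our reading, not a verbatim copy of the original figure):

$$\text{MeanError} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|\,loss^{\mathrm{NPU}}_{i} - loss^{\mathrm{GPU}}_{i}\,\right|}{loss^{\mathrm{GPU}}_{i}}$$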
N represents the number of training steps. For more information, please refer to [Calculation accuracy description](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/LMaccuracy_0001.html).