
@as12138 as12138 commented Feb 21, 2025

For developers, you can follow the docs: docs/ascend/ascend.rst

This PR adds support for the Ascend NPU backend.
Co-authored-by: Chendong98 [email protected]
Co-authored-by: zheliuyu [email protected]
Co-authored-by: celestialli [email protected]
In this PR, we add the capability to determine the NPU device type, along with a new script for training on NPU.

Here is the change list:

  1. pyproject.toml: change the vllm version
  2. requirements-npu.txt: requirements for NPU
  3. verl/bert_padding.py: adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
  4. verl/single_controller/ray/base.py
  5. verl/third_party/vllm/vllm_spmd/dtensor_weight_loaders.py
  6. verl/trainer/fsdp_sft_trainer.py
  7. verl/utils/flops_counter.py
  8. verl/utils/fsdp_utils.py
  9. verl/workers/actor/dp_actor.py
  10. verl/workers/critic/dp_critic.py
  11. verl/workers/fsdp_workers.py
  12. verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py
  13. verl/workers/sharding_manager/fsdp_vllm.py
  14. verl/utils/device.py: get the device type for different devices
  15. docs/ascend/ascend.md
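The device-detection change (item 14) can be pictured with a minimal sketch. The function name and return values here are illustrative assumptions, not the actual verl/utils/device.py API:

```python
# Hypothetical sketch of device-type detection (illustrative names only;
# the actual verl/utils/device.py implementation may differ).
import importlib.util


def get_device_name() -> str:
    """Return the accelerator type available in this environment."""
    # Ascend NPUs are exposed to PyTorch through the torch_npu plugin.
    if importlib.util.find_spec("torch_npu") is not None:
        return "npu"
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    return "cpu"
```

Centralizing this check lets the rest of the codebase stay device-agnostic instead of hard-coding `cuda` everywhere.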

Here is our roadmap:

RoadMap

  • sft
  • ppo
  • grpo

News

[2025.03.31] Added SFT and GRPO results. Qwen2-7B-Instruct was tested on 2*8 devices, and many batch_size-related parameters had to be reduced, so these results are for reference only. We will publish reward results with the default parameters as soon as sleep mode is supported.

[2025.03.03] Modified the Ray adaptation method.

[2025.02.25] The PPO algorithm is supported for training on NPU with the FSDP backend.

[2025.02.23] The SFT algorithm is supported for training on NPU with the FSDP backend.

[2025.02.21] The GRPO algorithm is supported for training on NPU with the FSDP backend.

Requirements
We tested this PR on both Ascend NPU and GPU to ensure the same code runs on different devices. The hardware is 8 Atlas 800T A2 NPUs and 8 A100 GPUs. Other software information is shown in the following table.

| Software | Version |
|:-------|-------:|
| transformers | 4.47.1 |
| accelerate | 1.3.0 |
| torch_npu | 2.5.1.rc1 |
| CANN | 8.1.RC1 (Not Released) |

About mean error
Due to differences in hardware architecture, we cannot guarantee that the loss on Ascend NPU is exactly the same as that on GPU. In our experience, a loss difference of less than 2% is acceptable; if the difference is greater than 2%, we will try to fix it. The calculation formula is as follows.

![loss_comparison](https://github.com/user-attachments/assets/4f62f713-9240-4324-bf7d-3ae59fc85b05)

N represents the number of training steps. For more information, please refer to [Calculation accuracy description](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/LMaccuracy_0001.html).
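One plausible reading of the comparison (assuming a mean relative error over N steps; the authoritative definition is the formula image above) can be sketched as:

```python
# Illustrative sketch of the mean loss-error check. We ASSUME the formula
# is a mean relative error over N training steps; the image above is
# authoritative and may differ.
def mean_relative_loss_error(npu_losses, gpu_losses):
    """Mean of |loss_npu - loss_gpu| / |loss_gpu| over N steps."""
    assert len(npu_losses) == len(gpu_losses) and gpu_losses
    n = len(gpu_losses)
    return sum(abs(a - b) / abs(b)
               for a, b in zip(npu_losses, gpu_losses)) / n

# Under this reading, a run is acceptable when the result stays below 0.02.
```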

@as12138 as12138 changed the title from "support ASCEND NPU" to "[WIP] support ASCEND NPU" on Feb 21, 2025
@huangk10

does this pr work on multi nodes?

@as12138
Collaborator Author

as12138 commented Feb 21, 2025

> does this pr work on multi nodes?

I am currently conducting tests on a single node only, and will subsequently supplement with multi-node testing results.

@as12138 as12138 force-pushed the vllm-0.7-npu branch 2 times, most recently from 0afd136 to d496b70 on February 21, 2025 07:59
@as12138 as12138 changed the title from "[WIP] support ASCEND NPU" to "Support FSDP worker and vLLM Ascend" on Feb 21, 2025
@as12138 as12138 force-pushed the vllm-0.7-npu branch 10 times, most recently from 8b1b207 to 0b7e274 on February 22, 2025 06:48
@as12138 as12138 force-pushed the vllm-0.7-npu branch 2 times, most recently from 62af61c to fd62e2e on February 24, 2025 01:27
@as12138 as12138 force-pushed the vllm-0.7-npu branch 3 times, most recently from 45f208b to d36c1c7 on February 25, 2025 08:07
@CLAassistant

CLAassistant commented Feb 26, 2025

CLA assistant check
All committers have signed the CLA.

@as12138 as12138 force-pushed the vllm-0.7-npu branch 3 times, most recently from 6314fcf to d4309a8 on March 3, 2025 07:21
@as12138 as12138 force-pushed the vllm-0.7-npu branch 20 times, most recently from a48c6d2 to 74b520f on May 23, 2025 01:58
@vermouth1992 vermouth1992 merged commit 0528ba1 into volcengine:main May 23, 2025
37 checks passed
ETOgaosion pushed a commit to Jianbing-D/verl that referenced this pull request Jun 8, 2025
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025