
@as12138 as12138 commented Feb 21, 2025

For developers, you can follow the docs: docs/ascend/ascend.rst

This PR adds support for the Ascend NPU backend.
Co-authored-by: Chendong98 [email protected]
Co-authored-by: zheliuyu [email protected]
Co-authored-by: celestialli [email protected]
In this PR, we add the capability to determine the NPU device type, along with a new script for training on NPU.

Here is the change list:

  1. pyproject.toml: change the vllm version
  2. requirements-npu.txt: requirements for NPU
  3. verl/bert_padding.py: adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
  4. verl/single_controller/ray/base.py
  5. verl/third_party/vllm/vllm_spmd/dtensor_weight_loaders.py
  6. verl/trainer/fsdp_sft_trainer.py
  7. verl/utils/flops_counter.py
  8. verl/utils/fsdp_utils.py
  9. verl/workers/actor/dp_actor.py
  10. verl/workers/critic/dp_critic.py
  11. verl/workers/fsdp_workers.py
  12. verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py
  13. verl/workers/sharding_manager/fsdp_vllm.py
  14. verl/utils/device.py: get the device type for different devices
  15. docs/ascend/ascend.md
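The device-detection change (item 14) can be pictured with a minimal sketch. The function name and return values here are illustrative assumptions, not the actual verl/utils/device.py API:

```python
# Hypothetical sketch of device-type detection (illustrative names only;
# the actual verl/utils/device.py implementation may differ).
import importlib.util


def get_device_name() -> str:
    """Return the accelerator type available in this environment."""
    # Ascend NPUs are exposed to PyTorch through the torch_npu plugin.
    if importlib.util.find_spec("torch_npu") is not None:
        return "npu"
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    return "cpu"
```

Centralizing this check lets the rest of the codebase stay device-agnostic instead of hard-coding `cuda` everywhere.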

Here is our roadmap:

RoadMap

  • sft
  • ppo
  • grpo

News

[2025.03.31] Added SFT and GRPO results. Qwen2-7B-Instruct was tested on 2*8 devices, and many batch_size-related parameters had to be reduced, so these results are for reference only. We will publish reward results with the default parameters as soon as sleep mode is supported.

[2025.03.03] Modified the Ray adaptation method.

[2025.02.25] The PPO algorithm is supported for training on NPU with the FSDP backend.

[2025.02.23] The SFT algorithm is supported for training on NPU with the FSDP backend.

[2025.02.21] The GRPO algorithm is supported for training on NPU with the FSDP backend.

Requirements
We tested this PR on both Ascend NPU and GPU to ensure the same code runs on different devices. The hardware is 8 Atlas 800T A2 NPUs and 8 A100 GPUs. Other software information is shown in the following table.

| Software | Version |
|:-------|-------:|
| transformers | 4.47.1 |
| accelerate | 1.3.0 |
| torch_npu | 2.5.1.rc1 |
| CANN | 8.1.RC1 (Not Released) |

About mean error
Due to differences in hardware architecture, we cannot guarantee that the loss on Ascend NPU is exactly the same as that on GPU. In our experience, a loss difference of less than 2% is acceptable; if the difference is greater than 2%, we will try to fix it. The calculation formula is as follows.

![loss_comparison](https://github.com/user-attachments/assets/4f62f713-9240-4324-bf7d-3ae59fc85b05)

N represents the number of training steps. For more information, please refer to [Calculation accuracy description](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/LMaccuracy_0001.html).
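One plausible reading of the comparison (assuming a mean relative error over N steps; the authoritative definition is the formula image above) can be sketched as:

```python
# Illustrative sketch of the mean loss-error check. We ASSUME the formula
# is a mean relative error over N training steps; the image above is
# authoritative and may differ.
def mean_relative_loss_error(npu_losses, gpu_losses):
    """Mean of |loss_npu - loss_gpu| / |loss_gpu| over N steps."""
    assert len(npu_losses) == len(gpu_losses) and gpu_losses
    n = len(gpu_losses)
    return sum(abs(a - b) / abs(b)
               for a, b in zip(npu_losses, gpu_losses)) / n

# Under this reading, a run is acceptable when the result stays below 0.02.
```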

@as12138 as12138 changed the title from "support ASCEND NPU" to "[WIP] support ASCEND NPU" on Feb 21, 2025
@huangk10

does this pr work on multi nodes?

@as12138
Collaborator Author

as12138 commented Feb 21, 2025

> does this pr work on multi nodes?

I am currently conducting tests on a single node only, and will subsequently supplement with multi-node testing results.

@as12138 as12138 force-pushed the vllm-0.7-npu branch 2 times, most recently from 0afd136 to d496b70 on February 21, 2025 07:59
@as12138 as12138 changed the title from "[WIP] support ASCEND NPU" to "Support FSDP worker and vLLM Ascend" on Feb 21, 2025
@as12138 as12138 force-pushed the vllm-0.7-npu branch 10 times, most recently from 8b1b207 to 0b7e274 on February 22, 2025 06:48
@as12138 as12138 force-pushed the vllm-0.7-npu branch 2 times, most recently from 62af61c to fd62e2e on February 24, 2025 01:27
@as12138 as12138 force-pushed the vllm-0.7-npu branch 3 times, most recently from 45f208b to d36c1c7 on February 25, 2025 08:07
@CLAassistant

CLAassistant commented Feb 26, 2025

CLA assistant check
All committers have signed the CLA.

@as12138 as12138 force-pushed the vllm-0.7-npu branch 3 times, most recently from 6314fcf to d4309a8 on March 3, 2025 07:21
@as12138 as12138 force-pushed the vllm-0.7-npu branch 20 times, most recently from a48c6d2 to 74b520f on May 23, 2025 01:58
@vermouth1992 vermouth1992 merged commit 0528ba1 into volcengine:main May 23, 2025
37 checks passed
ETOgaosion pushed a commit to Jianbing-D/verl that referenced this pull request Jun 8, 2025
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025