[Ascend NPU] Add initial support for Ascend devices #99


Draft: noemotiovon wants to merge 8 commits into main

Conversation

noemotiovon

What does this PR do?

This commit introduces native support for Ascend NPUs in the ROLL project, while preserving compatibility with existing CUDA-based infrastructure.

Key changes include:

  • Introduced a unified device abstraction interface to encapsulate device initialization, memory management, and synchronization, enabling extensibility for both CUDA and Ascend devices (a rough sketch of the idea follows below).
  • Replaced direct usage of CUDA APIs and Ray CUDA resource APIs with the new abstraction layer to support multi-device environments.
  • Integrated Ascend inference backend via vLLM + vLLM-ascend.
  • Added experimental support for training with MindSpeed on Ascend hardware.

This enhancement lays the groundwork for seamless switching across CUDA and Ascend devices.
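
As a rough illustration of the abstraction idea (the class and method names below are hypothetical, not the actual interface added by this PR), the CUDA and Ascend backends expose the same operations so callers never touch torch.cuda or torch.npu directly:

import torch

class DevicePlatform:
    # Operations the rest of the codebase is allowed to call.
    device_type: str = "cpu"

    def is_available(self) -> bool:
        return False

    def device_count(self) -> int:
        return 0

    def synchronize(self) -> None:
        pass

    def empty_cache(self) -> None:
        pass

class CudaPlatform(DevicePlatform):
    device_type = "cuda"

    def is_available(self) -> bool:
        return torch.cuda.is_available()

    def device_count(self) -> int:
        return torch.cuda.device_count()

    def synchronize(self) -> None:
        torch.cuda.synchronize()

    def empty_cache(self) -> None:
        torch.cuda.empty_cache()

class NpuPlatform(DevicePlatform):
    device_type = "npu"

    def is_available(self) -> bool:
        import torch_npu  # noqa: F401  (import registers the torch.npu namespace)
        return torch.npu.is_available()

    def device_count(self) -> int:
        return torch.npu.device_count()

    def synchronize(self) -> None:
        torch.npu.synchronize()

    def empty_cache(self) -> None:
        torch.npu.empty_cache()

def current_platform() -> DevicePlatform:
    # Prefer the NPU backend when torch_npu is importable and reports a device.
    for platform in (NpuPlatform(), CudaPlatform()):
        try:
            if platform.is_available():
                return platform
        except ImportError:
            continue
    return DevicePlatform()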

Related to #83

@CLAassistant

CLAassistant commented Jul 15, 2025

CLA assistant check
All committers have signed the CLA.

@noemotiovon
Author

Environment

  • CANN: 8.1 RC1
  • torch: 2.5.1
  • torch-npu: 2.5.1
  • vllm: 0.8.4
  • vllm-ascend: 0.8.4rc2
  • deepspeed: 0.16.4
  • sglang: not supported yet
  • megatron: not supported yet
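
A quick way to confirm this stack is wired up (illustrative snippet; it assumes torch-npu is installed, whose import registers the torch.npu namespace):

# Print installed versions and check that the NPU backend is visible to torch.
import importlib.metadata as md

import torch
import torch_npu  # noqa: F401  (registers torch.npu)

for pkg in ("torch", "torch-npu", "vllm", "vllm-ascend", "deepspeed"):
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")

print("NPU available:", torch.npu.is_available())
print("NPU count:", torch.npu.device_count())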

@noemotiovon
Author

Test 1 Script

bash examples/agentic_demo/run_agentic_pipeline_frozen_lake_single_node_demo.sh

Yaml

defaults:
  - ../config/envs@_here_
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "agentic_pipeline"
seed: 42
logging_dir: ./output/logs
output_dir: ./output
render_save_dir: /home/lichenguang25/tmp/data/oss_bucket_0/yali/output/render
system_envs:
  USE_MODELSCOPE: '1'

#track_with: wandb
#tracker_kwargs:
#  api_key:
#  project: roll-agentic
#  name: ${exp_name}_frozen_lake
#  notes: "agentic_pipeline"
#  tags:
#    - agentic
#    - roll
#    - baseline


track_with: tensorboard
tracker_kwargs:
  log_dir: /home/lichenguang25/tmp/data/oss_bucket_0/yali/llm/tensorboard/roll_exp/agentic_sokoban

num_gpus_per_node: 1

max_steps: 100
save_steps: 10000
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false

rollout_batch_size: 16
val_batch_size: 16
sequence_length: 4096

reward_clip: 20
advantage_clip: 10.0
ppo_epochs: 1
adv_estimator: "reinforce"
#pg_clip: 0.1
#dual_clip_loss: True
init_kl_coef: 0.0
whiten_advantages: true
entropy_loss_coef: 0

pretrain: Qwen/Qwen2.5-0.5B-Instruct
reward_pretrain: Qwen/Qwen2.5-0.5B-Instruct

actor_train:
  model_args:
    flash_attn: fa2
    disable_gradient_checkpointing: false
    dtype: fp16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 16
    warmup_steps: 10
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: deepspeed_train
    strategy_config: ${deepspeed_zero2}
    # strategy_name: megatron_train
    # strategy_config:
    #   tensor_model_parallel_size: 1
    #   pipeline_model_parallel_size: 1
    #   expert_model_parallel_size: 1
    #   use_distributed_optimizer: true
    #   recompute_granularity: full
  device_mapping: list(range(0,1))
  infer_batch_size: 1

actor_infer:
  model_args:
    flash_attn: fa2
    disable_gradient_checkpointing: true
    dtype: fp16
  generating_args:
    max_new_tokens: 32 # single-turn response length
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: 1
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      block_size: 16
      load_format: auto
  device_mapping: list(range(0,1))
  infer_batch_size: 1

reference:
  model_args:
    flash_attn: fa2
    disable_gradient_checkpointing: true
    dtype: fp16
    model_type: ~
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: hf_infer
    strategy_config: ~
  device_mapping: list(range(0,1))
  infer_batch_size: 1

enable_response_mask: True
action_sep: "||"
use_turn_scores: False # important for GAE when applying token-level rewards to compute token-level advantages. If False, the sum of scores is used as the reward for the last turn.
enable_think: False # False -> no think RL
max_actions_per_traj: 10
reward_normalization:
  grouping: tags # can be tags (env_type) / traj_group_id (group) / batch (rollout_batch)...; reward/adv are computed per group_by
  method: identity # asym_clip / identity / mean_std

custom_envs:
  SimpleSokoban:
    env_type: sokoban
    max_actions_per_traj:  ${max_actions_per_traj} # used in environment state manager to control the actual max actions executed per trajectory
    max_steps_per_traj: ${max_actions_per_traj}
    env_instruction: "You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>"
    max_tokens: 100 # used to curate llm prompt "max words", not used for rollout
    env_config: # keys should be a subset of SokobanConfig
      dim_x: 6
      dim_y: 6
      num_boxes: 1
      max_steps: ${max_actions_per_traj}
  LargerSokoban:
    env_type: sokoban
    max_actions_per_traj:  ${max_actions_per_traj}
    max_steps_per_traj: ${max_actions_per_traj}
    env_instruction: "You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>"
    max_tokens: 100
    env_config:
      dim_x: 8
      dim_y: 8
      num_boxes: 2
      max_steps: ${max_actions_per_traj}
      search_depth: 10
  SokobanDifferentGridVocab:
    env_type: sokoban
    max_actions_per_traj:  ${max_actions_per_traj}
    max_steps_per_traj: ${max_actions_per_traj}
    env_instruction: "You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>"
    max_tokens: 100
    env_config: # keys should be a subset of SokobanConfig
      search_depth: 30
      dim_x: 6
      dim_y: 6
      num_boxes: 1
      max_steps: ${max_actions_per_traj}
      grid_lookup: { 0: "W", 1: ".", 2: "G", 3: "C", 4: "B", 5: "A", 6: "@" }
      grid_vocab: { "W": "wall", ".": "empty", "G": "target", "C": "box on target", "B": "box", "A": "player", "@": "player on target" }
  FrozenLake:
    env_type: frozen_lake
    max_actions_per_traj:  ${max_actions_per_traj}
    max_steps_per_traj: ${max_actions_per_traj}
    env_instruction: "You are solving the FrozenLake puzzle. Forbid the whole and go to the target. You may move to the unintended direction due to the slippery ice. The answer must be one of action in a turn, format is <answer>Right</answer>"
    max_tokens: 100
    env_config:
      is_slippery: false

train_env_manager:
  format_penalty: -0.001
  env_groups: 1
  group_size: 1
  tags: [FrozenLake]
  n_groups: [1] # If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation

val_env_manager:
  env_groups: 2
  group_size: 1 # should be set to 1 because val temperature is set to 0 and same prompt leads to same output
  tags: [SimpleSokoban, FrozenLake]
  n_groups: [1, 1] # TODO: If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation
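
For readers unfamiliar with the ${...} syntax above (e.g. ${max_actions_per_traj}, ${deepspeed_zero2}): these are OmegaConf interpolations that Hydra resolves when composing the config. A minimal standalone example of the mechanism:

# Illustration of OmegaConf interpolation as used throughout the YAML above.
from omegaconf import OmegaConf

yaml_snippet = """
max_actions_per_traj: 10
custom_envs:
  SimpleSokoban:
    max_steps_per_traj: ${max_actions_per_traj}
"""
cfg = OmegaConf.create(yaml_snippet)
OmegaConf.resolve(cfg)
print(cfg.custom_envs.SimpleSokoban.max_steps_per_traj)  # -> 10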

Result

(result screenshots attached)

@noemotiovon
Author

noemotiovon commented Jul 29, 2025

Test 2 Script

bash examples/qwen2.5-0.5B-agentic_ds/run_agentic_pipeline_sokoban.sh

Yaml

defaults:
  - ../config/traj_envs@_here_
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "agentic_pipeline"
seed: 42
logging_dir: ./output/logs
output_dir: ./output
render_save_dir: ./output/render
system_envs:
  USE_MODELSCOPE: '1'

#track_with: wandb
#tracker_kwargs:
#  api_key:
#  project: roll-agentic
#  name: ${exp_name}_sokoban
#  notes: "agentic_pipeline"
#  tags:
#    - agentic
#    - roll
#    - baseline

track_with: tensorboard
tracker_kwargs:
  log_dir: /home/lichenguang25/tmp/data/oss_bucket_0/yali/llm/tensorboard/roll_exp/agentic_sokoban


checkpoint_config:
  type: file_system
  output_dir: /home/lichenguang25/tmp/data/cpfs_0/rl_examples/models/${exp_name}

num_gpus_per_node: 4

max_steps: 1024
save_steps: 10000
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false

rollout_batch_size: 128
val_batch_size: 1024
sequence_length: 2048

advantage_clip: 0.2
ppo_epochs: 1
adv_estimator: "grpo"
#pg_clip: 0.1
#dual_clip_loss: True
init_kl_coef: 0.0
whiten_advantages: true
entropy_loss_coef: 0
max_grad_norm: 1.0

pretrain: Qwen/Qwen2.5-0.5B-Instruct
reward_pretrain: Qwen/Qwen2.5-0.5B-Instruct

actor_train:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 2
    gradient_accumulation_steps: 64
    warmup_steps: 10
    lr_scheduler_type: cosine
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: deepspeed_train
    strategy_config: ${deepspeed_zero3}
    # strategy_name: megatron_train
    # strategy_config:
    #   tensor_model_parallel_size: 1
    #   pipeline_model_parallel_size: 1
    #   expert_model_parallel_size: 1
    #   use_distributed_optimizer: true
    #   recompute_granularity: full
  device_mapping: list(range(0,2))
  infer_batch_size: 1

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: 128 # single-turn response length
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: 1
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      block_size: 16
      load_format: auto
  device_mapping: list(range(2,3))

reference:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: hf_infer
    strategy_config: ~
  device_mapping: list(range(3,4))
  infer_batch_size: 1

reward_normalization:
  grouping: traj_group_id # can be tags (env_type) / traj_group_id (group) / batch (rollout_batch)...; reward/adv are computed per group_by
  method: mean_std # asym_clip / identity / mean_std

train_env_manager:
  format_penalty: -0.15 # sokoban env penalty_for_step=-0.1
  max_env_num_per_worker: 16
  num_env_groups: 128
  # under the same group, the env config and env seed are ensured to be equal
  group_size: 8
  tags: [SimpleSokoban]
  num_groups_partition: [128] # If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation

val_env_manager:
  max_env_num_per_worker: 32
  num_env_groups: 1024
  group_size: 1 # should be set to 1 because val temperature is set to 0 and same prompt leads to same output
  tags: [SimpleSokoban, LargerSokoban, SokobanDifferentGridVocab, FrozenLake]
  num_groups_partition: [256, 256, 256, 256] # TODO: If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation


# Here, you can override variables defined in the imported envs. max_tokens_per_step: 128 in custom_env.SimpleSokoban, here replaced by 64
max_tokens_per_step: 64

custom_envs:
  SimpleSokoban:
    ${custom_env.SimpleSokoban}
  LargerSokoban:
    ${custom_env.LargerSokoban}
  SokobanDifferentGridVocab:
    ${custom_env.SokobanDifferentGridVocab}
  FrozenLake:
    ${custom_env.FrozenLake}
  FrozenLakeThink:
    ${custom_env.FrozenLakeThink}
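
For reference, grouping: traj_group_id with method: mean_std standardizes rewards within each trajectory group before advantages are computed. A minimal numpy sketch of that normalization (illustrative, not the ROLL implementation):

# Per-group mean/std reward normalization, sketched for clarity only.
import numpy as np

def mean_std_normalize(rewards: np.ndarray, group_ids: np.ndarray) -> np.ndarray:
    out = np.empty_like(rewards, dtype=np.float64)
    for gid in np.unique(group_ids):
        mask = group_ids == gid
        group = rewards[mask]
        out[mask] = (group - group.mean()) / (group.std() + 1e-8)
    return out

rewards = np.array([1.0, 0.0, 0.5, 2.0])
group_ids = np.array([0, 0, 1, 1])
print(mean_std_normalize(rewards, group_ids))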


Result

(result screenshot attached)

@HuangJoJo
Collaborator

Thanks for your contribution to ROLL! This design helps a lot with support work for other hardware.

noemotiovon and others added 7 commits August 12, 2025 03:22
…raction

This commit introduces native support for Ascend NPUs in the ROLL project,
while preserving compatibility with existing CUDA-based infrastructure.

Key changes include:
- Introduced a unified device abstraction interface to encapsulate device
  initialization, memory management, and synchronization, enabling extensibility
  for both CUDA and Ascend devices.
- Replaced direct usage of CUDA APIs and Ray CUDA resource APIs with the new
  abstraction layer to support multi-device environments.
- Integrated Ascend inference backend via vLLM + vLLM-ascend.
- Added experimental support for training with MindSpeed on Ascend hardware.

This enhancement lays the groundwork for seamless switching across CUDA and
Ascend devices.

Signed-off-by: noemotiovon <[email protected]>
…raction


# Conflicts:
#	roll/distributed/executor/cluster.py
#	roll/distributed/executor/worker.py
#	roll/distributed/scheduler/initialize.py
Signed-off-by: noemotiovon <[email protected]>
Signed-off-by: noemotiovon <[email protected]>
@noemotiovon
Author

Test 3 Script

bash examples/qwen2.5-0.5B-agentic/run_agentic_pipeline_frozen_lake.sh

Yaml

defaults:
  - ../config/traj_envs@_here_
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "agentic_pipeline"
seed: 42
logging_dir: ./output/logs
output_dir: ./output
render_save_dir: ./output/render
system_envs:
  USE_MODELSCOPE: '1'

#track_with: wandb
#tracker_kwargs:
#  api_key:
#  project: roll-agentic
#  name: ${exp_name}_sokoban
#  notes: "agentic_pipeline"
#  tags:
#    - agentic
#    - roll
#    - baseline

track_with: tensorboard
tracker_kwargs:
  log_dir: /home/lichenguang25/tmp/data/oss_bucket_0/yali/llm/tensorboard/roll_exp/agentic_frozen_lake


checkpoint_config:
  type: file_system
  output_dir: /home/lichenguang25/tmp/data/cpfs_0/rl_examples/models/${exp_name}

num_gpus_per_node: 4

max_steps: 1024
save_steps: 10000
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false

rollout_batch_size: 128
val_batch_size: 1024
sequence_length: 2048

advantage_clip: 0.2
ppo_epochs: 1
adv_estimator: "grpo"
#pg_clip: 0.1
#dual_clip_loss: True
init_kl_coef: 0.0
whiten_advantages: true
entropy_loss_coef: 0
max_grad_norm: 1.0

pretrain: Qwen/Qwen2.5-0.5B-Instruct
reward_pretrain: Qwen/Qwen2.5-0.5B-Instruct

actor_train:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 2
    gradient_accumulation_steps: 64
    warmup_steps: 10
    lr_scheduler_type: cosine
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: deepspeed_train
    strategy_config: ${deepspeed_zero3}
    # strategy_name: megatron_train
    # strategy_config:
    #   tensor_model_parallel_size: 1
    #   pipeline_model_parallel_size: 1
    #   expert_model_parallel_size: 1
    #   use_distributed_optimizer: true
    #   recompute_granularity: full
  device_mapping: list(range(0,2))
  infer_batch_size: 1

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: 128 # single-turn response length
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: 1
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      block_size: 16
      load_format: auto
  device_mapping: list(range(2,3))

reference:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: hf_infer
    strategy_config: ~
  device_mapping: list(range(3,4))
  infer_batch_size: 1

reward_normalization:
  grouping: traj_group_id # can be tags (env_type) / traj_group_id (group) / batch (rollout_batch)...; reward/adv are computed per group_by
  method: mean_std # asym_clip / identity / mean_std

train_env_manager:
  format_penalty: -0.15 # sokoban env penalty_for_step=-0.1
  max_env_num_per_worker: 16
  num_env_groups: 128
  # under the same group, the env config and env seed are ensured to be equal
  group_size: 8
  tags: [FrozenLake]
  num_groups_partition: [128] # If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation

val_env_manager:
  max_env_num_per_worker: 32
  num_env_groups: 1024
  group_size: 1 # should be set to 1 because val temperature is set to 0 and same prompt leads to same output
  tags: [SimpleSokoban, LargerSokoban, SokobanDifferentGridVocab, FrozenLake]
  num_groups_partition: [256, 256, 256, 256] # TODO: If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation


# Here, you can override variables defined in the imported envs. max_tokens_per_step: 128 in custom_env.SimpleSokoban, here replaced by 64
max_tokens_per_step: 64

custom_envs:
  SimpleSokoban:
    ${custom_env.SimpleSokoban}
  LargerSokoban:
    ${custom_env.LargerSokoban}
  SokobanDifferentGridVocab:
    ${custom_env.SokobanDifferentGridVocab}
  FrozenLake:
    ${custom_env.FrozenLake}
  FrozenLakeThink:
    ${custom_env.FrozenLakeThink}
  FrozenLakeLocallyDefineExamples:  # Can import from unified envs config or define dict locally
    env_type: frozen_lake
    max_tokens_per_step: ${max_tokens_per_step}
    user_prompt_format: ${user_prompt_think_format}
    env_manager_cls: ${env_manager_cls}
    use_thread_lock: true
    env_config:
      env_instruction: "You are solving the FrozenLake puzzle. Forbid the whole and go to the target. You may move to the unintended direction due to the slippery ice. The answer must be one of action in a turn, format is <answer>Right</answer>"
      action_pattern: ${think_action_pattern}
      max_steps: ${max_actions_per_traj}
      is_slippery: false
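
The device_mapping entries above place actor_train, actor_infer, and reference on disjoint devices; on Ascend this depends on Ray accounting for NPUs instead of GPUs. A hedged sketch of pinning a Ray actor to one NPU via a resource tag (the resource name and the visible-devices variable are assumptions about the setup, not ROLL's actual wiring):

# Illustrative only: expose NPUs as a Ray resource and schedule one worker per NPU.
import os
import ray

ray.init(resources={"NPU": 4})  # e.g. a single node exposing 4 NPUs

@ray.remote(resources={"NPU": 1})
class InferWorker:
    def visible_devices(self) -> str:
        # Report whichever NPUs the launcher made visible to this process;
        # ASCEND_RT_VISIBLE_DEVICES plays the role CUDA_VISIBLE_DEVICES plays on GPUs.
        return os.environ.get("ASCEND_RT_VISIBLE_DEVICES", "unset")

worker = InferWorker.remote()
print(ray.get(worker.visible_devices.remote()))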

Result

(result screenshot attached)

@lowdy1

lowdy1 commented Aug 15, 2025

Test 4 Script

python examples/start_agentic_rollout_pipeline.py --config_path qwen2.5-0.5B-agentic  --config_name agentic_rollout_sokoban

Yaml

defaults:
  - ../config/traj_envs@_here_
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "agentic_pipeline"
seed: 42
render_save_dir: ./oss_bucket_0/yali/llm/output/render

#track_with: wandb
#tracker_kwargs:
#  api_key:
#  project: roll-agentic
#  name: ${exp_name}_sokoban
#  notes: "agentic_pipeline"
#  tags:
#    - agentic
#    - roll
#    - baseline

track_with: tensorboard
tracker_kwargs:
  log_dir: ./oss_bucket_0/yali/llm/tensorboard/roll_exp/agentic_sokoban

num_gpus_per_node: 4

max_steps: 128
save_steps: 10000

rollout_batch_size: 16
sequence_length: 1024

pretrain: Qwen/Qwen2.5-0.5B-Instruct

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: 128 # single-turn response length
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: 1
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      block_size: 16
      load_format: auto # should set 'auto' here, because default load_format is 'dummy'
  device_mapping: list(range(0,4))

train_env_manager:
  format_penalty: -0.15 # sokoban env penalty_for_step=-0.1
  max_env_num_per_worker: 1
  num_env_groups: 1
  # under the same group, the env config and env seed are ensured to be equal
  group_size: 1
  tags: [SimpleSokoban]
  num_groups_partition: [1] # If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation


# Here, you can override variables defined in the imported envs. max_tokens_per_step: 128 in custom_env.SimpleSokoban, here replaced by 64
max_tokens_per_step: 64

custom_envs:
  SimpleSokoban:
    ${custom_env.SimpleSokoban}
  LargerSokoban:
    ${custom_env.LargerSokoban}
  SokobanDifferentGridVocab:
    ${custom_env.SokobanDifferentGridVocab}
  FrozenLake:
    ${custom_env.FrozenLake}
  FrozenLakeThink:
    ${custom_env.FrozenLakeThink}
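
The vllm strategy_config above corresponds to standard vLLM engine arguments. A rough standalone equivalent is sketched below; it assumes vllm and vllm-ascend are installed so the NPU platform is auto-detected, and it assumes (not verified here) that ROLL forwards these keys to the engine unchanged:

# Rough standalone equivalent of the actor_infer vllm strategy above (illustrative).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    gpu_memory_utilization=0.8,   # strategy_config.gpu_memory_utilization
    block_size=16,                # strategy_config.block_size
    load_format="auto",           # the pipeline's default is 'dummy', so set 'auto' explicitly
)

params = SamplingParams(
    max_tokens=128,               # generating_args.max_new_tokens
    top_p=0.99,
    top_k=100,
    temperature=0.99,
)
print(llm.generate(["You are solving the Sokoban puzzle."], params)[0].outputs[0].text)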
Result

(result screenshot attached)

Signed-off-by: noemotiovon <[email protected]>
@noemotiovon
Author

Test 5 Script

bash examples/qwen2.5-3B-dpo_megatron/run_dpo_pipeline.sh 

Yaml

defaults:
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

num_gpus_per_node: 4

exp_name: "distill_zero3"
seed: 42
logging_dir: ./output/logs
output_dir: ./output

checkpoint_config:
  type: file_system
  output_dir: /home/lichenguang25/tmp/data/oss_bucket_0/chuye/roll/distill/models/zero3

save_steps: 100
logging_steps: 1
resume_from_checkpoint: false

student_pretrain: Qwen/Qwen2.5-0.5B-Instruct
teacher_pretrain: Qwen/Qwen2.5-1.5B-Instruct

# distill config
distill_loss_weight: 0.85
kd_objective: forward_kl
distill_on_prompt: True

sequence_length: 1024
max_grad_norm: 1.0

question_key: question_zh
answer_key: answer_zh


student:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 2.0e-5
    weight_decay: 1.0e-2
    lr_scheduler_type: constant
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 1
    warmup_steps: 0
    num_train_epochs: 1
  data_args:
    template: qwen2_5
    file_name:
      - data/GSM8K_zh/GSM8K_zh.json        #https://huggingface.co/datasets/meta-math/GSM8K_zh
    preprocessing_num_workers: 4

  strategy_args:
    strategy_name: deepspeed_train
    strategy_config: ${deepspeed_zero3}
  device_mapping: list(range(0,2))

teacher:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: deepspeed_infer
    strategy_config: ${deepspeed_zero3}
  device_mapping: list(range(2,4))

system_envs:
  RAY_PROFILING: "0"
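
For context on the distill config: kd_objective: forward_kl minimizes KL(teacher || student) over the vocabulary at each token and blends it with the usual LM loss via distill_loss_weight. A minimal PyTorch sketch of that objective (illustrative, not the ROLL implementation):

# Forward-KL distillation loss: KL(teacher || student), averaged over unmasked tokens.
import torch
import torch.nn.functional as F

def forward_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
    # student_logits / teacher_logits: [batch, seq, vocab]; mask: [batch, seq]
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    teacher_p = teacher_logp.exp()
    kl = (teacher_p * (teacher_logp - student_logp)).sum(dim=-1)
    return (kl * mask).sum() / mask.sum().clamp(min=1)

# Blend with the LM loss using distill_loss_weight = 0.85 from the config:
# total_loss = 0.85 * forward_kl_loss(s, t, m) + (1 - 0.85) * lm_loss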

Result

(result screenshot attached)
