
GRPO + LoRA fine-tuning of Qwen3-32B hangs on 8x A2 910B3 with vllm-ascend 0.9.1 #3566

@1103036128

Description

System Info

----------Python Info----------
Version : 3.11.13
Compiler : GCC 11.4.0
Build : ('main', 'Jul 26 2025 07:27:32')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 25.1.1
Directory : /usr/local/python3.11.13/lib/python3.11/site-packages/pip
vllm : 0.9.1+empty
sglang : not found.
ray : 2.46.0
torch : 2.5.1
----------verl Info-----------
Version : 0.5.0.dev
Directory : /bigdata/sde/Verl/verl/verl
Commit Hash : 0d4541f
----------Platform Info----------
Platform : Linux-5.10.0-60.18.0.50.oe2203.aarch64-aarch64-with-glibc2.35
system : Linux
node : localhost.localdomain
release : 5.10.0-60.18.0.50.oe2203.aarch64
version : #1 SMP Wed Mar 30 02:43:08 UTC 2022
----------Environment----------
CUDA is not available.
----------System Info----------
Failed to execute nvidia-smi command.
CPU Memory : 2010.33 GB
GPU Count : 0
root@localhost:/bigdata/sde/Verl/verl# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc3 Version: 24.1.rc3 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B3 | OK | 112.5 37 0 / 0 |
| 0 | 0000:C1:00.0 | 6 0 / 0 33551/ 65536 |
+===========================+===============+====================================================+
| 1 910B3 | OK | 103.6 36 0 / 0 |
| 0 | 0000:C2:00.0 | 6 0 / 0 42641/ 65536 |
+===========================+===============+====================================================+
| 2 910B3 | OK | 103.0 37 0 / 0 |
| 0 | 0000:81:00.0 | 6 0 / 0 42642/ 65536 |
+===========================+===============+====================================================+
| 3 910B3 | OK | 104.0 37 0 / 0 |
| 0 | 0000:82:00.0 | 6 0 / 0 42642/ 65536 |
+===========================+===============+====================================================+
| 4 910B3 | OK | 103.7 40 0 / 0 |
| 0 | 0000:01:00.0 | 6 0 / 0 42641/ 65536 |
+===========================+===============+====================================================+
| 5 910B3 | OK | 107.2 43 0 / 0 |
| 0 | 0000:02:00.0 | 6 0 / 0 42639/ 65536 |
+===========================+===============+====================================================+
| 6 910B3 | OK | 110.2 40 0 / 0 |
| 0 | 0000:41:00.0 | 6 0 / 0 42641/ 65536 |
+===========================+===============+====================================================+
| 7 910B3 | OK | 100.0 41 0 / 0 |
| 0 | 0000:42:00.0 | 6 0 / 0 42641/ 65536 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| 0 0 | 108846 | rayWorkerDict | 30213 |
+===========================+===============+====================================================+
| 1 0 | 108847 | rayWorkerDict | 39299 |
+===========================+===============+====================================================+
| 2 0 | 108848 | rayWorkerDict | 39299 |
+===========================+===============+====================================================+
| 3 0 | 108849 | rayWorkerDict | 39299 |
+===========================+===============+====================================================+
| 4 0 | 108850 | rayWorkerDict | 39299 |
+===========================+===============+====================================================+
| 5 0 | 108851 | rayWorkerDict | 39299 |
+===========================+===============+====================================================+
| 6 0 | 108852 | rayWorkerDict | 39299 |
+===========================+===============+====================================================+
| 7 0 | 108853 | rayWorkerDict | 39299 |
+===========================+===============+====================================================+

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Environment:
Hardware: 1x Atlas 800 (8x Ascend 910B3 A2, 64 GB HBM each)
vllm-ascend: 0.9.1
verl: main branch (commit 0d4541f)
Model: Qwen3-32B
Training Method: GRPO + LoRA
CANN: 8.2.0.0.201 (RC1)
Driver: 24.1.rc3
torch-npu: 2.5.1.post1
torch: 2.5.1

Log:

+ export WANDB_MODE=disabled
+ export HYDRA_FULL_ERROR=1
+ project_name=GRPO-Qwen3
+ exp_name=GRPO-Qwen3-32B-npu-lora
+ gen_tp=8
+ RAY_DATA_HOME=/root/verl
+ MODEL_PATH=/bigdata/sdb/Qwen3-32B
+ CKPTS_DIR=/bigdata/sde/Verl/verl/ckpts/GRPO-Qwen3/GRPO-Qwen3-32B-npu-lora
+ TRAIN_FILE=/bigdata/sde/Verl/dataset/gsm8k/train.parquet
+ TEST_FILE=/bigdata/sde/Verl/dataset/gsm8k/test.parquet
+ python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/bigdata/sde/Verl/dataset/gsm8k/train.parquet \
    data.val_files=/bigdata/sde/Verl/dataset/gsm8k/test.parquet \
    data.train_batch_size=8 \
    data.max_prompt_length=2048 \
    data.max_response_length=512 \
    data.filter_overlong_prompts=False \
    data.truncation=error \
    data.shuffle=False \
    actor_rollout_ref.model.path=/bigdata/sdb/Qwen3-32B \
    actor_rollout_ref.model.use_shm=True \
    actor_rollout_ref.model.lora_rank=8 \
    actor_rollout_ref.model.lora_alpha=16 \
    actor_rollout_ref.model.target_modules=all-linear \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.rollout.max_num_seqs=8 \
    actor_rollout_ref.rollout.max_model_len=2560 \
    actor_rollout_ref.rollout.max_num_batched_tokens=2560 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
    actor_rollout_ref.rollout.n=2 \
    actor_rollout_ref.rollout.load_format=safetensors \
    actor_rollout_ref.rollout.layered_summon=True \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.ref.fsdp_config.optimizer_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    'trainer.logger=["console","wandb"]' \
    trainer.project_name=GRPO-Qwen3 \
    trainer.experiment_name=GRPO-Qwen3-32B-npu-lora \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.default_local_dir=/bigdata/sde/Verl/verl/ckpts/GRPO-Qwen3/GRPO-Qwen3-32B-npu-lora \
    trainer.device=npu \
    trainer.resume_mode=auto \
    actor_rollout_ref.actor.fsdp_config.forward_prefetch=True \
    actor_rollout_ref.ref.fsdp_config.forward_prefetch=True \
    ++actor_rollout_ref.actor.entropy_from_logits_with_chunking=True \
    ++actor_rollout_ref.ref.entropy_from_logits_with_chunking=True \
    trainer.val_before_train=True \
    trainer.save_freq=20 \
    trainer.test_freq=5 \
    trainer.total_epochs=2
ray init kwargs: {'num_cpus': None, 'runtime_env': {'env_vars': {'TOKENIZERS_PARALLELISM': 'true', 'NCCL_DEBUG': 'WARN', 'VLLM_LOGGING_LEVEL': 'WARN', 'VLLM_ALLOW_RUNTIME_LORA_UPDATING': 'true', 'CUDA_DEVICE_MAX_CONNECTIONS': '1', 'NCCL_CUMEM_ENABLE': '0'}, 'working_dir': None}}
2025-09-22 10:49:13,042	INFO worker.py:1879 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
(TaskRunner pid=91866) TaskRunner hostname: localhost.localdomain, PID: 91866
(TaskRunner pid=91866) {'actor_rollout_ref': {'actor': {'_target_': 'verl.workers.config.FSDPActorConfig',
(TaskRunner pid=91866) ... (full config dump omitted)
(TaskRunner pid=91866) WARNING 09-22 10:49:26 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
(TaskRunner pid=91866) WARNING 09-22 10:49:26 [importing.py:29] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
(TaskRunner pid=91866) WARNING 09-22 10:49:29 [_custom_ops.py:22] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
(TaskRunner pid=91866) /bigdata/sde/Verl/verl/verl/trainer/main_ppo.py:264: UserWarning: Disabled critic as algorithm.adv_estimator != gae. If it is not intended, please set critic.enable=True
(TaskRunner pid=91866)   use_critic=need_critic(config),
(TaskRunner pid=91866) [validate_config] All configuration checks passed successfully!
(TaskRunner pid=91866) [WARNING]: The memory model path /dev/shm/verl-cache/ab35dd8c922b74d470b9a5d969f8a17e/Qwen3-32B already exists. If it is not you want, please clear it and restart the task.
(TaskRunner pid=91866) /bigdata/sde/Verl/verl/verl/utils/profiler/config.py:49: UserWarning: Torch profiler tool config is not fully supported now.
(TaskRunner pid=91866)   warnings.warn("Torch profiler tool config is not fully supported now.", stacklevel=1)
(TaskRunner pid=91866) Using dataset class: RLHFDataset
(TaskRunner pid=91866) dataset len: 7473
(TaskRunner pid=91866) Using dataset class: RLHFDataset
(TaskRunner pid=91866) dataset len: 1319
(TaskRunner pid=91866) Size of train dataloader: 934, Size of val dataloader: 1
(TaskRunner pid=91866) Total training steps: 1868
(TaskRunner pid=91866) colocated worker base class <class 'verl.single_controller.base.worker.Worker'>
(TaskRunner pid=91866) /bigdata/sde/Verl/verl/verl/trainer/ppo/ray_trainer.py:324: UserWarning: Disabled critic as algorithm.adv_estimator != gae. If it is not intended, please set critic.enable=True
(TaskRunner pid=91866)   self.use_critic = need_critic(self.config)
(pid=108846) WARNING 09-22 10:50:31 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
(pid=108846) WARNING 09-22 10:50:31 [importing.py:29] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
(pid=108850) WARNING 09-22 10:50:35 [_custom_ops.py:22] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
(WorkerDict pid=108846) [W922 10:50:38.015328500 compiler_depend.ts:989] Warning: The watchdog timeout 600000ms(which is set by init_process_group) is less than or equal to HCCL execution timeout 1836000ms! The plog may not be recorded. (function ProcessGroupHCCL)
(WorkerDict pid=108847) [WARNING]: The memory model path /dev/shm/verl-cache/ab35dd8c922b74d470b9a5d969f8a17e/Qwen3-32B already exists. If it is not you want, please clear it and restart the task.
(pid=108853) WARNING 09-22 10:50:32 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234 [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(pid=108853) WARNING 09-22 10:50:32 [importing.py:29] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation. [repeated 7x across cluster]
Loading checkpoint shards:   0%|          | 0/17 [00:00<?, ?it/s]
Loading checkpoint shards:   6%|▌         | 1/17 [00:03<01:00,  3.75s/it]
(WorkerDict pid=108846) Model config after override: Qwen3Config {
(WorkerDict pid=108846)   "architectures": [
(WorkerDict pid=108846)     "Qwen3ForCausalLM"
(WorkerDict pid=108846)   ],
(WorkerDict pid=108846)   "attention_bias": false,
(WorkerDict pid=108846)   "attention_dropout": 0.0,
(WorkerDict pid=108846)   "eos_token_id": 151645,
(WorkerDict pid=108846)   "head_dim": 128,
(WorkerDict pid=108846)   "hidden_act": "silu",
(WorkerDict pid=108846)   "hidden_size": 5120,
(WorkerDict pid=108846)   "initializer_range": 0.02,
(WorkerDict pid=108846)   "intermediate_size": 25600,
(WorkerDict pid=108846)   "max_position_embeddings": 40960,
(WorkerDict pid=108846)   "max_window_layers": 64,
(WorkerDict pid=108846)   "model_type": "qwen3",
(WorkerDict pid=108846)   "num_attention_heads": 64,
(WorkerDict pid=108846)   "num_hidden_layers": 64,
(WorkerDict pid=108846)   "num_key_value_heads": 8,
(WorkerDict pid=108846)   "pad_token_id": 151643,
(WorkerDict pid=108846)   "rms_norm_eps": 1e-06,
(WorkerDict pid=108846)   "rope_scaling": null,
(WorkerDict pid=108846)   "rope_theta": 1000000,
(WorkerDict pid=108846)   "sliding_window": null,
(WorkerDict pid=108846)   "tie_word_embeddings": false,
(WorkerDict pid=108846)   "torch_dtype": "bfloat16",
(WorkerDict pid=108846)   "transformers_version": "4.52.4",
(WorkerDict pid=108846)   "use_cache": true,
(WorkerDict pid=108846)   "use_sliding_window": false,
(WorkerDict pid=108846)   "vocab_size": 151936
(WorkerDict pid=108846) }
(WorkerDict pid=108846) 
(pid=108853) WARNING 09-22 10:50:36 [_custom_ops.py:22] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") [repeated 7x across cluster]
(WorkerDict pid=108853) [WARNING]: The memory model path /dev/shm/verl-cache/ab35dd8c922b74d470b9a5d969f8a17e/Qwen3-32B already exists. If it is not you want, please clear it and restart the task. [repeated 7x across cluster]
Loading checkpoint shards:   0%|          | 0/17 [00:00<?, ?it/s] [repeated 7x across cluster]
Loading checkpoint shards:   6%|▌         | 1/17 [00:03<00:57,  3.59s/it] [repeated 14x across cluster]
Loading checkpoint shards:  24%|██▍       | 4/17 [00:15<00:50,  3.89s/it] [repeated 12x across cluster]
Loading checkpoint shards:  29%|██▉       | 5/17 [00:20<00:48,  4.05s/it] [repeated 12x across cluster]
Loading checkpoint shards:  41%|████      | 7/17 [00:26<00:36,  3.63s/it] [repeated 12x across cluster]
Loading checkpoint shards:  53%|█████▎    | 9/17 [00:31<00:27,  3.45s/it] [repeated 12x across cluster]
Loading checkpoint shards:  59%|█████▉    | 10/17 [00:37<00:25,  3.70s/it] [repeated 12x across cluster]
Loading checkpoint shards:  71%|███████   | 12/17 [00:42<00:17,  3.52s/it] [repeated 12x across cluster]
Loading checkpoint shards:  76%|███████▋  | 13/17 [00:47<00:14,  3.58s/it] [repeated 12x across cluster]
Loading checkpoint shards:  88%|████████▊ | 15/17 [00:53<00:06,  3.45s/it] [repeated 13x across cluster]
Loading checkpoint shards:  94%|█████████▍| 16/17 [00:56<00:03,  3.39s/it]
Loading checkpoint shards:  88%|████████▊ | 15/17 [00:52<00:06,  3.37s/it] [repeated 8x across cluster]
(WorkerDict pid=108847) Monkey patch state_dict in AutoModelForCausalLMWithValueHead. 
(WorkerDict pid=108847) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention
(WorkerDict pid=108847) Skipping monkey patch for Qwen3ForCausalLM as use_fused_kernels is False or fused_kernels_backend is torch
Loading checkpoint shards: 100%|██████████| 17/17 [00:59<00:00,  3.49s/it]
(WorkerDict pid=108847) Applying LoRA to actor module
Loading checkpoint shards:  94%|█████████▍| 16/17 [01:00<00:03,  3.58s/it] [repeated 6x across cluster]
Loading checkpoint shards: 100%|██████████| 17/17 [00:58<00:00,  3.42s/it] [repeated 7x across cluster]
(WorkerDict pid=108846) Monkey patch state_dict in AutoModelForCausalLMWithValueHead.  [repeated 7x across cluster]
(WorkerDict pid=108846) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention [repeated 7x across cluster]
(WorkerDict pid=108846) Skipping monkey patch for Qwen3ForCausalLM as use_fused_kernels is False or fused_kernels_backend is torch [repeated 7x across cluster]
(WorkerDict pid=108853) Applying LoRA to actor module [repeated 6x across cluster]
(WorkerDict pid=108846) PeftModelForCausalLM contains 32.83B parameters
(WorkerDict pid=108846) wrap_policy: functools.partial(<function _or_policy at 0xffd01af7af20>, policies=[functools.partial(<function lambda_auto_wrap_policy at 0xffd01af7b6a0>, lambda_fn=<function get_fsdp_wrap_policy.<locals>.lambda_policy_fn at 0xffcf50133600>), functools.partial(<function transformer_auto_wrap_policy at 0xffd01af7b060>, transformer_layer_cls={<class 'transformers.models.qwen3.modeling_qwen3.Qwen3DecoderLayer'>})])
(WorkerDict pid=108847) /bigdata/sde/Verl/verl/verl/utils/profiler/config.py:49: UserWarning: Torch profiler tool config is not fully supported now.
(WorkerDict pid=108847)   warnings.warn("Torch profiler tool config is not fully supported now.", stacklevel=1)
Loading checkpoint shards:  94%|█████████▍| 16/17 [00:55<00:03,  3.33s/it]
(WorkerDict pid=108846) Total steps: 1868, num_warmup_steps: 0
(WorkerDict pid=108846) Actor use_remove_padding=True
(WorkerDict pid=108846) Actor use_fused_kernels=False
(WorkerDict pid=108846) Applying LoRA to actor module
(WorkerDict pid=108848) [WARNING]: The memory model path /dev/shm/verl-cache/ab35dd8c922b74d470b9a5d969f8a17e/Qwen3-32B already exists. If it is not you want, please clear it and restart the task.
(WorkerDict pid=108848) [WARNING]: The memory model path /dev/shm/verl-cache/ab35dd8c922b74d470b9a5d969f8a17e/Qwen3-32B already exists. If it is not you want, please clear it and restart the task.
(WorkerDict pid=108847) WARNING 09-22 10:52:38 [registry.py:401] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
(WorkerDict pid=108847) WARNING 09-22 10:52:38 [registry.py:401] Model architecture Qwen2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2:CustomQwen2ForCausalLM.
(WorkerDict pid=108847) WARNING 09-22 10:52:38 [registry.py:401] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(WorkerDict pid=108847) WARNING 09-22 10:52:38 [registry.py:401] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(WorkerDict pid=108847) WARNING 09-22 10:52:38 [registry.py:401] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
(WorkerDict pid=108847) WARNING 09-22 10:52:38 [registry.py:401] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
(WorkerDict pid=108847) WARNING 09-22 10:52:38 [registry.py:401] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
(WorkerDict pid=108847) WARNING 09-22 10:52:38 [registry.py:401] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
(WorkerDict pid=108848) WARNING 09-22 10:52:54 [utils.py:2737] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xffcf24471910>
(WorkerDict pid=108846) [WARNING]: The memory model path /dev/shm/verl-cache/ab35dd8c922b74d470b9a5d969f8a17e/Qwen3-32B already exists. If it is not you want, please clear it and restart the task. [repeated 22x across cluster]
(WorkerDict pid=108849) WARNING 09-22 10:52:39 [registry.py:401] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP. [repeated 7x across cluster]
(WorkerDict pid=108849) WARNING 09-22 10:52:39 [registry.py:401] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM. [repeated 49x across cluster]
Loading safetensors checkpoint shards:   0% Completed | 0/17 [00:00<?, ?it/s]
(WorkerDict pid=108846) /bigdata/sde/Verl/verl/verl/utils/profiler/config.py:49: UserWarning: Torch profiler tool config is not fully supported now. [repeated 7x across cluster]
(WorkerDict pid=108846)   warnings.warn("Torch profiler tool config is not fully supported now.", stacklevel=1) [repeated 7x across cluster]
Loading safetensors checkpoint shards:   6% Completed | 1/17 [00:00<00:02,  6.10it/s]
Loading safetensors checkpoint shards:  12% Completed | 2/17 [00:00<00:04,  3.60it/s]
Loading safetensors checkpoint shards:  18% Completed | 3/17 [00:00<00:04,  3.19it/s]
Loading safetensors checkpoint shards:  24% Completed | 4/17 [00:01<00:03,  3.67it/s]
Loading safetensors checkpoint shards:  29% Completed | 5/17 [00:01<00:02,  4.41it/s]
Loading safetensors checkpoint shards:  35% Completed | 6/17 [00:01<00:02,  4.41it/s]
Loading safetensors checkpoint shards:  41% Completed | 7/17 [00:01<00:02,  4.45it/s]
Loading safetensors checkpoint shards:  47% Completed | 8/17 [00:01<00:02,  4.43it/s]
Loading safetensors checkpoint shards:  53% Completed | 9/17 [00:02<00:01,  4.58it/s]
Loading safetensors checkpoint shards:  59% Completed | 10/17 [00:02<00:01,  4.68it/s]
Loading safetensors checkpoint shards:  65% Completed | 11/17 [00:02<00:01,  4.76it/s]
Loading safetensors checkpoint shards:  71% Completed | 12/17 [00:02<00:01,  4.78it/s]
Loading safetensors checkpoint shards:  76% Completed | 13/17 [00:02<00:00,  4.82it/s]
Loading safetensors checkpoint shards:  82% Completed | 14/17 [00:03<00:00,  4.89it/s]
Loading safetensors checkpoint shards:  88% Completed | 15/17 [00:03<00:00,  4.99it/s]
Loading safetensors checkpoint shards:  94% Completed | 16/17 [00:03<00:00,  4.84it/s]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:03<00:00,  4.51it/s]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:03<00:00,  4.48it/s]
(WorkerDict pid=108846) 
(WorkerDict pid=108853) /usr/local/python3.11.13/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
(WorkerDict pid=108853)   warnings.warn(
(WorkerDict pid=108853) kwargs: {'n': 1, 'logprobs': 0, 'max_tokens': 512, 'repetition_penalty': 1.0, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
(WorkerDict pid=108850) WARNING 09-22 10:52:55 [utils.py:2737] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xffcf0079d110> [repeated 7x across cluster]
(TaskRunner pid=91866) Checkpoint tracker file does not exist: /bigdata/sde/Verl/verl/ckpts/GRPO-Qwen3/GRPO-Qwen3-32B-npu-lora/latest_checkpointed_iteration.txt
(TaskRunner pid=91866) Training from scratch
(TaskRunner pid=91866) test_gen_batch meta info: {'eos_token_id': 151645, 'pad_token_id': 151643, 'recompute_log_prob': False, 'do_sample': False, 'validate': True, 'global_steps': 0}
(WorkerDict pid=108852) kwargs: {'n': 1, 'logprobs': 0, 'max_tokens': 512, 'repetition_penalty': 1.0, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 7x across cluster]
(WorkerDict pid=108851) WARNING 09-22 10:53:31 [tokenizer.py:295] No tokenizer found in /simon-stub-path, using base model tokenizer instead. (Exception: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '/simon-stub-path'.)

The log has not refreshed for over an hour: the run hangs after the rollout workers start, with the tokenizer warning above as the last output. Per the npu-smi output in System Info, all eight rayWorkerDict processes are still alive and holding HBM. Please help me check what the issue is. Thank you.
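
If stack traces would help, the hung workers can be inspected while the job is stalled. This is a diagnostic sketch (not yet run): ray stack wraps py-spy over all Ray processes on the node, and py-spy can also target a single rayWorkerDict PID from the npu-smi output above.

pip install py-spy            # ray stack depends on py-spy
ray stack > ray_stack.txt     # dump Python stacks of every Ray process on this node
py-spy dump --pid 108846      # or inspect one rayWorkerDict process directly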

Expected behavior

Validation generation should complete and training should proceed past step 0, instead of the run hanging with no further log output. A standalone smoke test of the rollout engine is sketched below.
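
As an isolation step, the rollout engine could be exercised outside verl/Ray with the same model path, TP degree, context length, and memory fraction as the failing run. This is a minimal sketch, assuming vllm-ascend 0.9.1 serves the standard vLLM offline API (LLM / SamplingParams); enforce_eager=True is a hypothetical simplification to skip graph capture and is not taken from the run above.

python3 - <<'EOF'
from vllm import LLM, SamplingParams

# Same model path, TP degree, context length, and memory fraction as the failing run.
llm = LLM(
    model="/bigdata/sdb/Qwen3-32B",
    tensor_parallel_size=8,
    max_model_len=2560,
    gpu_memory_utilization=0.2,
    enforce_eager=True,  # hypothetical simplification: skip graph capture
)
outs = llm.generate(["1 + 1 ="], SamplingParams(temperature=0.0, max_tokens=16))
print(outs[0].outputs[0].text)
EOF

If this completes, the hang is more likely in the verl/Ray integration (for example the FSDP-to-vLLM weight sync or a stuck collective) than in vllm-ascend itself.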
