
Conversation

PeterSH6
Collaborator

  • As titled

@PeterSH6 PeterSH6 requested a review from vermouth1992 October 31, 2024 12:55
@vermouth1992 vermouth1992 merged commit 0743547 into main Oct 31, 2024
@vermouth1992 vermouth1992 deleted the gm/doc branch October 31, 2024 12:58
ETOgaosion pushed a commit to ETOgaosion/verl that referenced this pull request Feb 28, 2025
add CI tests and fix some import bugs
ETOgaosion added a commit to ETOgaosion/verl that referenced this pull request Mar 26, 2025
eric-haibin-lin added a commit that referenced this pull request Apr 2, 2025
Reverts #706 temporarily as it breaks CI 

https://github.com/volcengine/verl/actions/runs/14220739954/attempts/2

```
(TaskRunner pid=10086) 'Initial validation metrics: {}'
(TaskRunner pid=10086) step:0
(TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[]
(TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True}
(TaskRunner pid=10086) validation generation end
(TaskRunner pid=10086) [prompt] You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer
(TaskRunner pid=10086) ### Instruction:
(TaskRunner pid=10086) 
Training Progress:  33%|███▎      | 1/3 [02:39<05:18, 159.11s/it]
(WorkerDict pid=18977) /root/miniconda3/lib/python3.10/site-packages/torch/autograd/graph.py:768: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [repeated 7x across cluster]
(WorkerDict pid=18977)   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass [repeated 7x across cluster]
(TaskRunner pid=10086) 
Training Progress:  33%|███▎      | 1/3 [04:51<09:43, 291.93s/it]
(WorkerDict pid=18980) [rank4]:[E402 16:49:38.988158820 ProcessGroupNCCL.cpp:1515] [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=18980) 
(WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
(WorkerDict pid=18980) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame #7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6)
(WorkerDict pid=18980) frame #8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) frame #9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) 
(WorkerDict pid=18980) [2025-04-02 16:49:38,666 E 18980 20767] logging.cc:97: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=18980) 
(WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
(WorkerDict pid=18980) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame #7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6)
(WorkerDict pid=18980) frame #8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) frame #9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) 
(WorkerDict pid=18980) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
(WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=18980) frame #1: <unknown function> + 0xe1a5e4 (0x7fc6924625e4 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame #2: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6)
(WorkerDict pid=18980) frame #3: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) frame #4: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) 
(WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:104: Stack trace: 
(WorkerDict pid=18980)  /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe543a) [0x7fc9fe5a143a] ray::operator<<()
(WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe7b78) [0x7fc9fe5a3b78] ray::TerminateHandler()
(WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fc9fd44d35a] __cxxabiv1::__terminate()
(WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fc9fd44d3c5]
(WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fc9fd44d34f]
(WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xe1a695) [0x7fc692462695] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fc9fd477bf4] execute_native_thread_routine
(WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc9ff2f0ac3]
(WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7fc9ff381a04] __clone
(WorkerDict pid=18980) 
(WorkerDict pid=18980) *** SIGABRT received at time=1743612578 on cpu 118 ***
(WorkerDict pid=18980) PC: @     0x7fc9ff2f29fc  (unknown)  pthread_kill
(WorkerDict pid=18980)     @     0x7fc9ff29e520  (unknown)  (unknown)
(WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: *** SIGABRT received at time=1743612578 on cpu 118 ***
(WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: PC: @     0x7fc9ff2f29fc  (unknown)  pthread_kill
(WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361:     @     0x7fc9ff29e520  (unknown)  (unknown)
(WorkerDict pid=18980) Fatal Python error: Aborted
(WorkerDict pid=18980) 
(WorkerDict pid=18980) 
(WorkerDict pid=18980) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, zstandard.backend_c, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, markupsafe._speedups, PIL._imaging, msgspec._core, sentencepiece._sentencepiece, PIL._imagingft, regex._regex, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pyarrow._json, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils (total: 96)
Error executing job with overrides: ['algorithm.adv_estimator=gae', 'data.train_files=/github/home/data/gsm8k/train.parquet', 'data.val_files=/github/home/data/gsm8k/test.parquet', 'data.train_batch_size=1024', 'data.max_prompt_length=512', 'data.max_response_length=512', 'actor_rollout_ref.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'actor_rollout_ref.actor.optim.lr=2e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4', 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.actor.use_kl_loss=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16', 'actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2', 'critic.optim.lr=2e-5', 'critic.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'critic.model.enable_gradient_checkpointing=False', 'critic.ppo_micro_batch_size_per_gpu=4', 'critic.megatron.pipeline_model_parallel_size=2', 'critic.megatron.virtual_pipeline_model_parallel_size=2', 'critic.megatron.tensor_model_parallel_size=2', 'algorithm.use_kl_in_reward=True', 'algorithm.kl_penalty=kl', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console]', 'trainer.project_name=verl_megatron_gsm8k_examples', 'trainer.experiment_name=deepseek_llm_1b3_function_rm', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=1', 'trainer.total_epochs=15', 'trainer.total_training_steps=3']
(TaskRunner pid=10086) Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer after "####".
(TaskRunner pid=10086) ### Response:
(TaskRunner pid=10086) 
(TaskRunner pid=10086) [response] I'm sorry, but as an AI programming assistant, I'm specialized in answering questions related to computer science. I'm not equipped to provide answers to questions about economics or business calculations. I recommend using a calculator or a business-oriented tool for this type of question.
(TaskRunner pid=10086) 
(TaskRunner pid=10086) [ground_truth] 18
(TaskRunner pid=10086) [score] 0.0
(TaskRunner pid=10086) step:1 - global_seqlen/min:48635.000 - global_seqlen/max:51694.000 - global_seqlen/minmax_diff:3059.000 - global_seqlen/balanced_min:49636.000 - global_seqlen/balanced_max:49637.000 - global_seqlen/mean:49636.125 - actor/reward_kl_penalty:0.000 - actor/reward_kl_penalty_coeff:0.001 - critic/vf_loss:0.015 - critic/vf_clipfrac:0.001 - critic/vpred_mean:0.007 - perf/mfu/critic:0.105 - actor/entropy_loss:0.550 - actor/pg_loss:-0.000 - actor/pg_clipfrac:0.018 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - perf/mfu/actor:0.106 - critic/score/mean:0.000 - critic/score/max:0.000 - critic/score/min:0.000 - critic/rewards/mean:0.000 - critic/rewards/max:0.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.000 - critic/advantages/max:4.994 - critic/advantages/min:-5.666 - critic/returns/mean:-0.000 - critic/returns/max:0.000 - critic/returns/min:-0.000 - critic/values/mean:-0.164 - critic/values/max:0.785 - critic/values/min:-1.000 - critic/vf_explained_var:-2803.085 - response_length/mean:239.112 - response_length/max:512.000 - response_length/min:11.000 - response_length/clip_ratio:0.029 - prompt_length/mean:148.670 - prompt_length/max:275.000 - prompt_length/min:106.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:18.608 - timing_s/old_log_prob:15.249 - timing_s/ref:14.488 - timing_s/values:16.315 - timing_s/adv:0.264 - timing_s/update_critic:33.651 - timing_s/update_actor:33.472 - timing_s/testing:25.497 - timing_s/step:157.587 - timing_per_token_ms/adv:0.001 - timing_per_token_ms/gen:0.076 - timing_per_token_ms/update_actor:0.084 - timing_per_token_ms/values:0.041 - timing_per_token_ms/update_critic:0.085 - timing_per_token_ms/ref:0.036 - perf/total_num_tokens:397089.000 - perf/time_per_step:157.587 - perf/throughput:314.976
(TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[]
(TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True}
(WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
(WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] 
Traceback (most recent call last):
  File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 54, in main
    run_ppo(config)
  File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 72, in run_ppo
    ray.get(runner.run.remote(config))
  File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2667, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 864, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=10086, ip=172.20.0.2, actor_id=11bc451866f5759f3a7f540501000000, repr=<main_ppo.TaskRunner object at 0x7fd00c61a110>)
  File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 184, in run
    trainer.fit()
  File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 950, in fit
    val_metrics: dict = self._validate()
  File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 545, in _validate
    test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded)
  File "/data00/tiger/huggingface/verl/verl/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=18980, ip=172.20.0.2, actor_id=4f21075809bd462a5907ebea01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7fc62ae1ce20>)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1708, in execute_model
    output: SamplerOutput = self.model.sample(
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 571, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 231, in forward
    self._init_sampling_tensors(logits, sampling_metadata)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 195, in _init_sampling_tensors
    do_min_p) = SamplingTensors.from_sampling_metadata(
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 471, in from_sampling_metadata
    sampling_tensors = SamplingTensors.from_lists(
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 529, in from_lists
    temperatures_t = torch.tensor(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
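As the log itself suggests, the quickest way to localize an asynchronously reported illegal memory access like this is to re-run with synchronous CUDA error reporting. Below is a minimal sketch only, not the exact CI command: it assumes the `verl.trainer.main_ppo` entry point shown in the traceback and reuses just a handful of the overrides listed in the error output above.

```
# Sketch, assuming the entry point from the traceback; the real CI job passes
# the full override list shown in the "Error executing job with overrides" line.
# CUDA_LAUNCH_BLOCKING=1 makes the illegal memory access surface at the
# offending kernel launch instead of at a later, unrelated API call.
export CUDA_LAUNCH_BLOCKING=1
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.total_training_steps=3
```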
zyzshishui referenced this pull request in zyzshishui/verl Apr 14, 2025
support loss_mask and loading tool from config
yyu6969 pushed a commit to yyu6969/verl that referenced this pull request Apr 16, 2025
yuchenwang3 pushed a commit to yuchenwang3/verl that referenced this pull request Apr 25, 2025
* [doc] fix: delete deprecated element in config doc

* update readme to fix url
yuchenwang3 pushed a commit to yuchenwang3/verl that referenced this pull request Apr 25, 2025
Reverts volcengine#706 temporarily as it breaks CI
histmeisah referenced this pull request in SJTU-IAAR/verl Apr 27, 2025
Reverts volcengine#706 temporarily as it breaks CI 

https://github.com/volcengine/verl/actions/runs/14220739954/attempts/2

```
(TaskRunner pid=10086) 'Initial validation metrics: {}'
(TaskRunner pid=10086) step:0
(TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[]
(TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True}
(TaskRunner pid=10086) validation generation end
(TaskRunner pid=10086) [prompt] You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer
(TaskRunner pid=10086) ### Instruction:
(TaskRunner pid=10086) 
Training Progress:  33%|███▎      | 1/3 [02:39<05:18, 159.11s/it]
(WorkerDict pid=18977) /root/miniconda3/lib/python3.10/site-packages/torch/autograd/graph.py:768: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [repeated 7x across cluster]
(WorkerDict pid=18977)   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass [repeated 7x across cluster]
(TaskRunner pid=10086) 
Training Progress:  33%|███▎      | 1/3 [04:51<09:43, 291.93s/it]
(WorkerDict pid=18980) [rank4]:[E402 16:49:38.988158820 ProcessGroupNCCL.cpp:1515] [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=18980) 
(WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
(WorkerDict pid=18980) frame volcengine#3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame volcengine#4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame volcengine#5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame volcengine#6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame volcengine#7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6)
(WorkerDict pid=18980) frame volcengine#8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) frame volcengine#9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) 
(WorkerDict pid=18980) [2025-04-02 16:49:38,666 E 18980 20767] logging.cc:97: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=18980) 
(WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
(WorkerDict pid=18980) frame volcengine#3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame volcengine#4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame volcengine#5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame volcengine#6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame volcengine#7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6)
(WorkerDict pid=18980) frame volcengine#8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) frame volcengine#9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) 
(WorkerDict pid=18980) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
(WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=18980) frame #1: <unknown function> + 0xe1a5e4 (0x7fc6924625e4 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(WorkerDict pid=18980) frame #2: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6)
(WorkerDict pid=18980) frame volcengine#3: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) frame volcengine#4: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=18980) 
(WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:104: Stack trace: 
(WorkerDict pid=18980)  /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe543a) [0x7fc9fe5a143a] ray::operator<<()
(WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe7b78) [0x7fc9fe5a3b78] ray::TerminateHandler()
(WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fc9fd44d35a] __cxxabiv1::__terminate()
(WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fc9fd44d3c5]
(WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fc9fd44d34f]
(WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xe1a695) [0x7fc692462695] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fc9fd477bf4] execute_native_thread_routine
(WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc9ff2f0ac3]
(WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7fc9ff381a04] __clone
(WorkerDict pid=18980) 
(WorkerDict pid=18980) *** SIGABRT received at time=1743612578 on cpu 118 ***
(WorkerDict pid=18980) PC: @     0x7fc9ff2f29fc  (unknown)  pthread_kill
(WorkerDict pid=18980)     @     0x7fc9ff29e520  (unknown)  (unknown)
(WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: *** SIGABRT received at time=1743612578 on cpu 118 ***
(WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: PC: @     0x7fc9ff2f29fc  (unknown)  pthread_kill
(WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361:     @     0x7fc9ff29e520  (unknown)  (unknown)
(WorkerDict pid=18980) Fatal Python error: Aborted
(WorkerDict pid=18980) 
(WorkerDict pid=18980) 
(WorkerDict pid=18980) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, zstandard.backend_c, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, markupsafe._speedups, PIL._imaging, msgspec._core, sentencepiece._sentencepiece, PIL._imagingft, regex._regex, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pyarrow._json, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils (total: 96)
Error executing job with overrides: ['algorithm.adv_estimator=gae', 'data.train_files=/github/home/data/gsm8k/train.parquet', 'data.val_files=/github/home/data/gsm8k/test.parquet', 'data.train_batch_size=1024', 'data.max_prompt_length=512', 'data.max_response_length=512', 'actor_rollout_ref.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'actor_rollout_ref.actor.optim.lr=2e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4', 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.actor.use_kl_loss=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16', 'actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2', 'critic.optim.lr=2e-5', 'critic.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'critic.model.enable_gradient_checkpointing=False', 'critic.ppo_micro_batch_size_per_gpu=4', 'critic.megatron.pipeline_model_parallel_size=2', 'critic.megatron.virtual_pipeline_model_parallel_size=2', 'critic.megatron.tensor_model_parallel_size=2', 'algorithm.use_kl_in_reward=True', 'algorithm.kl_penalty=kl', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console]', 'trainer.project_name=verl_megatron_gsm8k_examples', 'trainer.experiment_name=deepseek_llm_1b3_function_rm', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=1', 'trainer.total_epochs=15', 'trainer.total_training_steps=3']
(TaskRunner pid=10086) Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer after "####".
(TaskRunner pid=10086) ### Response:
(TaskRunner pid=10086) 
(TaskRunner pid=10086) [response] I'm sorry, but as an AI programming assistant, I'm specialized in answering questions related to computer science. I'm not equipped to provide answers to questions about economics or business calculations. I recommend using a calculator or a business-oriented tool for this type of question.
(TaskRunner pid=10086) 
(TaskRunner pid=10086) [ground_truth] 18
(TaskRunner pid=10086) [score] 0.0
(TaskRunner pid=10086) step:1 - global_seqlen/min:48635.000 - global_seqlen/max:51694.000 - global_seqlen/minmax_diff:3059.000 - global_seqlen/balanced_min:49636.000 - global_seqlen/balanced_max:49637.000 - global_seqlen/mean:49636.125 - actor/reward_kl_penalty:0.000 - actor/reward_kl_penalty_coeff:0.001 - critic/vf_loss:0.015 - critic/vf_clipfrac:0.001 - critic/vpred_mean:0.007 - perf/mfu/critic:0.105 - actor/entropy_loss:0.550 - actor/pg_loss:-0.000 - actor/pg_clipfrac:0.018 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - perf/mfu/actor:0.106 - critic/score/mean:0.000 - critic/score/max:0.000 - critic/score/min:0.000 - critic/rewards/mean:0.000 - critic/rewards/max:0.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.000 - critic/advantages/max:4.994 - critic/advantages/min:-5.666 - critic/returns/mean:-0.000 - critic/returns/max:0.000 - critic/returns/min:-0.000 - critic/values/mean:-0.164 - critic/values/max:0.785 - critic/values/min:-1.000 - critic/vf_explained_var:-2803.085 - response_length/mean:239.112 - response_length/max:512.000 - response_length/min:11.000 - response_length/clip_ratio:0.029 - prompt_length/mean:148.670 - prompt_length/max:275.000 - prompt_length/min:106.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:18.608 - timing_s/old_log_prob:15.249 - timing_s/ref:14.488 - timing_s/values:16.315 - timing_s/adv:0.264 - timing_s/update_critic:33.651 - timing_s/update_actor:33.472 - timing_s/testing:25.497 - timing_s/step:157.587 - timing_per_token_ms/adv:0.001 - timing_per_token_ms/gen:0.076 - timing_per_token_ms/update_actor:0.084 - timing_per_token_ms/values:0.041 - timing_per_token_ms/update_critic:0.085 - timing_per_token_ms/ref:0.036 - perf/total_num_tokens:397089.000 - perf/time_per_step:157.587 - perf/throughput:314.976
(TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[]
(TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True}
(WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
(WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] 
Traceback (most recent call last):
  File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 54, in main
    run_ppo(config)
  File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 72, in run_ppo
    ray.get(runner.run.remote(config))
  File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2667, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 864, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=10086, ip=172.20.0.2, actor_id=11bc451866f5759f3a7f540501000000, repr=<main_ppo.TaskRunner object at 0x7fd00c61a110>)
  File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 184, in run
    trainer.fit()
  File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 950, in fit
    val_metrics: dict = self._validate()
  File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 545, in _validate
    test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded)
  File "/data00/tiger/huggingface/verl/verl/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=18980, ip=172.20.0.2, actor_id=4f21075809bd462a5907ebea01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7fc62ae1ce20>)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1708, in execute_model
    output: SamplerOutput = self.model.sample(
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 571, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 231, in forward
    self._init_sampling_tensors(logits, sampling_metadata)
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 195, in _init_sampling_tensors
    do_min_p) = SamplingTensors.from_sampling_metadata(
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 471, in from_sampling_metadata
    sampling_tensors = SamplingTensors.from_lists(
  File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 529, in from_lists
    temperatures_t = torch.tensor(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
HyperdriveHustle referenced this pull request in HyperdriveHustle/verl May 23, 2025
Reverts volcengine#706 temporarily as it breaks CI 

vermouth1992 pushed a commit that referenced this pull request May 27, 2025
Co-authored-by: Bihan  Rana <[email protected]>
Co-authored-by: peterschmidt85 <[email protected]>
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
Co-authored-by: Bihan  Rana <[email protected]>
Co-authored-by: peterschmidt85 <[email protected]>
wuxibin89 pushed a commit that referenced this pull request Jul 7, 2025
### What does this PR do?

Fix a regression from #1911:
that PR did not change the sglang async branch.

CI did not catch this error because it only ran 1 step, while the error
happens at the second step, so I updated the test cases to run 2 steps.

To reproduce the bug, run:
TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async bash
tests/special_e2e/ppo_trainer/run_function_reward.sh

It fails with:
```
(WorkerDict pid=1257286) Total steps: 2, num_warmup_steps: 0
(WorkerDict pid=1257286) Actor use_remove_padding=True
(WorkerDict pid=1257286) Actor use_fused_kernels=False
(AsyncSglangServer pid=1260392) FastAPI listen on 192.168.111.48:40451
(WorkerDict pid=1257286) terminate called after throwing an instance of 'c10::Error'
(WorkerDict pid=1257286)   what():  CUDA error: an illegal memory access was encountered
(WorkerDict pid=1257286) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=1257286) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=1257286) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=1257286)
(WorkerDict pid=1257286) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=1257286) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbf6036c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbf60315a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbf6080d918 in
```
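
For local triage, the reproduce command above can be combined with synchronous kernel launches so the stack trace points at the failing call, as the error message itself suggests. This is only a convenience sketch based on the command and hint quoted above, not part of the original fix:

```bash
# Reproduce the sglang async regression (command taken from the PR description).
# CUDA_LAUNCH_BLOCKING=1 serializes CUDA launches so the illegal-memory-access
# report surfaces at the offending call rather than at a later API call.
CUDA_LAUNCH_BLOCKING=1 TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async \
  bash tests/special_e2e/ppo_trainer/run_function_reward.sh
```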



### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here:
https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20an%20illegal%20memory%20access%20was%20encountered
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test
```
(TaskRunner pid=1647269) step:2 - global_seqlen/min:13075 - global_seqlen/max:14837 - global_seqlen/minmax_diff:1762 - global_seqlen/balanced_min:14231 - global_seqlen/balanced_max:14232 - global_seqlen/mean:14231.5 - actor/entropy:2.0606913566589355 - critic/vf_loss:8.7157882153
```
### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jul 7, 2025
…engine#2365)

lkc233 pushed a commit to lkc233/verl that referenced this pull request Jul 10, 2025
…engine#2365)

ArronHZG referenced this pull request in imh966/verl Jul 10, 2025
…engine#2365)

eric-haibin-lin pushed a commit that referenced this pull request Jul 17, 2025
[trainer, fsdp, vllm, recipe] feat: one step off async training recipe e2e
pillumina pushed a commit to pillumina/verl that referenced this pull request Jul 24, 2025
Add a new runtime.yaml and check for timeouts that may be caused by uneven generation lengths across multiple DP inference ranks
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
…engine#2365)

snie2012 pushed a commit to snie2012/verl that referenced this pull request Aug 4, 2025
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
…engine#2365)

whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025