[misc] fix: resolve pypi missing directory #3

PeterSH6 · 2024-11-01T05:01:48Z

Add an init file (`__init__.py`) to the Megatron model directory so that it is no longer missing from the PyPI package
Need to delete release v0.1.1 and revert to v0.1
Fix the torchrun issue
Lint the files
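
For context on the first item: setuptools' `find_packages()` only picks up directories that contain an `__init__.py`, so a sub-directory without one is left out of the built distribution. The sketch below is a stdlib-only check for such directories. The layout `pkg/models/megatron` and both helper names are hypothetical, and whether this project's `setup.py` relies on `find_packages()` is an assumption.

```python
import os
import tempfile


def dirs_missing_init(root):
    """Return sub-directories of `root` that hold .py files but no __init__.py."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if dirpath == root:
            continue  # the project root itself is not a package
        has_py = any(name.endswith(".py") for name in filenames)
        if has_py and "__init__.py" not in filenames:
            missing.append(os.path.relpath(dirpath, root).replace(os.sep, "/"))
    return sorted(missing)


def make_demo_tree(root):
    """Build a hypothetical layout in which pkg/models/megatron lacks __init__.py."""
    os.makedirs(os.path.join(root, "pkg", "models", "megatron"))
    for rel in (
        "pkg/__init__.py",
        "pkg/models/__init__.py",
        "pkg/models/megatron/layers.py",
    ):
        with open(os.path.join(root, *rel.split("/")), "w") as handle:
            handle.write("")


with tempfile.TemporaryDirectory() as tmp:
    make_demo_tree(tmp)
    print(dirs_missing_init(tmp))
```

Running a check like this before building a release would flag the directory that needs an init file.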

…un issue

* add tp support to torch's ops, for correctness checking
* removed redundant d_scale
* add tp along vocab_size dimension
* merge accumulate & entropy_b to one buffer
* add dedicated stream for overlapping _logprobs
* format
* update test api

Signed-off-by: Jianbing Dong <[email protected]>
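
As an illustration of what "tp along vocab_size dimension" can mean for a log-probability computation, here is a minimal pure-Python sketch. It is not the kernel from this commit. It assumes each tensor-parallel rank holds a contiguous slice of the logits over the vocabulary, and it replaces the all-reduce steps with plain Python reductions. All function names are hypothetical.

```python
import math


def shard_stats(logits_shard):
    """Per-rank partial results: local max and local sum of exp(x - local max)."""
    local_max = max(logits_shard)
    local_sum = sum(math.exp(x - local_max) for x in logits_shard)
    return local_max, local_sum


def sharded_logprob(shards, target_index):
    """Log-probability of `target_index` with the vocabulary split across shards."""
    stats = [shard_stats(shard) for shard in shards]
    # Stand-in for an all-reduce(max) across tensor-parallel ranks.
    global_max = max(m for m, _ in stats)
    # Stand-in for an all-reduce(sum) after rescaling each rank to the global max.
    global_sum = sum(s * math.exp(m - global_max) for m, s in stats)
    log_z = global_max + math.log(global_sum)
    # Only the shard that owns the target token contributes its logit.
    offset = 0
    for shard in shards:
        if target_index < offset + len(shard):
            return shard[target_index - offset] - log_z
        offset += len(shard)
    raise IndexError("target_index outside vocabulary")


def reference_logprob(logits, target_index):
    """Unsharded log-softmax, used only to check the sharded result."""
    m = max(logits)
    return logits[target_index] - (m + math.log(sum(math.exp(x - m) for x in logits)))


logits = [0.5, -1.0, 2.0, 3.5, 0.0, -2.5]
shards = [logits[:3], logits[3:]]
for t in range(len(logits)):
    assert abs(sharded_logprob(shards, t) - reference_logprob(logits, t)) < 1e-12
```

The comparison against an unsharded reference mirrors the "for correctness checking" purpose named in the first item.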

Reverts #706 temporarily as it breaks CI https://github.com/volcengine/verl/actions/runs/14220739954/attempts/2 ``` (TaskRunner pid=10086) 'Initial validation metrics: {}' (TaskRunner pid=10086) step:0 (TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[] (TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True} (TaskRunner pid=10086) validation generation end (TaskRunner pid=10086) [prompt] You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer (TaskRunner pid=10086) ### Instruction: (TaskRunner pid=10086) Training Progress: 33%|███▎ | 1/3 [02:39<05:18, 159.11s/it] (WorkerDict pid=18977) /root/miniconda3/lib/python3.10/site-packages/torch/autograd/graph.py:768: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) 
[repeated 7x across cluster] (WorkerDict pid=18977) return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [repeated 7x across cluster] (TaskRunner pid=10086) Training Progress: 33%|███▎ | 1/3 [04:51<09:43, 291.93s/it] (WorkerDict pid=18980) [rank4]:[E402 16:49:38.988158820 ProcessGroupNCCL.cpp:1515] [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. (WorkerDict pid=18980) (WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=18980) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #5: 
c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame #8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame #9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) [2025-04-02 16:49:38,666 E 18980 20767] logging.cc:97: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 
(WorkerDict pid=18980) (WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=18980) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame #8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame #9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 
(most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: <unknown function> + 0xe1a5e4 (0x7fc6924625e4 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #2: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame #3: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame #4: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:104: Stack trace: (WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe543a) [0x7fc9fe5a143a] ray::operator<<() (WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe7b78) [0x7fc9fe5a3b78] ray::TerminateHandler() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fc9fd44d35a] __cxxabiv1::__terminate() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fc9fd44d3c5] (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fc9fd44d34f] (WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xe1a695) [0x7fc692462695] c10d::ProcessGroupNCCL::ncclCommWatchdog() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fc9fd477bf4] execute_native_thread_routine (WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc9ff2f0ac3] (WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7fc9ff381a04] __clone (WorkerDict pid=18980) (WorkerDict pid=18980) *** SIGABRT received at time=1743612578 on cpu 118 *** (WorkerDict pid=18980) PC: @ 
0x7fc9ff2f29fc (unknown) pthread_kill (WorkerDict pid=18980) @ 0x7fc9ff29e520 (unknown) (unknown) (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: *** SIGABRT received at time=1743612578 on cpu 118 *** (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: PC: @ 0x7fc9ff2f29fc (unknown) pthread_kill (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: @ 0x7fc9ff29e520 (unknown) (unknown) (WorkerDict pid=18980) Fatal Python error: Aborted (WorkerDict pid=18980) (WorkerDict pid=18980) (WorkerDict pid=18980) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, zstandard.backend_c, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, 
pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, markupsafe._speedups, PIL._imaging, msgspec._core, sentencepiece._sentencepiece, PIL._imagingft, regex._regex, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pyarrow._json, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils (total: 96) Error executing job with overrides: ['algorithm.adv_estimator=gae', 'data.train_files=/github/home/data/gsm8k/train.parquet', 'data.val_files=/github/home/data/gsm8k/test.parquet', 'data.train_batch_size=1024', 'data.max_prompt_length=512', 'data.max_response_length=512', 'actor_rollout_ref.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'actor_rollout_ref.actor.optim.lr=2e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4', 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.actor.use_kl_loss=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16', 'actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2', 
'actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2', 'critic.optim.lr=2e-5', 'critic.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'critic.model.enable_gradient_checkpointing=False', 'critic.ppo_micro_batch_size_per_gpu=4', 'critic.megatron.pipeline_model_parallel_size=2', 'critic.megatron.virtual_pipeline_model_parallel_size=2', 'critic.megatron.tensor_model_parallel_size=2', 'algorithm.use_kl_in_reward=True', 'algorithm.kl_penalty=kl', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console]', 'trainer.project_name=verl_megatron_gsm8k_examples', 'trainer.experiment_name=deepseek_llm_1b3_function_rm', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=1', 'trainer.total_epochs=15', 'trainer.total_training_steps=3'] (TaskRunner pid=10086) Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer after "####". (TaskRunner pid=10086) ### Response: (TaskRunner pid=10086) (TaskRunner pid=10086) [response] I'm sorry, but as an AI programming assistant, I'm specialized in answering questions related to computer science. I'm not equipped to provide answers to questions about economics or business calculations. I recommend using a calculator or a business-oriented tool for this type of question. 
(TaskRunner pid=10086) (TaskRunner pid=10086) [ground_truth] 18 (TaskRunner pid=10086) [score] 0.0 (TaskRunner pid=10086) step:1 - global_seqlen/min:48635.000 - global_seqlen/max:51694.000 - global_seqlen/minmax_diff:3059.000 - global_seqlen/balanced_min:49636.000 - global_seqlen/balanced_max:49637.000 - global_seqlen/mean:49636.125 - actor/reward_kl_penalty:0.000 - actor/reward_kl_penalty_coeff:0.001 - critic/vf_loss:0.015 - critic/vf_clipfrac:0.001 - critic/vpred_mean:0.007 - perf/mfu/critic:0.105 - actor/entropy_loss:0.550 - actor/pg_loss:-0.000 - actor/pg_clipfrac:0.018 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - perf/mfu/actor:0.106 - critic/score/mean:0.000 - critic/score/max:0.000 - critic/score/min:0.000 - critic/rewards/mean:0.000 - critic/rewards/max:0.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.000 - critic/advantages/max:4.994 - critic/advantages/min:-5.666 - critic/returns/mean:-0.000 - critic/returns/max:0.000 - critic/returns/min:-0.000 - critic/values/mean:-0.164 - critic/values/max:0.785 - critic/values/min:-1.000 - critic/vf_explained_var:-2803.085 - response_length/mean:239.112 - response_length/max:512.000 - response_length/min:11.000 - response_length/clip_ratio:0.029 - prompt_length/mean:148.670 - prompt_length/max:275.000 - prompt_length/min:106.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:18.608 - timing_s/old_log_prob:15.249 - timing_s/ref:14.488 - timing_s/values:16.315 - timing_s/adv:0.264 - timing_s/update_critic:33.651 - timing_s/update_actor:33.472 - timing_s/testing:25.497 - timing_s/step:157.587 - timing_per_token_ms/adv:0.001 - timing_per_token_ms/gen:0.076 - timing_per_token_ms/update_actor:0.084 - timing_per_token_ms/values:0.041 - timing_per_token_ms/update_critic:0.085 - timing_per_token_ms/ref:0.036 - 
perf/total_num_tokens:397089.000 - perf/time_per_step:157.587 - perf/throughput:314.976 (TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[] (TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True} (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Traceback (most recent call last): File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 54, in main run_ppo(config) File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 72, in run_ppo ray.get(runner.run.remote(config)) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2667, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 864, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=10086, ip=172.20.0.2, 
actor_id=11bc451866f5759f3a7f540501000000, repr=<main_ppo.TaskRunner object at 0x7fd00c61a110>) File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 184, in run trainer.fit() File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 950, in fit val_metrics: dict = self._validate() File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 545, in _validate test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded) File "/data00/tiger/huggingface/verl/verl/verl/single_controller/ray/base.py", line 42, in func output = ray.get(output) ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=18980, ip=172.20.0.2, actor_id=4f21075809bd462a5907ebea01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7fc62ae1ce20>) File "/root/miniconda3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1708, in execute_model output: SamplerOutput = self.model.sample( File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 571, in sample next_tokens = self.sampler(logits, sampling_metadata) File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 231, in forward self._init_sampling_tensors(logits, sampling_metadata) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 195, in _init_sampling_tensors do_min_p) = SamplingTensors.from_sampling_metadata( File 
"/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 471, in from_sampling_metadata sampling_tensors = SamplingTensors.from_lists( File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 529, in from_lists temperatures_t = torch.tensor( RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. ```

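The log above notes that CUDA errors may be reported asynchronously, so the Python frame shown in the traceback is not necessarily where the fault occurred, and it suggests `CUDA_LAUNCH_BLOCKING=1`. A minimal sketch of re-running a command with that variable set is below. The helper names are hypothetical, and the command in the `__main__` guard is a placeholder that only echoes the variable, so the sketch runs without a GPU.

```python
import os
import subprocess
import sys


def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_with_blocking(command):
    """Run `command` (a list of arguments) with CUDA_LAUNCH_BLOCKING=1 set."""
    return subprocess.run(command, env=blocking_cuda_env(), check=False)


if __name__ == "__main__":
    # Placeholder command; substitute the failing training command here.
    rerun_with_blocking(
        [sys.executable, "-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
    )
```

The variable has to be in the environment before the process initializes CUDA, which is why it is passed to the child process rather than set after startup.
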
* [misc] fix: fix pypi package missing of megatron model and fix torchrun issue
* lint
* fix sft script
* update version

Reverts volcengine#706 temporarily as it breaks CI https://github.com/volcengine/verl/actions/runs/14220739954/attempts/2 ``` (TaskRunner pid=10086) 'Initial validation metrics: {}' (TaskRunner pid=10086) step:0 (TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[] (TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True} (TaskRunner pid=10086) validation generation end (TaskRunner pid=10086) [prompt] You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer (TaskRunner pid=10086) ### Instruction: (TaskRunner pid=10086) Training Progress: 33%|███▎ | 1/3 [02:39<05:18, 159.11s/it] (WorkerDict pid=18977) /root/miniconda3/lib/python3.10/site-packages/torch/autograd/graph.py:768: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) 
[repeated 7x across cluster] (WorkerDict pid=18977) return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [repeated 7x across cluster] (TaskRunner pid=10086) Training Progress: 33%|███▎ | 1/3 [04:51<09:43, 291.93s/it] (WorkerDict pid=18980) [rank4]:[E402 16:49:38.988158820 ProcessGroupNCCL.cpp:1515] [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. (WorkerDict pid=18980) (WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=18980) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) 
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame #8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame #9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) [2025-04-02 16:49:38,666 E 18980 20767] logging.cc:97: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 
(WorkerDict pid=18980) (WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=18980) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame #8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame #9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) Exception 
raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: <unknown function> + 0xe1a5e4 (0x7fc6924625e4 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #2: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame #3: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame #4: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:104: Stack trace: (WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe543a) [0x7fc9fe5a143a] ray::operator<<() (WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe7b78) [0x7fc9fe5a3b78] ray::TerminateHandler() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fc9fd44d35a] __cxxabiv1::__terminate() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fc9fd44d3c5] (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fc9fd44d34f] (WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xe1a695) [0x7fc692462695] c10d::ProcessGroupNCCL::ncclCommWatchdog() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fc9fd477bf4] execute_native_thread_routine (WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc9ff2f0ac3] (WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7fc9ff381a04] __clone 
(WorkerDict pid=18980) (WorkerDict pid=18980) *** SIGABRT received at time=1743612578 on cpu 118 *** (WorkerDict pid=18980) PC: @ 0x7fc9ff2f29fc (unknown) pthread_kill (WorkerDict pid=18980) @ 0x7fc9ff29e520 (unknown) (unknown) (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: *** SIGABRT received at time=1743612578 on cpu 118 *** (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: PC: @ 0x7fc9ff2f29fc (unknown) pthread_kill (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: @ 0x7fc9ff29e520 (unknown) (unknown) (WorkerDict pid=18980) Fatal Python error: Aborted (WorkerDict pid=18980) (WorkerDict pid=18980) (WorkerDict pid=18980) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, zstandard.backend_c, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, 
pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, markupsafe._speedups, PIL._imaging, msgspec._core, sentencepiece._sentencepiece, PIL._imagingft, regex._regex, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pyarrow._json, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils (total: 96) Error executing job with overrides: ['algorithm.adv_estimator=gae', 'data.train_files=/github/home/data/gsm8k/train.parquet', 'data.val_files=/github/home/data/gsm8k/test.parquet', 'data.train_batch_size=1024', 'data.max_prompt_length=512', 'data.max_response_length=512', 'actor_rollout_ref.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'actor_rollout_ref.actor.optim.lr=2e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4', 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.actor.use_kl_loss=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.5', 
'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16', 'actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2', 'critic.optim.lr=2e-5', 'critic.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'critic.model.enable_gradient_checkpointing=False', 'critic.ppo_micro_batch_size_per_gpu=4', 'critic.megatron.pipeline_model_parallel_size=2', 'critic.megatron.virtual_pipeline_model_parallel_size=2', 'critic.megatron.tensor_model_parallel_size=2', 'algorithm.use_kl_in_reward=True', 'algorithm.kl_penalty=kl', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console]', 'trainer.project_name=verl_megatron_gsm8k_examples', 'trainer.experiment_name=deepseek_llm_1b3_function_rm', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=1', 'trainer.total_epochs=15', 'trainer.total_training_steps=3'] (TaskRunner pid=10086) Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer after "####". (TaskRunner pid=10086) ### Response: (TaskRunner pid=10086) (TaskRunner pid=10086) [response] I'm sorry, but as an AI programming assistant, I'm specialized in answering questions related to computer science. I'm not equipped to provide answers to questions about economics or business calculations. I recommend using a calculator or a business-oriented tool for this type of question. 
(TaskRunner pid=10086) (TaskRunner pid=10086) [ground_truth] 18 (TaskRunner pid=10086) [score] 0.0 (TaskRunner pid=10086) step:1 - global_seqlen/min:48635.000 - global_seqlen/max:51694.000 - global_seqlen/minmax_diff:3059.000 - global_seqlen/balanced_min:49636.000 - global_seqlen/balanced_max:49637.000 - global_seqlen/mean:49636.125 - actor/reward_kl_penalty:0.000 - actor/reward_kl_penalty_coeff:0.001 - critic/vf_loss:0.015 - critic/vf_clipfrac:0.001 - critic/vpred_mean:0.007 - perf/mfu/critic:0.105 - actor/entropy_loss:0.550 - actor/pg_loss:-0.000 - actor/pg_clipfrac:0.018 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - perf/mfu/actor:0.106 - critic/score/mean:0.000 - critic/score/max:0.000 - critic/score/min:0.000 - critic/rewards/mean:0.000 - critic/rewards/max:0.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.000 - critic/advantages/max:4.994 - critic/advantages/min:-5.666 - critic/returns/mean:-0.000 - critic/returns/max:0.000 - critic/returns/min:-0.000 - critic/values/mean:-0.164 - critic/values/max:0.785 - critic/values/min:-1.000 - critic/vf_explained_var:-2803.085 - response_length/mean:239.112 - response_length/max:512.000 - response_length/min:11.000 - response_length/clip_ratio:0.029 - prompt_length/mean:148.670 - prompt_length/max:275.000 - prompt_length/min:106.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:18.608 - timing_s/old_log_prob:15.249 - timing_s/ref:14.488 - timing_s/values:16.315 - timing_s/adv:0.264 - timing_s/update_critic:33.651 - timing_s/update_actor:33.472 - timing_s/testing:25.497 - timing_s/step:157.587 - timing_per_token_ms/adv:0.001 - timing_per_token_ms/gen:0.076 - timing_per_token_ms/update_actor:0.084 - timing_per_token_ms/values:0.041 - timing_per_token_ms/update_critic:0.085 - timing_per_token_ms/ref:0.036 - 
perf/total_num_tokens:397089.000 - perf/time_per_step:157.587 - perf/throughput:314.976 (TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[] (TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True} (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Traceback (most recent call last): File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 54, in main run_ppo(config) File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 72, in run_ppo ray.get(runner.run.remote(config)) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2667, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 864, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=10086, ip=172.20.0.2, 
actor_id=11bc451866f5759f3a7f540501000000, repr=<main_ppo.TaskRunner object at 0x7fd00c61a110>) File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 184, in run trainer.fit() File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 950, in fit val_metrics: dict = self._validate() File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 545, in _validate test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded) File "/data00/tiger/huggingface/verl/verl/verl/single_controller/ray/base.py", line 42, in func output = ray.get(output) ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=18980, ip=172.20.0.2, actor_id=4f21075809bd462a5907ebea01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7fc62ae1ce20>) File "/root/miniconda3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1708, in execute_model output: SamplerOutput = self.model.sample( File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 571, in sample next_tokens = self.sampler(logits, sampling_metadata) File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 231, in forward self._init_sampling_tensors(logits, sampling_metadata) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 195, in _init_sampling_tensors do_min_p) = SamplingTensors.from_sampling_metadata( File 
"/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 471, in from_sampling_metadata sampling_tensors = SamplingTensors.from_lists( File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 529, in from_lists temperatures_t = torch.tensor( RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. ```

* Improve installation instructions; clean up .egg_info
  Signed-off-by: SumanthRH <[email protected]>
* updates
  Signed-off-by: SumanthRH <[email protected]>
* x
  Signed-off-by: SumanthRH <[email protected]>
* x
  Signed-off-by: SumanthRH <[email protected]>
* x
  Signed-off-by: SumanthRH <[email protected]>
* add python version
  Signed-off-by: SumanthRH <[email protected]>

---------

Signed-off-by: SumanthRH <[email protected]>

Reverts volcengine#706 temporarily as it breaks CI https://github.com/volcengine/verl/actions/runs/14220739954/attempts/2 ``` (TaskRunner pid=10086) 'Initial validation metrics: {}' (TaskRunner pid=10086) step:0 (TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[] (TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True} (TaskRunner pid=10086) validation generation end (TaskRunner pid=10086) [prompt] You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer (TaskRunner pid=10086) ### Instruction: (TaskRunner pid=10086) Training Progress: 33%|███▎ | 1/3 [02:39<05:18, 159.11s/it] (WorkerDict pid=18977) /root/miniconda3/lib/python3.10/site-packages/torch/autograd/graph.py:768: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) 
[repeated 7x across cluster] (WorkerDict pid=18977) return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [repeated 7x across cluster] (TaskRunner pid=10086) Training Progress: 33%|███▎ | 1/3 [04:51<09:43, 291.93s/it] (WorkerDict pid=18980) [rank4]:[E402 16:49:38.988158820 ProcessGroupNCCL.cpp:1515] [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. (WorkerDict pid=18980) (WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=18980) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #5: 
c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame #8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame #9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) [2025-04-02 16:49:38,666 E 18980 20767] logging.cc:97: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 
(WorkerDict pid=18980) (WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=18980) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame #8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame #9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) Exception raised from ncclCommWatchdog at 
../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: <unknown function> + 0xe1a5e4 (0x7fc6924625e4 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #2: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame #3: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame #4: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:104: Stack trace: (WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe543a) [0x7fc9fe5a143a] ray::operator<<() (WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe7b78) [0x7fc9fe5a3b78] ray::TerminateHandler() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fc9fd44d35a] __cxxabiv1::__terminate() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fc9fd44d3c5] (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fc9fd44d34f] (WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xe1a695) [0x7fc692462695] c10d::ProcessGroupNCCL::ncclCommWatchdog() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fc9fd477bf4] execute_native_thread_routine (WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc9ff2f0ac3] (WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7fc9ff381a04] __clone (WorkerDict pid=18980) (WorkerDict pid=18980) *** SIGABRT received at 
time=1743612578 on cpu 118 *** (WorkerDict pid=18980) PC: @ 0x7fc9ff2f29fc (unknown) pthread_kill (WorkerDict pid=18980) @ 0x7fc9ff29e520 (unknown) (unknown) (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: *** SIGABRT received at time=1743612578 on cpu 118 *** (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: PC: @ 0x7fc9ff2f29fc (unknown) pthread_kill (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: @ 0x7fc9ff29e520 (unknown) (unknown) (WorkerDict pid=18980) Fatal Python error: Aborted (WorkerDict pid=18980) (WorkerDict pid=18980) (WorkerDict pid=18980) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, zstandard.backend_c, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, 
pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, markupsafe._speedups, PIL._imaging, msgspec._core, sentencepiece._sentencepiece, PIL._imagingft, regex._regex, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pyarrow._json, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils (total: 96) Error executing job with overrides: ['algorithm.adv_estimator=gae', 'data.train_files=/github/home/data/gsm8k/train.parquet', 'data.val_files=/github/home/data/gsm8k/test.parquet', 'data.train_batch_size=1024', 'data.max_prompt_length=512', 'data.max_response_length=512', 'actor_rollout_ref.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'actor_rollout_ref.actor.optim.lr=2e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4', 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.actor.use_kl_loss=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16', 
'actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2', 'critic.optim.lr=2e-5', 'critic.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'critic.model.enable_gradient_checkpointing=False', 'critic.ppo_micro_batch_size_per_gpu=4', 'critic.megatron.pipeline_model_parallel_size=2', 'critic.megatron.virtual_pipeline_model_parallel_size=2', 'critic.megatron.tensor_model_parallel_size=2', 'algorithm.use_kl_in_reward=True', 'algorithm.kl_penalty=kl', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console]', 'trainer.project_name=verl_megatron_gsm8k_examples', 'trainer.experiment_name=deepseek_llm_1b3_function_rm', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=1', 'trainer.total_epochs=15', 'trainer.total_training_steps=3'] (TaskRunner pid=10086) Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer after "####". (TaskRunner pid=10086) ### Response: (TaskRunner pid=10086) (TaskRunner pid=10086) [response] I'm sorry, but as an AI programming assistant, I'm specialized in answering questions related to computer science. I'm not equipped to provide answers to questions about economics or business calculations. I recommend using a calculator or a business-oriented tool for this type of question. 
(TaskRunner pid=10086) (TaskRunner pid=10086) [ground_truth] 18 (TaskRunner pid=10086) [score] 0.0 (TaskRunner pid=10086) step:1 - global_seqlen/min:48635.000 - global_seqlen/max:51694.000 - global_seqlen/minmax_diff:3059.000 - global_seqlen/balanced_min:49636.000 - global_seqlen/balanced_max:49637.000 - global_seqlen/mean:49636.125 - actor/reward_kl_penalty:0.000 - actor/reward_kl_penalty_coeff:0.001 - critic/vf_loss:0.015 - critic/vf_clipfrac:0.001 - critic/vpred_mean:0.007 - perf/mfu/critic:0.105 - actor/entropy_loss:0.550 - actor/pg_loss:-0.000 - actor/pg_clipfrac:0.018 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - perf/mfu/actor:0.106 - critic/score/mean:0.000 - critic/score/max:0.000 - critic/score/min:0.000 - critic/rewards/mean:0.000 - critic/rewards/max:0.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.000 - critic/advantages/max:4.994 - critic/advantages/min:-5.666 - critic/returns/mean:-0.000 - critic/returns/max:0.000 - critic/returns/min:-0.000 - critic/values/mean:-0.164 - critic/values/max:0.785 - critic/values/min:-1.000 - critic/vf_explained_var:-2803.085 - response_length/mean:239.112 - response_length/max:512.000 - response_length/min:11.000 - response_length/clip_ratio:0.029 - prompt_length/mean:148.670 - prompt_length/max:275.000 - prompt_length/min:106.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:18.608 - timing_s/old_log_prob:15.249 - timing_s/ref:14.488 - timing_s/values:16.315 - timing_s/adv:0.264 - timing_s/update_critic:33.651 - timing_s/update_actor:33.472 - timing_s/testing:25.497 - timing_s/step:157.587 - timing_per_token_ms/adv:0.001 - timing_per_token_ms/gen:0.076 - timing_per_token_ms/update_actor:0.084 - timing_per_token_ms/values:0.041 - timing_per_token_ms/update_critic:0.085 - timing_per_token_ms/ref:0.036 - 
perf/total_num_tokens:397089.000 - perf/time_per_step:157.587 - perf/throughput:314.976 (TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[] (TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True} (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Traceback (most recent call last): File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 54, in main run_ppo(config) File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 72, in run_ppo ray.get(runner.run.remote(config)) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2667, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 864, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=10086, ip=172.20.0.2, 
actor_id=11bc451866f5759f3a7f540501000000, repr=<main_ppo.TaskRunner object at 0x7fd00c61a110>) File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 184, in run trainer.fit() File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 950, in fit val_metrics: dict = self._validate() File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 545, in _validate test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded) File "/data00/tiger/huggingface/verl/verl/verl/single_controller/ray/base.py", line 42, in func output = ray.get(output) ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=18980, ip=172.20.0.2, actor_id=4f21075809bd462a5907ebea01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7fc62ae1ce20>) File "/root/miniconda3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1708, in execute_model output: SamplerOutput = self.model.sample( File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 571, in sample next_tokens = self.sampler(logits, sampling_metadata) File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 231, in forward self._init_sampling_tensors(logits, sampling_metadata) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 195, in _init_sampling_tensors do_min_p) = SamplingTensors.from_sampling_metadata( File 
"/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 471, in from_sampling_metadata sampling_tensors = SamplingTensors.from_lists( File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 529, in from_lists temperatures_t = torch.tensor( RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. ```
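The error text in the log above suggests rerunning with `CUDA_LAUNCH_BLOCKING=1`, so that the reported stack trace lines up with the kernel that actually failed. A minimal sketch of passing that variable to a child process follows; the helper names `blocking_env` and `run_with_blocking` and the stand-in child command are illustrative assumptions, not part of verl or of the CI job.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set.

    `base` defaults to the current process environment; the original
    mapping is not modified.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking(cmd):
    """Run `cmd` (a list of arguments) with the debugging variable set."""
    return subprocess.run(
        cmd, env=blocking_env(), capture_output=True, text=True, check=True
    )


# Stand-in child command (hypothetical): it only echoes the variable so the
# sketch can run without a GPU. A real debugging run would launch the
# training entry point here instead.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])",
]
```

Synchronous launches slow training down noticeably, so this is normally enabled only while reproducing the failure.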

[trainer, fsdp, vllm, recipe] feat: one step off async training recipe

Upload single-node and dual-node scripts

PeterSH6 added 4 commits October 31, 2024 23:18

[misc] fix: fix pypi package missing of megatron model and fix torchrun issue

f5a0b07

lint

448ebad

fix sft script

7395116

update version

ab93baa

PeterSH6 requested a review from vermouth1992 November 1, 2024 05:01

vermouth1992 approved these changes Nov 1, 2024


vermouth1992 merged commit 53bb5d2 into main Nov 1, 2024

vermouth1992 deleted the gm/release2 branch November 1, 2024 05:06

shenofusc mentioned this pull request Mar 31, 2025

Client socket time out when run verl.trainer.main_ppo with ray in Ascend NPU 910B #849

Open

yyu6969 pushed a commit to yyu6969/verl that referenced this pull request Apr 19, 2025

volcengine#3

6fb6770

yuchenwang3 pushed a commit to yuchenwang3/verl that referenced this pull request Apr 25, 2025

[misc] fix: resolve pypi missing directory (volcengine#3)

f453d9d

* [misc] fix: fix pypi package missing of megatron model and fix torchrun issue
* lint
* fix sft script
* update version
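For context on why a missing init file can drop a directory from a PyPI build: `setuptools.find_packages()` only reports directories that contain an `__init__.py`, and it does not descend into directories that lack one. The sketch below imitates that rule with the standard library only; `discover_packages`, `make_tree`, `demo`, and the `demo_pkg/models/megatron_like` layout are hypothetical names chosen for illustration, not verl's actual tree or build configuration.

```python
import os
import tempfile


def discover_packages(root):
    """List dotted names of directories under `root` that hold __init__.py.

    Simplified stand-in for setuptools.find_packages(): a directory without
    __init__.py is skipped, and so is everything beneath it.
    """
    found = []
    for current, dirs, _files in os.walk(root):
        kept = []
        for name in sorted(dirs):
            full = os.path.join(current, name)
            if os.path.isfile(os.path.join(full, "__init__.py")):
                rel = os.path.relpath(full, root)
                found.append(rel.replace(os.sep, "."))
                kept.append(name)
        dirs[:] = kept  # prune: do not walk into non-package directories
    return found


def make_tree(root, with_init):
    """Create the hypothetical layout demo_pkg/models/megatron_like/layer.py."""
    leaf = os.path.join(root, "demo_pkg", "models", "megatron_like")
    os.makedirs(leaf)
    for pkg in ("demo_pkg", os.path.join("demo_pkg", "models")):
        open(os.path.join(root, pkg, "__init__.py"), "w").close()
    open(os.path.join(leaf, "layer.py"), "w").close()
    if with_init:
        open(os.path.join(leaf, "__init__.py"), "w").close()


def demo(with_init):
    """Build the tree in a temporary directory and return what is discovered."""
    with tempfile.TemporaryDirectory() as tmp:
        make_tree(tmp, with_init)
        return discover_packages(tmp)
```

Calling `demo(False)` leaves `demo_pkg.models.megatron_like` out of the result even though `layer.py` exists on disk, while `demo(True)` includes it.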

hgl71964 mentioned this pull request May 6, 2025

NCCL timeout Error using sglang rollout #1420

Closed

yuleiqin mentioned this pull request May 10, 2025

TimeOutError when ray::WorkerDict.critic_init_model #1309

Open

yuleiqin mentioned this pull request May 20, 2025

Memory utilization keeps climbing during training and eventually hits OOM #630

Open

LIYHUI mentioned this pull request Jul 10, 2025

[Bug] dist_checkpointing stuck on communication with MoE models in distributed environment #2238

Open

eric-haibin-lin pushed a commit that referenced this pull request Jul 17, 2025

Merge pull request #3 from imh966/recipe/async_training_rollout_nodes

8e5b714

[trainer, fsdp, vllm, recipe] feat: one step off async training recipe

pillumina pushed a commit to pillumina/verl that referenced this pull request Jul 24, 2025

Merge pull request volcengine#3 from wlf-darkmatter/main

041ae97

Upload single-node and dual-node scripts

SimonHeye mentioned this pull request Sep 23, 2025

RuntimeError: CUDA error: misaligned address #3579

Open
