Description
System Info
I am using the official image app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2.
----------Python Info----------
Version : 3.12.3
Compiler : GCC 13.3.0
Build : ('main', 'Feb 4 2025 14:48:35')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 25.2
Directory : /usr/local/lib/python3.12/dist-packages/pip
vllm : not found.
sglang : 0.5.2
ray : 2.49.2
torch : 2.8.0
----------verl Info-----------
Version : 0.5.0.dev
Directory : /app/verl/verl
Commit Hash : 362ebfbcaf6d37c50003fef60f2176f9f76aaeb2
----------Platform Info----------
Platform : Linux-5.4.241-1-tlinux4-0017.7-x86_64-with-glibc2.39
system : Linux
node : TENCENT64.site
release : 5.4.241-1-tlinux4-0017.7
version : #1 SMP Thu Jan 18 11:33:00 CST 2024
----------Environment----------
CUDA Runtime : 12.8
CUDA Compiler : Cuda compilation tools, release 12.8, V12.8.93
----------System Info----------
CPU Memory : 2265.25 GB
GPU Count : 8
GPU 1 Type : NVIDIA H20
GPU 1 Memory : 95.58 GB
GPU 2 Type : NVIDIA H20
GPU 2 Memory : 95.58 GB
GPU 3 Type : NVIDIA H20
GPU 3 Memory : 95.58 GB
GPU 4 Type : NVIDIA H20
GPU 4 Memory : 95.58 GB
GPU 5 Type : NVIDIA H20
GPU 5 Memory : 95.58 GB
GPU 6 Type : NVIDIA H20
GPU 6 Memory : 95.58 GB
GPU 7 Type : NVIDIA H20
GPU 7 Memory : 95.58 GB
GPU 8 Type : NVIDIA H20
GPU 8 Memory : 95.58 GB
Running the multi-turn example with bash examples/sglang_multiturn/geo3k/run_qwen2.5-3b_geo3k_multiturn.sh fails with the following error:
ray.exceptions.RayTaskError(ValueError): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=340926, ip=29.177.195.134, actor_id=c4162d864d53bb90020f271101000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7ef66b0bc140>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/single_controller/ray/base.py", line 700, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/single_controller/base/decorator.py", line 433, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/utils/profiler/profile.py", line 256, in wrapper
return func(self_instance, *args, **kwargs_inner)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/workers/fsdp_workers.py", line 958, in compute_log_prob
output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/utils/profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/utils/profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/workers/actor/dp_actor.py", line 339, in compute_log_prob
entropy, log_probs = self._forward_micro_batch(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/workers/actor/dp_actor.py", line 170, in _forward_micro_batch
output = self.actor_module(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 854, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/models/transformers/qwen2_vl.py", line 474, in forward_with_normal_backend
outputs = qwen2_vl_forward(self, input_ids, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/models/transformers/qwen2_vl.py", line 447, in qwen2_vl_forward
position_ids=process_position_ids(position_ids),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/models/transformers/qwen2_vl.py", line 397, in process_position_ids
raise ValueError("position_ids should be a 3D tensor of shape (4, batch_size, seq_length).")
ValueError: position_ids should be a 3D tensor of shape (4, batch_size, seq_length).
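To narrow this down, it may help to log the shape of position_ids just before the forward call in dp_actor.py and compare it with what the error message asks for. The helper below is only a sketch inferred from the error text: the name describe_position_ids_shape is hypothetical, it is not part of verl, and the actual check inside process_position_ids is not shown in the traceback.

```python
from typing import Sequence

# Leading dimension taken from the error message: (4, batch_size, seq_length).
EXPECTED_LEADING_DIM = 4


def describe_position_ids_shape(shape: Sequence[int]) -> str:
    """Return "ok" if shape looks like (4, batch_size, seq_length), else the reason it does not.

    Hypothetical debugging helper; it only mirrors the wording of the ValueError above.
    """
    if len(shape) != 3:
        return f"expected 3 dims, got {len(shape)}: {tuple(shape)}"
    if shape[0] != EXPECTED_LEADING_DIM:
        return (
            f"expected leading dim {EXPECTED_LEADING_DIM}, "
            f"got {shape[0]}: {tuple(shape)}"
        )
    return "ok"


if __name__ == "__main__":
    # Illustrative shapes only; which one occurs in this run is not known.
    for candidate in [(4, 2, 128), (2, 128), (3, 2, 128)]:
        print(candidate, "->", describe_position_ids_shape(candidate))
```

With a real tensor this could be called as describe_position_ids_shape(tuple(position_ids.shape)). A 2D result would suggest that text-style position ids reached the Qwen2-VL path, while a leading dimension other than 4 would point to a layout mismatch; the traceback alone does not establish either.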
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- pull the official docker image
- start a container from the image
- pull verl (commit 362ebfbcaf6d37c50003fef60f2176f9f76aaeb2)
- pip install .
- python examples/data_preprocess/geo3k_multiturn_w_tool.py
- bash examples/sglang_multiturn/geo3k/run_qwen2.5-3b_geo3k_multiturn.sh
Expected behavior
The example script runs to completion without raising the position_ids ValueError.