Cannot run geo3k multiturn example #3647

@huaiyizhao

Description

System Info

I use the official image app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2


----------Python Info----------
Version : 3.12.3
Compiler : GCC 13.3.0
Build : ('main', 'Feb 4 2025 14:48:35')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 25.2
Directory : /usr/local/lib/python3.12/dist-packages/pip
vllm : not found.
sglang : 0.5.2
ray : 2.49.2
torch : 2.8.0
----------verl Info-----------
Version : 0.5.0.dev
Directory : /app/verl/verl
Commit Hash : 362ebfbcaf6d37c50003fef60f2176f9f76aaeb2
----------Platform Info----------
Platform : Linux-5.4.241-1-tlinux4-0017.7-x86_64-with-glibc2.39
system : Linux
node : TENCENT64.site
release : 5.4.241-1-tlinux4-0017.7
version : #1 SMP Thu Jan 18 11:33:00 CST 2024
----------Environment----------
CUDA Runtime : 12.8
CUDA Compiler : Cuda compilation tools, release 12.8, V12.8.93
----------System Info----------
CPU Memory : 2265.25 GB
GPU Count : 8
GPU 1 Type : NVIDIA H20
GPU 1 Memory : 95.58 GB
GPU 2 Type : NVIDIA H20
GPU 2 Memory : 95.58 GB
GPU 3 Type : NVIDIA H20
GPU 3 Memory : 95.58 GB
GPU 4 Type : NVIDIA H20
GPU 4 Memory : 95.58 GB
GPU 5 Type : NVIDIA H20
GPU 5 Memory : 95.58 GB
GPU 6 Type : NVIDIA H20
GPU 6 Memory : 95.58 GB
GPU 7 Type : NVIDIA H20
GPU 7 Memory : 95.58 GB
GPU 8 Type : NVIDIA H20
GPU 8 Memory : 95.58 GB


Running the multiturn example (bash examples/sglang_multiturn/geo3k/run_qwen2.5-3b_geo3k_multiturn.sh) fails with the following error:

ray.exceptions.RayTaskError(ValueError): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=340926, ip=29.177.195.134, actor_id=c4162d864d53bb90020f271101000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7ef66b0bc140>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/single_controller/ray/base.py", line 700, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/single_controller/base/decorator.py", line 433, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/utils/profiler/profile.py", line 256, in wrapper
return func(self_instance, *args, **kwargs_inner)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/workers/fsdp_workers.py", line 958, in compute_log_prob
output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/utils/profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/utils/profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/workers/actor/dp_actor.py", line 339, in compute_log_prob
entropy, log_probs = self._forward_micro_batch(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/workers/actor/dp_actor.py", line 170, in _forward_micro_batch
output = self.actor_module(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 854, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/models/transformers/qwen2_vl.py", line 474, in forward_with_normal_backend
outputs = qwen2_vl_forward(self, input_ids, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/models/transformers/qwen2_vl.py", line 447, in qwen2_vl_forward
position_ids=process_position_ids(position_ids),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/verl/verl/models/transformers/qwen2_vl.py", line 397, in process_position_ids
raise ValueError("position_ids should be a 3D tensor of shape (4, batch_size, seq_length).")
ValueError: position_ids should be a 3D tensor of shape (4, batch_size, seq_length).
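
The last frame shows process_position_ids rejecting the position_ids tensor it was given. As a debugging aid, here is a minimal sketch of that kind of shape check. It is reconstructed only from the error message above and is not verl's actual implementation. The helper name check_position_ids_shape is hypothetical. It takes a plain shape tuple, so it could be fed tuple(position_ids.shape) from a log statement placed just before the failing call.

```python
def check_position_ids_shape(shape, expected_leading=4):
    """Return a short diagnosis for a position_ids shape tuple.

    Assumption (taken from the error message only): the model expects a
    3D tensor laid out as (4, batch_size, seq_length).
    """
    shape = tuple(shape)
    if len(shape) != 3:
        return f"bad rank: got {len(shape)}D shape {shape}, expected 3D"
    if shape[0] != expected_leading:
        return (
            f"bad leading dim: got {shape[0]} in {shape}, "
            f"expected {expected_leading}"
        )
    return "ok"


if __name__ == "__main__":
    # Hypothetical shapes for illustration; none are taken from the failing run.
    for candidate in [(4, 2, 128), (2, 128), (3, 2, 128), (2, 4, 128)]:
        print(candidate, "->", check_position_ids_shape(candidate))
```

Logging the real shape this way would show whether the tensor is missing a dimension entirely or only has an unexpected leading dimension, which narrows down where in the data pipeline the mismatch is introduced.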

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. pull the official docker image
  2. start a container from the image
  3. pull verl (commit 362ebfbcaf6d37c50003fef60f2176f9f76aaeb2)
  4. pip install .
  5. python examples/data_preprocess/geo3k_multiturn_w_tool.py
  6. bash examples/sglang_multiturn/geo3k/run_qwen2.5-3b_geo3k_multiturn.sh

Expected behavior

The example script runs to completion without errors.

Labels: bug