Description
System Info
----------Python Info----------
Version : 3.10.12
Compiler : GCC 11.4.0
Build : ('main', 'Nov 20 2023 15:14:05')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
No corresponding pip install for current python.
vllm : 0.10.0
sglang : 0.4.10.post2
ray : 2.49.1
torch : 2.7.1
----------verl Info-----------
Version : 0.5.0.dev
Directory : /mnt/home/t-miazhang/gongrui/ms-deepresearch/verl/verl
Commit Hash : 2d6c6db
----------Platform Info----------
Platform : Linux-6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc-x86_64-with-glibc2.35
system : Linux
node : h100-226-147
release : 6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc
version : #1 SMP PREEMPT_DYNAMIC Mon Oct 14 20:37:13 UTC 2024
----------Environment----------
CUDA Runtime : 12.6
CUDA Compiler : Cuda compilation tools, release 12.4, V12.4.131
----------System Info----------
CPU Memory : 2014.42 GB
GPU Count : 8
GPU 1 Type : NVIDIA H100 80GB HBM3
GPU 1 Memory : 79.65 GB
GPU 2 Type : NVIDIA H100 80GB HBM3
GPU 2 Memory : 79.65 GB
GPU 3 Type : NVIDIA H100 80GB HBM3
GPU 3 Memory : 79.65 GB
GPU 4 Type : NVIDIA H100 80GB HBM3
GPU 4 Memory : 79.65 GB
GPU 5 Type : NVIDIA H100 80GB HBM3
GPU 5 Memory : 79.65 GB
GPU 6 Type : NVIDIA H100 80GB HBM3
GPU 6 Memory : 79.65 GB
GPU 7 Type : NVIDIA H100 80GB HBM3
GPU 7 Memory : 79.65 GB
GPU 8 Type : NVIDIA H100 80GB HBM3
GPU 8 Memory : 79.65 GB
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When using sglang for async mode rollout, I encounter this error:
```
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return func(*args, **kwargs) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 778, in event_loop_overlap [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     self.process_input_requests(recv_reqs) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1050, in process_input_requests [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     output = self._request_dispatcher(recv_req) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/utils.py", line 479, in __call__ [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return fn(obj) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 874, in update_weights_from_tensor [repeated 108x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     success, message = self.tp_worker.update_weights_from_tensor(recv_req) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     success, message = self.worker.update_weights_from_tensor(recv_req) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     success, message = self.model_runner.update_weights_from_tensor( [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     named_tensors = [ [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 875, in <listcomp> [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     (name, _unwrap_tensor(tensor, tp_rank=self.tp_rank)) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 1774, in _unwrap_tensor [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     tensor = tensor.get(tp_rank) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 1786, in get [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return MultiprocessingSerializer.deserialize(self.values[rank]) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/utils.py", line 1869, in deserialize [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return ForkingPickler.loads(data) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/patch_torch.py", line 51, in _rebuild_cuda_tensor_modified [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return reductions._rebuild_cuda_tensor_original(*args) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     storage = storage_cls._new_shared_cuda( [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/torch/storage.py", line 1452, in _new_shared_cuda [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return torch.UntypedStorage._new_shared_cuda(*args, **kwargs) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18) RuntimeError: pidfd_getfd: Operation not permitted [repeated 27x across cluster]
```
However, the error disappeared when switching to vllm.
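From the bottom of the traceback, the failure is in PyTorch's CUDA IPC path: rebuilding a shared CUDA tensor in the receiving process calls the `pidfd_getfd` syscall, which the kernel only permits when the caller has ptrace-level access to the sending process — typically blocked in containers without `CAP_SYS_PTRACE` or under a restrictive seccomp profile. As a minimal probe (an assumption-laden sketch, not part of verl or sglang: it hard-codes x86_64 Linux syscall numbers 434/438 matching the platform in this report), you can check whether `pidfd_getfd` is permitted at all in your environment:

```python
import ctypes
import errno
import os

# Assumption: Linux on x86_64 (as in this report); syscall numbers differ per arch.
SYS_pidfd_open = 434   # Linux >= 5.3
SYS_pidfd_getfd = 438  # Linux >= 5.6

_libc = ctypes.CDLL(None, use_errno=True)


def probe_pidfd_getfd(pid: int, target_fd: int = 0) -> str:
    """Try to duplicate target_fd out of process `pid` via pidfd_getfd.

    Returns "ok" on success, or the errno name on failure. "EPERM" is the
    same condition PyTorch hits in _new_shared_cuda: the caller lacks
    ptrace-level access to the sending process (common in containers
    without CAP_SYS_PTRACE, or under a restrictive seccomp profile).
    """
    pidfd = _libc.syscall(SYS_pidfd_open, pid, 0)
    if pidfd < 0:
        return errno.errorcode.get(ctypes.get_errno(), "unknown")
    try:
        fd = _libc.syscall(SYS_pidfd_getfd, pidfd, target_fd, 0)
        if fd < 0:
            return errno.errorcode.get(ctypes.get_errno(), "unknown")
        os.close(fd)
        return "ok"
    finally:
        os.close(pidfd)


if __name__ == "__main__":
    # Probing our own process: "EPERM" here means the environment blocks
    # pidfd_getfd outright, so cross-process CUDA IPC cannot work either.
    print(probe_pidfd_getfd(os.getpid()))
```

If this prints `EPERM` even for your own PID, the fix is at the container/runtime level (e.g. granting `SYS_PTRACE`), not in verl or sglang code.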
#2846 solves this by using fsdp2, but I don't know how to solve it when using megatron.
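One backend-agnostic workaround, independent of the fsdp2 fix, is to stage the weights through CPU memory before they cross the process boundary: CPU tensors are pickled by value, so no CUDA IPC handle (and hence no `pidfd_getfd` call) is involved. The helper below is purely illustrative — `stage_named_tensors_via_cpu` is not a verl or sglang API, just a sketch of the idea:

```python
import torch


def stage_named_tensors_via_cpu(named_tensors):
    """Hypothetical helper (not verl's actual API): copy GPU tensors to CPU
    before handing them to a cross-process weight update. CPU tensors are
    serialized by value, sidestepping CUDA IPC and its pidfd_getfd
    requirement; the receiver would move them back with .to("cuda").
    """
    return [(name, t.detach().to("cpu")) for name, t in named_tensors]
```

This trades extra host/device copies per sync for compatibility with environments where CUDA IPC is forbidden, so it is only attractive when the capability cannot be granted.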
Expected behavior
sglang and megatron should work together.