RuntimeError: pidfd_getfd: Operation not permitted #3377

@TideDra

System Info

----------Python Info----------
Version : 3.10.12
Compiler : GCC 11.4.0
Build : ('main', 'Nov 20 2023 15:14:05')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
No corresponding pip install for current python.
vllm : 0.10.0
sglang : 0.4.10.post2
ray : 2.49.1
torch : 2.7.1
----------verl Info-----------
Version : 0.5.0.dev
Directory : /mnt/home/t-miazhang/gongrui/ms-deepresearch/verl/verl
Commit Hash : 2d6c6db
----------Platform Info----------
Platform : Linux-6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc-x86_64-with-glibc2.35
system : Linux
node : h100-226-147
release : 6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc
version : #1 SMP PREEMPT_DYNAMIC Mon Oct 14 20:37:13 UTC 2024
----------Environment----------
CUDA Runtime : 12.6
CUDA Compiler : Cuda compilation tools, release 12.4, V12.4.131
----------System Info----------
CPU Memory : 2014.42 GB
GPU Count : 8
GPU 1 Type : NVIDIA H100 80GB HBM3
GPU 1 Memory : 79.65 GB
GPU 2 Type : NVIDIA H100 80GB HBM3
GPU 2 Memory : 79.65 GB
GPU 3 Type : NVIDIA H100 80GB HBM3
GPU 3 Memory : 79.65 GB
GPU 4 Type : NVIDIA H100 80GB HBM3
GPU 4 Memory : 79.65 GB
GPU 5 Type : NVIDIA H100 80GB HBM3
GPU 5 Memory : 79.65 GB
GPU 6 Type : NVIDIA H100 80GB HBM3
GPU 6 Memory : 79.65 GB
GPU 7 Type : NVIDIA H100 80GB HBM3
GPU 7 Memory : 79.65 GB
GPU 8 Type : NVIDIA H100 80GB HBM3
GPU 8 Memory : 79.65 GB

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When using sglang for async mode rollout, I encounter this error:

(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return func(*args, **kwargs) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 778, in event_loop_overlap [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     self.process_input_requests(recv_reqs) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1050, in process_input_requests [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     output = self._request_dispatcher(recv_req) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/utils.py", line 479, in __call__ [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return fn(obj) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 874, in update_weights_from_tensor [repeated 108x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     success, message = self.tp_worker.update_weights_from_tensor(recv_req) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     success, message = self.worker.update_weights_from_tensor(recv_req) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     success, message = self.model_runner.update_weights_from_tensor( [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     named_tensors = [ [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 875, in <listcomp> [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     (name, _unwrap_tensor(tensor, tp_rank=self.tp_rank)) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 1774, in _unwrap_tensor [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     tensor = tensor.get(tp_rank) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 1786, in get [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return MultiprocessingSerializer.deserialize(self.values[rank]) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/utils.py", line 1869, in deserialize [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return ForkingPickler.loads(data) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/patch_torch.py", line 51, in _rebuild_cuda_tensor_modified [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return reductions._rebuild_cuda_tensor_original(*args) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     storage = storage_cls._new_shared_cuda( [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/torch/storage.py", line 1452, in _new_shared_cuda [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return torch.UntypedStorage._new_shared_cuda(*args, **kwargs) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18) RuntimeError: pidfd_getfd: Operation not permitted [repeated 27x across cluster]

The error disappears when I switch the rollout backend to vllm.

#2846 solves this by using fsdp2, but I don't know how to solve it when using megatron.
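
For triage, a minimal diagnostic sketch follows. It rests on two assumptions that this issue does not confirm: that the `EPERM` comes from the ptrace-style access check the kernel performs for `pidfd_getfd(2)` (which Yama's `ptrace_scope` and the `CAP_SYS_PTRACE` capability both influence), and that torch reaches `pidfd_getfd` because `expandable_segments` is enabled in `PYTORCH_CUDA_ALLOC_CONF`. The helper names (`has_cap_sys_ptrace`, `read_ptrace_scope`, `parse_alloc_conf`, `report`) are hypothetical and not part of verl, sglang, or torch; the script only reads settings and changes nothing.

```python
import os
from pathlib import Path
from typing import Dict, Optional

CAP_SYS_PTRACE_BIT = 19  # bit index of CAP_SYS_PTRACE in linux/capability.h


def has_cap_sys_ptrace(status_text: str) -> bool:
    """Return True if the CapEff mask in /proc/<pid>/status text has CAP_SYS_PTRACE set."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return bool((mask >> CAP_SYS_PTRACE_BIT) & 1)
    return False


def read_ptrace_scope(path: str = "/proc/sys/kernel/yama/ptrace_scope") -> Optional[int]:
    """Return the Yama ptrace_scope value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def parse_alloc_conf(value: str) -> Dict[str, str]:
    """Split a PYTORCH_CUDA_ALLOC_CONF string ("key:value,key:value") into a dict."""
    conf: Dict[str, str] = {}
    for item in value.split(","):
        key, sep, val = item.partition(":")
        if sep:
            conf[key.strip()] = val.strip()
    return conf


def report() -> None:
    """Print the settings that may affect pidfd_getfd for the current process."""
    try:
        status = Path("/proc/self/status").read_text()
    except OSError:
        status = ""
    conf = parse_alloc_conf(os.environ.get("PYTORCH_CUDA_ALLOC_CONF", ""))
    print("yama ptrace_scope:", read_ptrace_scope())
    print("CAP_SYS_PTRACE effective:", has_cap_sys_ptrace(status))
    print("expandable_segments:", conf.get("expandable_segments", "unset"))


if __name__ == "__main__":
    report()
```

Running this in the same container and user context as the failing `WorkerDict` processes should show whether the environment restricts cross-process ptrace access (a non-zero `ptrace_scope` without `CAP_SYS_PTRACE`) and whether `expandable_segments` is on. If the assumptions above hold, those are the settings to compare against a setup where the weight update succeeds.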

Expected behavior

sglang and megatron should work together without this error.

Labels: bug