Description
System Info
----------Python Info----------
Version : 3.10.12
Compiler : GCC 11.4.0
Build : ('main', 'Nov 20 2023 15:14:05')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
No corresponding pip install for current python.
vllm : 0.10.0
sglang : 0.4.10.post2
ray : 2.49.1
torch : 2.7.1
----------verl Info-----------
Version : 0.5.0.dev
Directory : /mnt/home/t-miazhang/gongrui/ms-deepresearch/verl/verl
Commit Hash : 2d6c6db
----------Platform Info----------
Platform : Linux-6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc-x86_64-with-glibc2.35
system : Linux
node : h100-226-147
release : 6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc
version : #1 SMP PREEMPT_DYNAMIC Mon Oct 14 20:37:13 UTC 2024
----------Environment----------
CUDA Runtime : 12.6
CUDA Compiler : Cuda compilation tools, release 12.4, V12.4.131
----------System Info----------
CPU Memory : 2014.42 GB
GPU Count : 8
GPU 1 Type : NVIDIA H100 80GB HBM3
GPU 1 Memory : 79.65 GB
GPU 2 Type : NVIDIA H100 80GB HBM3
GPU 2 Memory : 79.65 GB
GPU 3 Type : NVIDIA H100 80GB HBM3
GPU 3 Memory : 79.65 GB
GPU 4 Type : NVIDIA H100 80GB HBM3
GPU 4 Memory : 79.65 GB
GPU 5 Type : NVIDIA H100 80GB HBM3
GPU 5 Memory : 79.65 GB
GPU 6 Type : NVIDIA H100 80GB HBM3
GPU 6 Memory : 79.65 GB
GPU 7 Type : NVIDIA H100 80GB HBM3
GPU 7 Memory : 79.65 GB
GPU 8 Type : NVIDIA H100 80GB HBM3
GPU 8 Memory : 79.65 GB
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When using sglang for async mode rollout, I encounter this error:
```
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return func(*args, **kwargs) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 778, in event_loop_overlap [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     self.process_input_requests(recv_reqs) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1050, in process_input_requests [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     output = self._request_dispatcher(recv_req) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/utils.py", line 479, in __call__ [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return fn(obj) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 874, in update_weights_from_tensor [repeated 108x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     success, message = self.tp_worker.update_weights_from_tensor(recv_req) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     success, message = self.worker.update_weights_from_tensor(recv_req) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     success, message = self.model_runner.update_weights_from_tensor( [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     named_tensors = [ [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 875, in <listcomp> [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     (name, _unwrap_tensor(tensor, tp_rank=self.tp_rank)) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 1774, in _unwrap_tensor [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     tensor = tensor.get(tp_rank) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 1786, in get [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return MultiprocessingSerializer.deserialize(self.values[rank]) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/utils.py", line 1869, in deserialize [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return ForkingPickler.loads(data) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/sglang/srt/patch_torch.py", line 51, in _rebuild_cuda_tensor_modified [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return reductions._rebuild_cuda_tensor_original(*args) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     storage = storage_cls._new_shared_cuda( [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)   File "/mnt/home/t-miazhang/gongrui/ms-deepresearch/.venv/lib/python3.10/site-packages/torch/storage.py", line 1452, in _new_shared_cuda [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18)     return torch.UntypedStorage._new_shared_cuda(*args, **kwargs) [repeated 27x across cluster]
(WorkerDict pid=1011655, ip=10.3.176.18) RuntimeError: pidfd_getfd: Operation not permitted [repeated 27x across cluster]
```
However, the error disappeared when switching to vllm.
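From the bottom of the traceback, the failure is in PyTorch's CUDA IPC path: rebuilding a shared CUDA tensor in the receiving process calls the `pidfd_getfd` syscall, which the kernel only permits when the caller has ptrace-level access to the sending process — typically blocked in containers without `CAP_SYS_PTRACE` or under a restrictive seccomp profile. As a minimal probe (an assumption-laden sketch, not part of verl or sglang: it hard-codes x86_64 Linux syscall numbers 434/438 matching the platform in this report), you can check whether `pidfd_getfd` is permitted at all in your environment:

```python
import ctypes
import errno
import os

# Assumption: Linux on x86_64 (as in this report); syscall numbers differ per arch.
SYS_pidfd_open = 434   # Linux >= 5.3
SYS_pidfd_getfd = 438  # Linux >= 5.6

_libc = ctypes.CDLL(None, use_errno=True)


def probe_pidfd_getfd(pid: int, target_fd: int = 0) -> str:
    """Try to duplicate target_fd out of process `pid` via pidfd_getfd.

    Returns "ok" on success, or the errno name on failure. "EPERM" is the
    same condition PyTorch hits in _new_shared_cuda: the caller lacks
    ptrace-level access to the sending process (common in containers
    without CAP_SYS_PTRACE, or under a restrictive seccomp profile).
    """
    pidfd = _libc.syscall(SYS_pidfd_open, pid, 0)
    if pidfd < 0:
        return errno.errorcode.get(ctypes.get_errno(), "unknown")
    try:
        fd = _libc.syscall(SYS_pidfd_getfd, pidfd, target_fd, 0)
        if fd < 0:
            return errno.errorcode.get(ctypes.get_errno(), "unknown")
        os.close(fd)
        return "ok"
    finally:
        os.close(pidfd)


if __name__ == "__main__":
    # Probing our own process: "EPERM" here means the environment blocks
    # pidfd_getfd outright, so cross-process CUDA IPC cannot work either.
    print(probe_pidfd_getfd(os.getpid()))
```

If this prints `EPERM` even for your own PID, the fix is at the container/runtime level (e.g. granting `SYS_PTRACE`), not in verl or sglang code.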
#2846 solves this by using fsdp2, but I don't know how to solve it when using megatron.
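One backend-agnostic workaround, independent of the fsdp2 fix, is to stage the weights through CPU memory before they cross the process boundary: CPU tensors are pickled by value, so no CUDA IPC handle (and hence no `pidfd_getfd` call) is involved. The helper below is purely illustrative — `stage_named_tensors_via_cpu` is not a verl or sglang API, just a sketch of the idea:

```python
import torch


def stage_named_tensors_via_cpu(named_tensors):
    """Hypothetical helper (not verl's actual API): copy GPU tensors to CPU
    before handing them to a cross-process weight update. CPU tensors are
    serialized by value, sidestepping CUDA IPC and its pidfd_getfd
    requirement; the receiver would move them back with .to("cuda").
    """
    return [(name, t.detach().to("cpu")) for name, t in named_tensors]
```

This trades extra host/device copies per sync for compatibility with environments where CUDA IPC is forbidden, so it is only attractive when the capability cannot be granted.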
Expected behavior
sglang and megatron should work together.