RuntimeError: pidfd_getfd: Operation not permitted #2846

@dark2luminosity

Description

Has anyone encountered this error? It appears right as training starts, when the SGLang scheduler handles update_weights_from_tensor. Full log below:

Capturing batches (avail_mem=32.75 GB): 0%| | 0/23 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:03<00:03, 3.85s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00, 2.98s/it] [repeated 7x across cluster]
Capturing batches (avail_mem=32.55 GB): 4%|▍ | 1/23 [00:00<00:11, 1.84it/s]
Capturing batches (avail_mem=32.75 GB): 4%|▍ | 1/23 [00:00<00:12, 1.81it/s]
Capturing batches (avail_mem=32.55 GB): 4%|▍ | 1/23 [00:00<00:12, 1.81it/s]
Capturing batches (avail_mem=32.75 GB): 0%| | 0/23 [00:00<?, ?it/s] [repeated 7x across cluster]
Capturing batches (avail_mem=31.18 GB): 83%|████████▎ | 19/23 [00:05<00:01, 3.78it/s] [repeated 145x across cluster]
Capturing batches (avail_mem=31.59 GB): 57%|█████▋ | 13/23 [00:03<00:02, 3.76it/s] [repeated 4x across cluster]
Capturing batches (avail_mem=31.45 GB): 91%|█████████▏| 21/23 [00:05<00:00, 3.63it/s]
Capturing batches (avail_mem=31.45 GB): 100%|██████████| 23/23 [00:06<00:00, 3.69it/s]
Capturing batches (avail_mem=31.45 GB): 100%|██████████| 23/23 [00:06<00:00, 3.56it/s]
(WorkerDict pid=3546248) /home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
(WorkerDict pid=3546248) warnings.warn(
(TaskRunner pid=3536185) wandb: Tracking run with wandb version 0.21.0
(TaskRunner pid=3536185) wandb: W&B syncing is set to offline in this directory. Run wandb online or set WANDB_MODE=online to enable cloud syncing.
(TaskRunner pid=3536185) Checkpoint tracker file does not exist: /data/abibulla/tester/verl/checkpoints/search_r1_like_async_rl/qwen2.5-3b-instruct_function_rm-search-async-sgl-multi-w-searchtool-verify-n16/latest_checkpointed_iteration.txt
(TaskRunner pid=3536185) Training from scratch
(WorkerDict pid=3546253) { [repeated 7x across cluster]
(WorkerDict pid=3546253) "type": "function", [repeated 7x across cluster]
(WorkerDict pid=3546253) "function": { [repeated 7x across cluster]
(WorkerDict pid=3546253) "name": "search", [repeated 7x across cluster]
(WorkerDict pid=3546253) "description": "Searches the web for relevant information based on the given query.", [repeated 7x across cluster]
(WorkerDict pid=3546253) "parameters": { [repeated 7x across cluster]
(WorkerDict pid=3546253) "type": "object", [repeated 7x across cluster]
(WorkerDict pid=3546253) "properties": { [repeated 7x across cluster]
(WorkerDict pid=3546253) "query_list": { [repeated 7x across cluster]
(WorkerDict pid=3546253) "type": "array", [repeated 7x across cluster]
(WorkerDict pid=3546253) "description": "A list of fully-formed semantic queries. The tool will return search results for each query." [repeated 7x across cluster]
(WorkerDict pid=3546253) } [repeated 28x across cluster]
(WorkerDict pid=3546253) }, [repeated 7x across cluster]
(WorkerDict pid=3546253) "required": [ [repeated 7x across cluster]
(WorkerDict pid=3546253) "query_list" [repeated 7x across cluster]
(WorkerDict pid=3546253) ] [repeated 7x across cluster]
Training Progress: 0%| | 0/2650 [00:00<?, ?it/s]
Capturing batches (avail_mem=31.46 GB): 87%|████████▋ | 20/23 [00:06<00:00, 3.37it/s] [repeated 10x across cluster]
Capturing batches (avail_mem=31.46 GB): 87%|████████▋ | 20/23 [00:05<00:00, 3.64it/s] [repeated 2x across cluster]
(WorkerDict pid=3545763) [2025-07-31 22:08:54] Scheduler hit an exception: Traceback (most recent call last):
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 2311, in run_scheduler_process
(WorkerDict pid=3545763) scheduler.event_loop_overlap()
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(WorkerDict pid=3545763) return func(*args, **kwargs)
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 662, in event_loop_overlap
(WorkerDict pid=3545763) self.process_input_requests(recv_reqs)
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 889, in process_input_requests
(WorkerDict pid=3545763) output = self._request_dispatcher(recv_req)
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/utils.py", line 471, in __call__
(WorkerDict pid=3545763) return fn(obj)
(WorkerDict pid=3545763) ^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 2035, in update_weights_from_tensor
(WorkerDict pid=3545763) success, message = self.tp_worker.update_weights_from_tensor(recv_req)
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 254, in update_weights_from_tensor
(WorkerDict pid=3545763) success, message = self.worker.update_weights_from_tensor(recv_req)
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 255, in update_weights_from_tensor
(WorkerDict pid=3545763) success, message = self.model_runner.update_weights_from_tensor(
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 742, in update_weights_from_tensor
(WorkerDict pid=3545763) named_tensors = [
(WorkerDict pid=3545763) ^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 743, in <listcomp>
(WorkerDict pid=3545763) (name, _unwrap_tensor(tensor, tp_rank=self.tp_rank))
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 1296, in _unwrap_tensor
(WorkerDict pid=3545763) tensor = tensor.get(tp_rank)
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 1308, in get
(WorkerDict pid=3545763) return MultiprocessingSerializer.deserialize(self.values[rank])
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/utils.py", line 1672, in deserialize
(WorkerDict pid=3545763) return ForkingPickler.loads(data)
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/sglang/srt/patch_torch.py", line 51, in _rebuild_cuda_tensor_modified
(WorkerDict pid=3545763) return reductions._rebuild_cuda_tensor_original(*args)
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor
(WorkerDict pid=3545763) storage = storage_cls._new_shared_cuda(
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) File "/home/abibulla/anaconda3/envs/verl/lib/python3.11/site-packages/torch/storage.py", line 1452, in _new_shared_cuda
(WorkerDict pid=3545763) return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
(WorkerDict pid=3545763) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=3545763) RuntimeError: pidfd_getfd: Operation not permitted
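
One thing that may be worth checking on the affected machine: according to the pidfd_getfd(2) man page, the call fails with EPERM when the caller lacks ptrace attach permission (PTRACE_MODE_ATTACH_REALCREDS) over the target process. On many Linux systems that permission is tightened by the Yama `kernel.yama.ptrace_scope` sysctl, and containers can restrict it further through dropped capabilities or seccomp. The sketch below only reads and interprets the Yama setting. It is a diagnostic aid, not a confirmed root cause for this issue, and the helper names `read_ptrace_scope` and `describe_ptrace_scope` are made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Standard location of the Yama sysctl on Linux; absent if Yama is not enabled.
YAMA_PTRACE_SCOPE = Path("/proc/sys/kernel/yama/ptrace_scope")

# Meanings as described in the Linux kernel's Yama LSM documentation.
SCOPE_MEANINGS = {
    0: "classic: a process may attach to other processes running under the same uid",
    1: "restricted: only descendants, or processes that opted in via PR_SET_PTRACER, may be attached to",
    2: "admin-only: attaching requires CAP_SYS_PTRACE",
    3: "no attach: ptrace attach is disabled entirely",
}


def read_ptrace_scope(path: Path = YAMA_PTRACE_SCOPE) -> Optional[int]:
    """Return the current ptrace_scope value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe_ptrace_scope(scope: Optional[int]) -> str:
    """Map a ptrace_scope value to a short human-readable explanation."""
    if scope is None:
        return "Yama ptrace_scope not readable (Yama may be disabled, or this is not Linux)"
    return SCOPE_MEANINGS.get(scope, f"unknown value {scope}")


if __name__ == "__main__":
    scope = read_ptrace_scope()
    print(f"ptrace_scope = {scope}: {describe_ptrace_scope(scope)}")
```

If this reports a value of 1 or higher, or the job runs inside a container, then the ptrace restriction is at least a plausible reason for two sibling worker processes being unable to share a file descriptor this way. Whether that applies to this particular setup is an assumption that would need to be verified.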
