Commit 216ca2a

[sglang,rollout] fix: sglang port race condition (#3977)
This PR fixes a race condition when running verl with sglang at scale (> 200 nodes). The race condition happens when multiple sglang processes running on the same node try to listen on the same (randomly assigned) nccl_port. To avoid port conflicts, verl assigns each SGLang process's port number based on its unique rank id; however, this only applies to the http port, while the nccl_port remains unspecified and is therefore randomly assigned.

## Background

Each SGLang engine started by verl typically opens one or two sockets to listen for incoming connections. The first socket is for http requests, and its listening port is assigned via the `--port` parameter. The second socket may be opened to perform the NCCL rendezvous and can be assigned via the `--nccl_port` parameter. If the nccl_port is not assigned, SGLang selects one at random. This [code](https://github.com/sgl-project/sglang/blob/1ed1abfd456c58697e72adc2dd831936dc188fec/python/sglang/srt/server_args.py#L3924) is responsible for randomly assigning the nccl port:

```
nccl_port = server_args.port + random.randint(100, 1000)
```

Unfortunately, with hundreds of SGLang processes running concurrently, a race condition can occur in which two or more processes select the same port.

## Fix

This PR fixes the nccl_port race condition by pre-assigning a unique port based on the rank id. Hence, for each sglang process, both the http port and the nccl port are uniquely assigned based on the rank id.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
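To see why this bites at scale, here is a back-of-the-envelope birthday-problem estimate of the collision probability (not part of the PR; the figure of 8 SGLang processes per node and the assumption that their ~900-port offset windows fully overlap are illustrative assumptions):

```python
import math

def nccl_port_collision_prob(procs_per_node: int, candidate_ports: int = 901) -> float:
    """Birthday-problem estimate: probability that at least two processes on one
    node draw the same nccl_port, assuming each draws uniformly and independently
    from the same window (random.randint(100, 1000) has 901 possible offsets)."""
    return 1.0 - math.exp(-procs_per_node * (procs_per_node - 1) / (2.0 * candidate_ports))

per_node = nccl_port_collision_prob(8)       # roughly 3% with 8 processes per node
cluster = 1.0 - (1.0 - per_node) ** 200      # at least one collision across 200 nodes
print(f"per node: {per_node:.1%}, 200-node job: {cluster:.1%}")
```

A few percent per node compounds into a near-certain failure somewhere in a 200-node job, which matches the scale at which the issue was reported.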
1 parent b49178f commit 216ca2a

File tree

1 file changed: +3 −2 lines

verl/workers/rollout/sglang_rollout/sglang_rollout.py

Lines changed: 3 additions & 2 deletions
```diff
@@ -434,8 +434,8 @@ def _init_inference_engine(self, trust_remote_code, actor_module, port):
         if self.config.mode == "async" and not self.config.skip_tokenizer_init:
             raise ValueError("async mode requires skip_tokenizer_init to be True")
         backend = attention_backend if attention_backend is not None else "fa3"
+        sglang_port = int(os.getenv("SGLANG_PORT", "30000")) + (dist.get_rank() * 2)
         if effective_first:
-            rank = dist.get_rank()
             os.environ["SGLANG_BLOCK_NONZERO_RANK_CHILDREN"] = "0"
             args = {
                 "model_path": actor_module,
@@ -453,7 +453,8 @@ def _init_inference_engine(self, trust_remote_code, actor_module, port):
                 "max_running_requests": max_running_requests,
                 # NOTE(linjunrong): add rank to prevent SGLang generate same port inside PortArgs.init_new
                 # when random.seed is being set during training
-                "port": 30000 + rank,
+                "port": sglang_port,
+                "nccl_port": sglang_port + 1,
                 # NOTE(Chenyang): if you want to debug the SGLang engine output
                 # please set the following parameters
                 # Otherwise, it will make the engine run too slow
```
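As a quick illustration of the resulting layout (a sketch, not code from the PR; the base of 30000 mirrors the `SGLANG_PORT` default in the diff and the rank values are made up), the stride of 2 keeps every rank's http and nccl ports disjoint from every other rank's:

```python
# Sketch of the rank-based port layout produced by the change above.
# In the real code the rank comes from dist.get_rank() and the base from SGLANG_PORT.
base = 30000
for rank in range(4):
    http_port = base + rank * 2   # passed as "port"
    nccl_port = http_port + 1     # passed as "nccl_port"
    print(f"rank {rank}: http={http_port}, nccl={nccl_port}")
# rank 0: http=30000, nccl=30001
# rank 1: http=30002, nccl=30003
# rank 2: http=30004, nccl=30005
# rank 3: http=30006, nccl=30007
```

Because ranks are unique, the assigned ports are unique as well, eliminating the random component that caused the conflicts.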
