Commit 216ca2a

[sglang,rollout] fix: sglang port race condition (#3977)
This PR fixes a race condition when running verl with sglang at scale (> 200 nodes). The race condition happens when multiple sglang processes running on the same node try to listen on the same (randomly assigned) nccl_port. To avoid port conflicts, verl assigns each SGLang process's port number based on its unique rank id; however, this only applies to the http port, while the nccl_port remains unspecified and is therefore randomly assigned.

## Background

Each SGLang engine started by verl typically opens one or two sockets to listen for incoming connections. The first socket is for http requests, and its listening port is assigned via the `--port` parameter. The second socket may be opened to perform the NCCL rendezvous and can be assigned via the `--nccl_port` parameter. If the nccl_port is not assigned, SGLang selects one at random. This [code](https://github.com/sgl-project/sglang/blob/1ed1abfd456c58697e72adc2dd831936dc188fec/python/sglang/srt/server_args.py#L3924) is responsible for randomly assigning the nccl port:

```
nccl_port = server_args.port + random.randint(100, 1000)
```

Unfortunately, with hundreds of SGLang processes running concurrently, a race condition can occur in which two or more processes select the same port.

## Fix

This PR fixes the nccl_port race condition by pre-assigning a unique port based on the rank id. Hence, for each sglang process, both the http port and the nccl port are uniquely assigned based on the rank id.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
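To see why this bites at scale, here is a back-of-the-envelope birthday-problem estimate of the collision probability (not part of the PR; the figure of 8 SGLang processes per node and the assumption that their ~900-port offset windows fully overlap are illustrative assumptions):

```python
import math

def nccl_port_collision_prob(procs_per_node: int, candidate_ports: int = 901) -> float:
    """Birthday-problem estimate: probability that at least two processes on one
    node draw the same nccl_port, assuming each draws uniformly and independently
    from the same window (random.randint(100, 1000) has 901 possible offsets)."""
    return 1.0 - math.exp(-procs_per_node * (procs_per_node - 1) / (2.0 * candidate_ports))

per_node = nccl_port_collision_prob(8)       # roughly 3% with 8 processes per node
cluster = 1.0 - (1.0 - per_node) ** 200      # at least one collision across 200 nodes
print(f"per node: {per_node:.1%}, 200-node job: {cluster:.1%}")
```

A few percent per node compounds into a near-certain failure somewhere in a 200-node job, which matches the scale at which the issue was reported.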
1 parent b49178f commit 216ca2a

File tree

1 file changed: +3 −2 lines

verl/workers/rollout/sglang_rollout/sglang_rollout.py

Lines changed: 3 additions & 2 deletions
```diff
@@ -434,8 +434,8 @@ def _init_inference_engine(self, trust_remote_code, actor_module, port):
         if self.config.mode == "async" and not self.config.skip_tokenizer_init:
             raise ValueError("async mode requires skip_tokenizer_init to be True")
         backend = attention_backend if attention_backend is not None else "fa3"
+        sglang_port = int(os.getenv("SGLANG_PORT", "30000")) + (dist.get_rank() * 2)
         if effective_first:
-            rank = dist.get_rank()
             os.environ["SGLANG_BLOCK_NONZERO_RANK_CHILDREN"] = "0"
             args = {
                 "model_path": actor_module,
@@ -453,7 +453,8 @@ def _init_inference_engine(self, trust_remote_code, actor_module, port):
                 "max_running_requests": max_running_requests,
                 # NOTE(linjunrong): add rank to prevent SGLang generate same port inside PortArgs.init_new
                 # when random.seed is being set during training
-                "port": 30000 + rank,
+                "port": sglang_port,
+                "nccl_port": sglang_port + 1,
                 # NOTE(Chenyang): if you want to debug the SGLang engine output
                 # please set the following parameters
                 # Otherwise, it will make the engine run too slow
```
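As a quick illustration of the resulting layout (a sketch, not code from the PR; the base of 30000 mirrors the `SGLANG_PORT` default in the diff and the rank values are made up), the stride of 2 keeps every rank's http and nccl ports disjoint from every other rank's:

```python
# Sketch of the rank-based port layout produced by the change above.
# In the real code the rank comes from dist.get_rank() and the base from SGLANG_PORT.
base = 30000
for rank in range(4):
    http_port = base + rank * 2   # passed as "port"
    nccl_port = http_port + 1     # passed as "nccl_port"
    print(f"rank {rank}: http={http_port}, nccl={nccl_port}")
# rank 0: http=30000, nccl=30001
# rank 1: http=30002, nccl=30003
# rank 2: http=30004, nccl=30005
# rank 3: http=30006, nccl=30007
```

Because ranks are unique, the assigned ports are unique as well, eliminating the random component that caused the conflicts.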
