Contributor

@theely theely commented Oct 31, 2025

This PR fixes a race condition when running verl with sglang at scale (> 200 nodes).

The race condition happens when multiple sglang processes running on the same node try to listen on the same (randomly assigned) nccl_port.

To avoid port conflicts, verl assigns each SGLang process a port number based on its unique rank id. However, this only applies to the HTTP port; the nccl_port remains unspecified and is therefore randomly assigned.

Background

Each SGLang engine started by verl typically opens one or two sockets to listen for incoming connections. The first socket is for HTTP requests, and its listening port is assigned via the --port parameter. The second socket might be opened to perform the NCCL rendezvous, and its port can be assigned via the --nccl_port parameter.

If the nccl_port is not assigned, SGLang randomly selects one. This code is responsible for randomly assigning the NCCL port:

nccl_port = server_args.port + random.randint(100, 1000)

Unfortunately, with hundreds of SGLang processes running concurrently, a race condition might occur in which two or more processes on the same node select the same port.
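To get a rough feel for why this shows up at scale, here is a back-of-the-envelope sketch. It is a simplified model, not a measurement: it assumes every process on a node draws its offset uniformly from the same 901 values produced by random.randint(100, 1000), and it ignores both the differing base ports and any port-availability check SGLang may perform. The function names and the figure of 8 processes per node are illustrative assumptions.

```python
def collision_probability(num_processes: int, num_choices: int = 901) -> float:
    """Birthday-problem estimate: chance that at least two of
    `num_processes` uniform draws from `num_choices` values coincide.

    901 is the number of integers in random.randint(100, 1000), inclusive.
    """
    p_all_distinct = 1.0
    for i in range(num_processes):
        # The i-th process must avoid the i values already taken.
        p_all_distinct *= max(num_choices - i, 0) / num_choices
    return 1.0 - p_all_distinct


def any_node_collides(num_nodes: int, procs_per_node: int) -> float:
    """Chance that at least one node sees a collision, assuming nodes are independent."""
    p_node = collision_probability(procs_per_node)
    return 1.0 - (1.0 - p_node) ** num_nodes


if __name__ == "__main__":
    # Assumed layout: 8 SGLang processes per node, 200 nodes.
    print(f"per node:  {collision_probability(8):.4f}")
    print(f"200 nodes: {any_node_collides(200, 8):.4f}")
```

Under these assumptions a single node collides only about 3% of the time, but across 200 nodes at least one collision becomes very likely, which is consistent with the problem appearing only at large scale.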

Fix

This PR fixes the nccl_port race condition by pre-assigning a unique port based on the rank id. Hence, for each SGLang process, both the HTTP port and the NCCL port are uniquely assigned based on the rank id.
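The idea can be sketched as follows. This is a minimal illustration of rank-based port pairing, not the code merged in this PR: the function name ports_for_rank, the base port of 30000, and the choice of adjacent ports are all assumptions made for the example.

```python
def ports_for_rank(rank: int, base_port: int = 30000) -> tuple[int, int]:
    """Return a (http_port, nccl_port) pair derived from the rank id.

    Each rank owns two consecutive ports, so pairs from different ranks
    never overlap as long as all ranks share the same base_port.
    Hypothetical helper; base_port=30000 is an arbitrary example value.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    http_port = base_port + 2 * rank  # even offset: HTTP port
    nccl_port = http_port + 1         # odd offset: NCCL rendezvous port
    return http_port, nccl_port


if __name__ == "__main__":
    for rank in range(4):
        http_port, nccl_port = ports_for_rank(rank)
        print(f"rank {rank}: --port {http_port}, nccl_port {nccl_port}")
```

Because the mapping is deterministic, no two ranks can be handed the same port, and nothing is left to random selection at startup.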

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly identifies and aims to fix a race condition for nccl_port assignment in sglang by pre-assigning unique ports based on the process rank. The approach of allocating a pair of ports per rank is sound. However, I've found a critical issue in the implementation that will cause a TypeError when the SGLANG_PORT environment variable is set, which I've detailed in a specific comment with a suggested fix.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@wuxibin89 wuxibin89 changed the title Fix sglang port race condition [sglang,rollout] fix: sglang port race condition Nov 3, 2025
@wuxibin89
Collaborator

@theely Please format code with: https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting

@wuxibin89 wuxibin89 merged commit 216ca2a into volcengine:main Nov 4, 2025
71 of 73 checks passed