[sglang,rollout] fix: sglang port race condition #3977
                
Merged, +3 −2
This PR fixes a race condition when running verl with sglang at scale (> 200 nodes).
The race condition happens when multiple sglang processes running on the same node try to listen on the same (randomly assigned) nccl_port.
To avoid port conflicts, verl assigns the port number of each SGLang process based on its unique rank id. However, this only applies to the http port; the nccl_port is left unspecified and is therefore randomly assigned.
Background
Each SGLang engine started by verl typically opens one or two sockets to listen for incoming connections. The first socket serves http requests, and its listening port is assigned via the --port parameter. The second socket may be opened to perform the NCCL rendezvous, and its port can be assigned via the --nccl_port parameter.
If the nccl_port is not assigned, SGLang randomly selects one. That random selection happens inside SGLang itself.
Unfortunately, with hundreds of SGLang processes running concurrently, a race condition can occur in which two or more processes on the same node select the same port.
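To illustrate why a random pick is racy, here is a minimal sketch of a check-then-bind pattern. It is not SGLang's actual code: the helper names (`is_port_free`, `pick_candidate`, `simulate_race`) are hypothetical, and the two "processes" are simulated sequentially in one process. Both see the candidate port as free, but only the first one can actually listen on it.

```python
import socket

HOST = "127.0.0.1"


def is_port_free(port):
    """Probe a port by binding to it briefly (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((HOST, port))
            return True
        except OSError:
            return False


def pick_candidate():
    """Ask the OS for a currently unused port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((HOST, 0))
        return s.getsockname()[1]


def simulate_race():
    """Two simulated processes check the same port before either binds."""
    candidate = pick_candidate()

    # Both "processes" run their availability check first.
    a_saw_free = is_port_free(candidate)
    b_saw_free = is_port_free(candidate)

    # Process A now binds and listens on the port.
    server_a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_a.bind((HOST, candidate))
    server_a.listen()
    try:
        # Process B acts on its stale check and tries the same port.
        server_b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            server_b.bind((HOST, candidate))
            b_bound = True
        except OSError:
            b_bound = False
        finally:
            server_b.close()
    finally:
        server_a.close()
    return a_saw_free, b_saw_free, b_bound


if __name__ == "__main__":
    print(simulate_race())
```

The window between the check and the bind is what makes the collision possible; the more processes start at once on a node, the more likely two of them land in that window with the same port.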
Fix
This PR fixes the nccl_port race condition by pre-assigning a unique port based on the rank id. As a result, for each sglang process both the http port and the nccl port are uniquely assigned based on the rank id.
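The idea can be sketched as follows. This is an illustration only, not the diff in this PR: the function name `ports_for_rank` and the base ports `30000` and `40000` are assumptions chosen for the example, and the real offsets used by verl may differ.

```python
def ports_for_rank(rank, http_base=30000, nccl_base=40000):
    """Derive both ports deterministically from the rank id.

    Hypothetical helper: the base ports are illustrative values, not
    the ones verl uses.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    http_port = http_base + rank
    nccl_port = nccl_base + rank
    # Keep the two ranges disjoint and inside the valid TCP port range.
    if http_port >= nccl_base or nccl_port > 65535:
        raise ValueError(f"rank {rank} does not fit in the port ranges")
    return http_port, nccl_port


if __name__ == "__main__":
    for rank in range(4):
        http_port, nccl_port = ports_for_rank(rank)
        print(f"rank={rank} http={http_port} nccl={nccl_port}")
```

Because each port is a pure function of the rank, no two processes with different ranks can pick the same port, and no availability check is needed at startup.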