Closed
Labels
tpu (Related to Google TPUs)
Description
In the Ray GPU executor, there are these lines:
vllm/vllm/executor/ray_gpu_executor.py
Lines 175 to 191 in 7025b11
def sort_by_driver_then_worker_ip(worker):
    """
    Sort the workers based on 3 properties:
    1. If the worker is on the same node as the driver (vllm engine),
        it should be placed first.
    2. Then, if the worker is on a node with fewer workers, it should
        be placed first.
    3. Finally, if the work is on a node with smaller IP address, it
        should be placed first.
    """
    ip = ray.get(worker.get_node_ip.remote())
    return (ip != driver_ip, ip_counts[ip], ip)

# After sorting, the workers on the same node will be
# close to each other, and the workers on the driver
# node will be placed first.
self.workers = sorted(self.workers, key=sort_by_driver_then_worker_ip)
These lines make sure the worker index aligns with machine boundaries. You might need the same thing for TPU, too. Otherwise local ranks can be wrong: for example, ranks 0, 1, 2, 4 end up on one node, and ranks 3, 5, 6, 7 on another node.
Originally posted by @youkaichao in #7457 (comment)
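To illustrate the problem, here is a minimal sketch of the same sort key applied to plain IP strings, without Ray. The helper name `sort_workers_by_node` and the IP addresses are hypothetical and only reproduce the example above; this is not the executor's actual code. Note that, as in the quoted snippet, IPs are compared as strings.

```python
from collections import Counter


def sort_workers_by_node(worker_ips, driver_ip):
    """Return worker indices ordered so workers on the same node are adjacent.

    Hypothetical helper: mirrors the (ip != driver_ip, ip_counts[ip], ip)
    sort key from the quoted snippet, using a list of IP strings in place
    of Ray worker handles.
    """
    ip_counts = Counter(worker_ips)

    def key(idx):
        ip = worker_ips[idx]
        # Driver node first, then nodes with fewer workers, then smaller IP.
        return (ip != driver_ip, ip_counts[ip], ip)

    return sorted(range(len(worker_ips)), key=key)


# Made-up placement matching the example: workers 0, 1, 2, 4 share one node,
# workers 3, 5, 6, 7 share another.
worker_ips = [
    "10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2",
    "10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.2",
]

order = sort_workers_by_node(worker_ips, driver_ip="10.0.0.1")
sorted_ips = [worker_ips[i] for i in order]
print(order)
print(sorted_ips)
```

After sorting, new ranks 0 to 3 all sit on the driver node and ranks 4 to 7 on the other node, so a local rank derived from the global rank lines up with the machine boundary.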