[TPU] Make sure worker index aligns with node boundary #7485

@WoosukKwon

In the Ray GPU executor, there are these lines:

def sort_by_driver_then_worker_ip(worker):
    """
    Sort the workers based on 3 properties:
    1. If the worker is on the same node as the driver (vllm engine),
        it should be placed first.
    2. Then, if the worker is on a node with fewer workers, it should
        be placed first.
    3. Finally, if the worker is on a node with smaller IP address, it
        should be placed first.
    """
    ip = ray.get(worker.get_node_ip.remote())
    return (ip != driver_ip, ip_counts[ip], ip)

# After sorting, the workers on the same node will be
# close to each other, and the workers on the driver
# node will be placed first.
self.workers = sorted(self.workers, key=sort_by_driver_then_worker_ip)

These lines make sure the worker index aligns with the machine boundary. You might need the same thing for TPU, too. Otherwise local ranks can be wrong: for example, ranks 0, 1, 2, 4 could end up on one node and ranks 3, 5, 6, 7 on another.
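
To make the failure mode concrete, here is a rough, Ray-free sketch of the same sort key applied to the placement from the example. The helper names (`sort_workers_by_node`, `is_node_aligned`), the IP addresses, and the assumption of 4 workers per node are all made up for illustration; this is not the executor's actual code, only the same three-part key applied to plain lists.

```python
from collections import Counter


def sort_workers_by_node(worker_ips, driver_ip):
    """Return worker indices reordered with the same key as the GPU executor.

    worker_ips[i] is the node IP of worker i (hypothetical input; the real
    executor gets this through a remote call on each Ray worker).
    """
    ip_counts = Counter(worker_ips)

    def key(idx):
        ip = worker_ips[idx]
        # Driver node first, then nodes with fewer workers, then smaller IP
        # (compared as strings, as in the quoted snippet).
        return (ip != driver_ip, ip_counts[ip], ip)

    return sorted(range(len(worker_ips)), key=key)


def is_node_aligned(worker_ips, workers_per_node):
    """True if every consecutive block of workers_per_node ranks shares one node."""
    return all(
        len(set(worker_ips[i:i + workers_per_node])) == 1
        for i in range(0, len(worker_ips), workers_per_node)
    )


# Placement from the example: ranks 0, 1, 2, 4 on one node, 3, 5, 6, 7 on another.
node_a, node_b = "10.0.0.1", "10.0.0.2"  # made-up addresses
worker_ips = [node_a, node_a, node_a, node_b, node_a, node_b, node_b, node_b]

order = sort_workers_by_node(worker_ips, driver_ip=node_a)
sorted_ips = [worker_ips[i] for i in order]

print(is_node_aligned(worker_ips, 4))  # False: rank 3 sits on the other node
print(order)                           # [0, 1, 2, 4, 3, 5, 6, 7]
print(is_node_aligned(sorted_ips, 4))  # True: ranks 0-3 and 4-7 each share a node
```

With the sorted order, computing a local rank as `rank % workers_per_node` gives 0 to 3 on each node, which is only valid when every node hosts the same number of workers.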

Originally posted by @youkaichao in #7457 (comment)
