[TPU] Align worker index with node boundary #7932
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
Is it enough to remove the following lines? vllm/vllm/worker/tpu_model_runner.py, lines 132 to 135 in 9c71c97
@youkaichao I tried it, but I still got gibberish results without the patch. I think this is because the rank IDs used in all-gather are assigned by the XLA runtime, regardless of IPs.
I will not block this PR for that reason, but it would be good to know in the future what rank is used by XLA.
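For context, here is a minimal sketch of the general idea behind aligning worker indices with node boundaries. It is an illustration only, under the assumption that each worker's node IP is known; the helper name is made up and this is not vLLM's actual implementation.

# Sketch: assign ranks so that workers sharing a node IP get contiguous ranks.
from collections import defaultdict

def align_ranks_with_nodes(worker_ips):
    """Return a rank per worker such that workers on the same node
    (same IP) occupy a contiguous block of ranks."""
    by_node = defaultdict(list)
    for idx, ip in enumerate(worker_ips):
        by_node[ip].append(idx)

    ranks = [0] * len(worker_ips)
    next_rank = 0
    for ip in sorted(by_node):          # deterministic node ordering
        for idx in by_node[ip]:
            ranks[idx] = next_rank      # contiguous ranks within each node
            next_rank += 1
    return ranks

# Example: two nodes with two workers each, interleaved in launch order.
print(align_ranks_with_nodes(["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.2"]))
# -> [0, 2, 1, 3]: workers 0 and 2 (same node) now hold ranks 0 and 1.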
LGTM, please fix the format
Hi, I still get this error due to the mismatched global world size even on this branch (tpu-rank). I use the commands below to build a new Docker image based on Dockerfile.tpu with the code of this branch and bring up the Ray cluster:

# on the main node; the main VM's IP:PORT is 10.130.0.60:6379
sudo docker run -t -d -e HF_TOKEN=INPUT_YOUR_TOKEN --privileged --net host --shm-size=16G -it vllm ray start --head --block

# on the other nodes
sudo docker run -t -d -e VLLM_HOST_IP=10.130.0.60 -e HF_TOKEN=YOUR_TOKEN --privileged --net host --shm-size=16G -it vllm ray start --address 10.130.0.60:6379 --block

After that I check ray status to confirm that all the TPU cores are gathered in Ray (32 cores, since I use TPU v4-64 for this experiment); a programmatic check is sketched below.
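As an aside, the same check can be done from Python once the containers are up. This is just a generic Ray query, assuming the TPU hosts register a "TPU" resource; it is not part of this PR.

# Sketch: attach to the running Ray cluster and verify the visible TPU count
# before launching the vLLM server.
import ray

ray.init(address="auto")            # connect to the existing cluster
resources = ray.cluster_resources()
print(resources)                    # expect something like {'TPU': 32.0, ...}
assert resources.get("TPU", 0) >= 32, "not all TPU hosts have joined the Ray cluster"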
However, when I run the command to start the vLLM server, this error comes up.
Because of the code here: https://github.com/vllm-project/vllm/blob/tpu-rank/vllm/distributed/device_communicators/tpu_communicator.py#L28, it fails inside TpuCommunicator:

class TpuCommunicator:
    ...
    # NOTE(woosuk): When using TP > 1 on TPUs, every TPU on the same node
    # must be used together. Therefore, the local rank and world size can
    # be simply calculated as follows.
    global_rank = dist.get_rank(group)  # <- this global rank will be 0 on the main node
    global_world_size = dist.get_world_size(group)  # <- this should be 32, but it is 4 (a single node's TPU cores)
    num_nodes = len(ray.nodes())  # <- correct value, 8 for TPU v4-64
    local_world_size = global_world_size // num_nodes  # <- 4 // 8 == 0, which makes the next line raise ZeroDivisionError
    local_rank = global_rank % local_world_size
    pjrt.initialize_multiprocess(local_rank, local_world_size)
    xr._init_world_size_ordinal()

I think PyTorch distributed ...
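The failing arithmetic can be reproduced in isolation with the values reported above (group world size 4 instead of 32, and 8 Ray nodes); this is a plain-Python illustration of the error, not vLLM code.

# Reproduction of the reported failure with plain integers.
global_rank = 0
global_world_size = 4      # should be 32 on a TPU v4-64 slice, but the group spans only one node
num_nodes = 8              # len(ray.nodes()) reports the correct value

local_world_size = global_world_size // num_nodes   # 4 // 8 == 0
try:
    local_rank = global_rank % local_world_size      # modulo by zero
except ZeroDivisionError as err:
    print("ZeroDivisionError:", err)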
@WoosukKwon I think this problem comes from these lines: vllm/vllm/distributed/parallel_state.py Lines 121 to 132 in c166e7e
for ranks in group_ranks:
    device_group = torch.distributed.new_group(
        ranks, backend=torch_distributed_backend)
    # a group with `gloo` backend, to allow direct coordination between
    # processes through the CPU.
    cpu_group = torch.distributed.new_group(ranks, backend="gloo")
    if self.rank in ranks:
        self.ranks = ranks
        self.world_size = len(ranks)
        self.rank_in_group = ranks.index(self.rank)
        self.device_group = device_group
        self.cpu_group = cpu_group

I think this code is supposed to create a local CPU group for optimized ranks (such as TP within a single node), but it causes cpu_group to contain only 4 processes (= one node's TPU cores), leading to the ZeroDivisionError in the code I mentioned above. Is there a suggested way to work around this issue?
#7929 fixes the CI failure related to mypy; please merge from main.
@Beomi Thanks for trying out this PR. Currently, vLLM's TPU backend does not support PP. Could you please use a smaller TPU pod and retry with the updated main branch?
@WoosukKwon Thanks for the clarification! It seems like TP over all nodes works :)
@WoosukKwon BTW, there is a weird multinode issue: the code works well with TPU v4-8/v4-16/v4-32, but it fails to launch a worker on one node when I run on v4-64. I'll open a separate issue for it :)
[TPU] Align worker index with node boundary (vllm-project#7932)
TPUs have very good cross-node interconnect. I don't think it needs PP.
Signed-off-by: Alvant <[email protected]>
Signed-off-by: LeiWang1999 <[email protected]>
Fixes #7485