Commit c151a0c

youkaichao authored and LeiWang1999 committed
[doc] update doc on testing and debugging (vllm-project#8514)
Signed-off-by: LeiWang1999 <[email protected]>
1 parent 894512e commit c151a0c

docs/source/getting_started/debugging.rst

Lines changed: 7 additions & 0 deletions
@@ -98,6 +98,13 @@ Here are some common issues that can cause hangs:
If the script runs successfully, you should see the message ``sanity check is successful!``.

+Note that a multi-node environment is more complicated than a single-node one. If you see errors such as ``torch.distributed.DistNetworkError``, the network or DNS setup is likely incorrect. In that case, you can manually assign the node rank and specify the master node's IP address via command-line arguments:
+
+- On the first node, run ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py``.
+- On the second node, run ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py``.
+
+Adjust ``--nproc-per-node``, ``--nnodes``, and ``--node-rank`` according to your setup. The only difference between nodes is the ``--node-rank`` value: each node must run the command with its own rank.
+
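As a rough illustration of the manual setup described in this hunk, the Python sketch below checks that the master address resolves to an IPv4 address and then prints the ``torchrun`` command for each node rank. The helper names ``resolve_ipv4`` and ``build_torchrun_command`` are hypothetical (they are not part of vLLM or PyTorch), and the defaults simply mirror the two-node, two-process example in the added text. Only the standard-library ``socket`` module is used.

```python
import socket


def resolve_ipv4(host):
    """Return the sorted IPv4 addresses that `host` resolves to.

    Raises socket.gaierror if the name cannot be resolved, which is one
    way a broken DNS setup can show up before torchrun is even started.
    """
    infos = socket.getaddrinfo(
        host, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def build_torchrun_command(node_rank, master_addr, nnodes=2,
                           nproc_per_node=2, script="test.py"):
    """Build the per-node command string (hypothetical helper).

    The flags are the ones used in the documentation text; the defaults
    are assumptions taken from its two-node example.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError(
            f"node_rank must be in [0, {nnodes}), got {node_rank}"
        )
    return (
        f"NCCL_DEBUG=TRACE torchrun --nnodes {nnodes} "
        f"--nproc-per-node={nproc_per_node} --node-rank {node_rank} "
        f"--master_addr {master_addr} {script}"
    )


if __name__ == "__main__":
    # "$MASTER_ADDR" is left as a shell variable, as in the documentation.
    for rank in range(2):
        print(build_torchrun_command(rank, "$MASTER_ADDR"))
```

Calling ``resolve_ipv4`` with the master node's hostname on every node, before launching, gives a quick way to see whether all nodes agree on the address; the commands themselves still have to be run by hand on the respective nodes.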
If the problem persists, feel free to `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_, with a detailed description of the issue, your environment, and the logs.

Some known issues:
