Description
Your current environment
The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H800
GPU 1: NVIDIA H800
GPU 2: NVIDIA H800
GPU 3: NVIDIA H800
GPU 4: NVIDIA H800
GPU 5: NVIDIA H800
GPU 6: NVIDIA H800
GPU 7: NVIDIA H800
Nvidia driver version: 535.54.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 200
On-line CPU(s) list: 0-199
Thread(s) per core: 2
Core(s) per socket: 50
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8480+
Stepping: 6
CPU MHz: 2000.000
BogoMIPS: 4000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 4.7 MiB
L1i cache: 3.1 MiB
L2 cache: 200 MiB
L3 cache: 210 MiB
NUMA node0 CPU(s): 0-99
NUMA node1 CPU(s): 100-199
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb ibrs_enhanced fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq movdiri movdir64b fsrm arch_capabilities
Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.0@COMMIT_HASH_PLACEHOLDER
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV8   NV8   NV8   NV8   NV8   NV8   NV8   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-99          0              N/A
GPU1  NV8   X     NV8   NV8   NV8   NV8   NV8   NV8   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   0-99          0              N/A
GPU2  NV8   NV8   X     NV8   NV8   NV8   NV8   NV8   NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   0-99          0              N/A
GPU3  NV8   NV8   NV8   X     NV8   NV8   NV8   NV8   NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   0-99          0              N/A
GPU4  NV8   NV8   NV8   NV8   X     NV8   NV8   NV8   SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  100-199       1              N/A
GPU5  NV8   NV8   NV8   NV8   NV8   X     NV8   NV8   SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  100-199       1              N/A
GPU6  NV8   NV8   NV8   NV8   NV8   NV8   X     NV8   SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  100-199       1              N/A
GPU7  NV8   NV8   NV8   NV8   NV8   NV8   NV8   X     SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   100-199       1              N/A
NIC0  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC1  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE  SYS   SYS   SYS   SYS
NIC2  NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   NODE  NODE  X     NODE  SYS   SYS   SYS   SYS
NIC3  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     SYS   SYS   SYS   SYS
NIC4  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE
NIC5  SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE
NIC6  SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   NODE  NODE  X     NODE
NIC7  SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
NIC4: mlx5_bond_4
NIC5: mlx5_bond_5
NIC6: mlx5_bond_6
NIC7: mlx5_bond_7
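The topology matrix above can be hard to read at a glance. As a rough aid, the sketch below lists the NIC that sits closest to each GPU (link type `PIX`). The GPU-to-NIC cells are copied from the matrix; the helper name `closest_nics` and the dictionary layout are illustrative choices, not part of any tool's output.

```python
# GPU -> NIC link types, copied from the GPU rows of the topology matrix
# (NIC0..NIC7 columns only). Helper names here are hypothetical.
GPU_TO_NIC_LINKS = {
    "GPU0": "PIX NODE NODE NODE SYS SYS SYS SYS",
    "GPU1": "NODE PIX NODE NODE SYS SYS SYS SYS",
    "GPU2": "NODE NODE PIX NODE SYS SYS SYS SYS",
    "GPU3": "NODE NODE NODE PIX SYS SYS SYS SYS",
    "GPU4": "SYS SYS SYS SYS PIX NODE NODE NODE",
    "GPU5": "SYS SYS SYS SYS NODE PIX NODE NODE",
    "GPU6": "SYS SYS SYS SYS NODE NODE PIX NODE",
    "GPU7": "SYS SYS SYS SYS NODE NODE NODE PIX",
}

# NIC names as given in the NIC legend.
NIC_NAMES = [f"mlx5_bond_{i}" for i in range(8)]


def closest_nics(links: str, wanted: str = "PIX") -> list[str]:
    """Return the NICs whose link type to the GPU equals `wanted`."""
    return [NIC_NAMES[i] for i, link in enumerate(links.split()) if link == wanted]


for gpu, links in GPU_TO_NIC_LINKS.items():
    print(gpu, closest_nics(links))
```

On this machine every GPU has exactly one `PIX` NIC, and the four NICs on the other NUMA node are reachable only over `SYS`.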
Model Input Dumps
No response
🐛 Describe the bug
Command used:
export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
export VLLM_TRACE_FUNCTION=1
python3 api_server.py \
    --model my_model \
    -tp 8 \
    -pp 2 \
    --enforce-eager \
    --max-num-seqs=32 \
    --dtype=bfloat16 \
    --worker-use-ray \
    --gpu-memory-utilization 0.8
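With `-tp 8 -pp 2` the job needs 16 GPUs, so on 8-GPU machines it has to span two nodes. The sketch below shows how the ranks might be grouped. It assumes tensor-parallel ranks are contiguous (TP innermost) and that each node holds 8 consecutive ranks; both are assumptions for illustration, and the function names are hypothetical rather than vLLM APIs.

```python
# Sketch only: assumes TP-innermost rank ordering and 8 consecutive ranks per node.
def parallel_groups(tp: int, pp: int) -> tuple[list[list[int]], list[list[int]]]:
    """Return (tp_groups, pp_groups), each a list of global-rank lists."""
    world_size = tp * pp
    # Each pipeline stage owns one contiguous block of tp ranks.
    tp_groups = [list(range(stage * tp, (stage + 1) * tp)) for stage in range(pp)]
    # A pipeline group links the same TP position across all stages.
    pp_groups = [list(range(pos, world_size, tp)) for pos in range(tp)]
    return tp_groups, pp_groups


def node_of(rank: int, gpus_per_node: int = 8) -> int:
    """Node index of a global rank under the consecutive-placement assumption."""
    return rank // gpus_per_node


tp_groups, pp_groups = parallel_groups(tp=8, pp=2)
for group in pp_groups:
    nodes = sorted({node_of(rank) for rank in group})
    print(f"pp group {group} spans nodes {nodes}")
```

Under these assumptions each tensor-parallel group stays inside one node, while every pipeline-parallel group has two ranks on two different nodes. That would be consistent with the `nranks 2 nNodes 2 localRanks 1` communicator in the log that follows, where traffic goes over `NET/IB`.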
The logs hang at this point:
INFO 09-12 23:40:47 utils.py:977] Found nccl from library libnccl.so.2
INFO 09-12 23:40:47 pynccl.py:63] vLLM is using nccl==2.20.5
VM-160-69-tencentos:561:561 [0] NCCL INFO Using non-device net plugin version 0
VM-160-69-tencentos:561:561 [0] NCCL INFO Using network IB
VM-160-69-tencentos:561:561 [0] NCCL INFO comm 0x1476dca0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 23000 commId 0xa015ae7e0a50005d - Init START
(RayWorkerWrapper pid=248, ip=10.1.160.68) INFO 09-12 23:40:47 utils.py:977] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=248, ip=10.1.160.68) INFO 09-12 23:40:47 pynccl.py:63] vLLM is using nccl==2.20.5
VM-160-69-tencentos:561:561 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ffffffff,ffffffff,ffffffff
VM-160-69-tencentos:561:561 [0] NCCL INFO comm 0x1476dca0 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 00/02 : 0 1
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 01/02 : 0 1
VM-160-69-tencentos:561:561 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
VM-160-69-tencentos:561:561 [0] NCCL INFO P2P Chunksize set to 131072
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Connected all rings
VM-160-69-tencentos:561:561 [0] NCCL INFO Connected all trees
VM-160-69-tencentos:561:561 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
VM-160-69-tencentos:561:561 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
VM-160-69-tencentos:561:561 [0] NCCL INFO comm 0x1476dca0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 23000 commId 0xa015ae7e0a50005d - Init COMPLETE
VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=12 opcode=0 len=0 vendor err 129 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=5 opcode=0 len=0 vendor err 249 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=5 opcode=0 len=0 vendor err 244 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=5 opcode=0 len=0 vendor err 249 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]