Description
Your current environment
The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31
Python version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1066-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L4
GPU 1: NVIDIA L4
GPU 2: NVIDIA L4
GPU 3: NVIDIA L4
Nvidia driver version: 535.183.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7R13 Processor
Stepping: 1
CPU MHz: 2322.855
BogoMIPS: 5299.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 768 KiB
L1i cache: 768 KiB
L2 cache: 12 MiB
L3 cache: 96 MiB
NUMA node0 CPU(s): 0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.1.0
[pip3] torch==2.4.0
[pip3] torch-model-archiver==0.11.0
[pip3] torchaudio==2.3.0+cu121
[pip3] torchserve==0.11.0
[pip3] torchtext==0.18.0+cu121
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.0
[pip3] triton==3.0.0
[conda] mkl 2024.0.0 ha957f24_49657 conda-forge
[conda] mkl-include 2024.1.0 ha957f24_693 conda-forge
[conda] numpy 1.26.4 py311h64a7726_0 conda-forge
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] pyzmq 26.1.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torch-model-archiver 0.11.0 pypi_0 pypi
[conda] torchaudio 2.3.0+cu121 pypi_0 pypi
[conda] torchserve 0.11.0 pypi_0 pypi
[conda] torchtext 0.18.0+cu121 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.44.0 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@4db5176d9758b720b05460c50ace3c01026eb158
vLLM Build Flags:
CUDA Archs: 5.0 7.0+PTX 7.5+PTX 8.0 8.6 9.0; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS 0-47 0 N/A
GPU1 SYS X SYS SYS 0-47 0 N/A
GPU2 SYS SYS X SYS 0-47 0 N/A
GPU3 SYS SYS SYS X 0-47 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
I am trying to serve a Mixtral 8x7B AWQ model on a 4-GPU node and want to run two instances of the same model on that node. My understanding from the documentation was that I could use `--tensor-parallel-size 2 --pipeline-parallel-size 2` to distribute them. I expected this to produce two instances of the model, each using two GPUs, but instead it throws the error below.
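A minimal sketch of the launch command I am using (the model name and port here are illustrative placeholders, not my exact invocation):

```shell
# Serve an AWQ-quantized Mixtral with TP=2 and PP=2 on a 4-GPU node.
# Expectation was two 2-GPU instances; actual behavior is the crash below.
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
    --quantization awq \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    --port 8000
```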
(VllmWorkerProcess pid=138) INFO 08-13 15:22:51 model_runner.py:732] Loading model weights took 11.4953 GB
(VllmWorkerProcess pid=137) INFO 08-13 15:22:51 model_runner.py:732] Loading model weights took 11.4953 GB
(VllmWorkerProcess pid=139) INFO 08-13 15:22:52 model_runner.py:732] Loading model weights took 11.4953 GB
INFO 08-13 15:22:52 model_runner.py:732] Loading model weights took 11.4953 GB
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: 'MixtralForCausalLM' object has no attribute 'make_empty_intermediate_tensors', Traceback (most recent call last):
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] self.model_runner.profile_run()
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 936, in profile_run
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] intermediate_tensors = self.model.make_empty_intermediate_tensors(
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] AttributeError: 'MixtralForCausalLM' object has no attribute 'make_empty_intermediate_tensors'
(VllmWorkerProcess pid=139) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: 'MixtralForCausalLM' object has no attribute 'make_empty_intermediate_tensors', Traceback (most recent call last):
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] self.model_runner.profile_run()
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 936, in profile_run
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] intermediate_tensors = self.model.make_empty_intermediate_tensors(
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226] AttributeError: 'MixtralForCausalLM' object has no attribute 'make_empty_intermediate_tensors'
(VllmWorkerProcess pid=138) ERROR 08-13 15:22:52 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: list index out of range, Traceback (most recent call last):
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] self.model_runner.profile_run()
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 940, in profile_run
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1363, in execute_model
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/mixtral_quant.py", line 361, in forward
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/mixtral_quant.py", line 328, in forward
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] kv_caches[i], attn_metadata,
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] ~~~~~~~~~^^^
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226] IndexError: list index out of range
(VllmWorkerProcess pid=137) ERROR 08-13 15:23:05 multiproc_worker_utils.py:226]
/opt/conda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
ERROR 08-13 15:23:05 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 138 died, exit code: -15
INFO 08-13 15:23:05 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Process Process-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
engine = cls(
^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 263, in __init__
self._initialize_kv_caches()
File "/opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 362, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
self.model_runner.profile_run()
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 940, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1363, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/mixtral_quant.py", line 361, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/mixtral_quant.py", line 328, in forward
kv_caches[i], attn_metadata,
~~~~~~~~~^^^
IndexError: list index out of range
/opt/conda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '