
[Bug]: Mixed GPTQ quant not working with gfx1100,gfx1201 - Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix #696

@djdeniro

Description

Your current environment

Docker Compose YAML
services:
  vllm:
    tty: true
#    restart: unless-stopped
    ports:
      - 8007:8000
    image: rocm/vllm-dev:nightly_main_20250917
    shm_size: '256g'
    volumes:
     - /mnt/tb_disk/llm:/app/models
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
      - /dev/mem:/dev/mem
    environment:
      - ROCM_VISIBLE_DEVICES=0,6,1,5,3,4,2,7
      - HIP_VISIBLE_DEVICES=0,6,1,5,3,4,2,7
      - VLLM_USE_V1=1
      - VLLM_CUSTOM_OPS=all
      - NCCL_DEBUG=ERROR
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      - VLLM_ROCM_USE_AITER=0
      - VLLM_USE_AITER_TRITON_FUSED_SPLIT_QKV_ROPE=1
      - VLLM_USE_AITER_TRITON_FUSED_ADD_RMSNORM_PAD=1
      - VLLM_USE_AITER_TRITON_GEMM=1
      - VLLM_USE_AITER_UNIFIED_ATTENTION=1
      - VLLM_ROCM_USE_AITER_MHA=0
      - TRITON_HIP_PRESHUFFLE_SCALES=1
      - NCCL_P2P_DISABLE=1
      - SAFETENSORS_FAST_GPU=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - PYTORCH_TUNABLEOP_ENABLED
      - VLLM_DISABLE_COMPILE_CACHE=1
      - HSA_NO_SCRATCH_RECLAIM=1
    command: |
      sh -c '
      vllm serve /app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix \
        --served-model-name Qwen3-Coder-30B-A3B-Instruct-GPTQ-Int8  \
        --gpu-memory-utilization 0.965 \
        --max-model-len 65536  \
        --tensor-parallel-size 8  \
        --enable-auto-tool-choice \
        --disable-log-requests \
        --tool-call-parser qwen3_coder \
        --max-num-seqs 8 \
        --swap-space 4 \
        --trust-remote-code
       '
volumes: {}
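
GPU visibility and ordering inside the container can be sanity-checked with a short PyTorch snippet before starting the server (a minimal sketch; gcnArchName is what ROCm builds of PyTorch report for gfx1100/gfx1201, hence the getattr guard):

# check_devices.py - confirm how many GPUs the container sees and in what order
# (minimal sketch; torch.cuda maps to HIP devices on ROCm builds of PyTorch)
import torch

if not torch.cuda.is_available():
    raise SystemExit("no HIP devices visible - check ROCM_VISIBLE_DEVICES / HIP_VISIBLE_DEVICES")

for idx in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(idx)
    arch = getattr(props, "gcnArchName", "unknown")
    print(f"device {idx}: {props.name} ({arch}), {props.total_memory / 2**30:.1f} GiB")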

🐛 Describe the bug

A plain GPTQ-Int4 quant of this model loads and runs fine, but this mixed Int4/Int8 checkpoint fails to load with vLLM in Docker.
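
In a GPTQ checkpoint the 4-bit layers pack eight values per int32 word while the 8-bit layers pack four, so the packed qweight/qzeros/scales tensors of the Int8 layers have different shapes than their Int4 counterparts. The mix can be confirmed straight from the shards with a short script (a minimal sketch; it assumes the usual GPTQ "*.qweight" tensor naming and reuses the model path from the compose file above):

# inspect_qweight_shapes.py - count distinct GPTQ qweight shapes across the safetensors shards
# (minimal sketch; assumes standard "*.qweight" naming used by GPTQ checkpoints)
import glob
import os
from collections import Counter

from safetensors import safe_open

MODEL_DIR = "/app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix"

shape_counts = Counter()
for shard in sorted(glob.glob(os.path.join(MODEL_DIR, "*.safetensors"))):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if name.endswith(".qweight"):
                # get_slice() only reads the header, so this stays cheap even for a 235B model
                shape_counts[tuple(f.get_slice(name).get_shape())] += 1

for shape, count in sorted(shape_counts.items()):
    print(f"{count:6d} qweight tensors with shape {shape}")

Seeing two shape families for the same expert projection would line up with the copy_() size mismatch (2048 vs 4096) in the traceback below.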

 ✔ Container vllm-7-vllm-1  Recreated                                                                                                                  0.3s 
Attaching to vllm-1
INFO 09-18 17:26:29 [__init__.py:216] Automatically detected platform rocm.
WARNING 09-18 17:26:51 [__init__.py:1764] argument '--disable-log-requests' is deprecated and replaced with '--enable-log-requests'. This will be removed in v0.12.0.
(APIServer pid=7) INFO 09-18 17:26:51 [api_server.py:1814] vLLM API server version 0.10.2rc3.dev180+g2a4d6412e
(APIServer pid=7) INFO 09-18 17:26:51 [utils.py:328] non-default args: {'model_tag': '/app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': '/app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix', 'trust_remote_code': True, 'max_model_len': 65536, 'served_model_name': ['Qwen3-Coder-30B-A3B-Instruct-GPTQ-Int8'], 'tensor_parallel_size': 8, 'gpu_memory_utilization': 0.965, 'max_num_seqs': 8}
(APIServer pid=7) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.

(APIServer pid=7) INFO 09-18 17:27:23 [__init__.py:707] Resolved architecture: Qwen3MoeForCausalLM
(APIServer pid=7) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=7) INFO 09-18 17:27:23 [__init__.py:1766] Using max model len 65536
(APIServer pid=7) INFO 09-18 17:27:23 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 09-18 17:27:29 [__init__.py:216] Automatically detected platform rocm.
(EngineCore_DP0 pid=281) INFO 09-18 17:27:50 [core.py:648] Waiting for init message from front-end.
(EngineCore_DP0 pid=281) INFO 09-18 17:27:50 [core.py:75] Initializing a V1 LLM engine (v0.10.2rc3.dev180+g2a4d6412e) with config: model='/app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix', speculative_config=None, tokenizer='/app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen3-Coder-30B-A3B-Instruct-GPTQ-Int8, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":16,"local_cache_dir":null}
(EngineCore_DP0 pid=281) INFO 09-18 17:27:50 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3, 4, 5, 6, 7], buffer_handle=(8, 16777216, 10, 'psm_1eee8047'), local_subscribe_addr='ipc:///tmp/31218087-e190-494b-99f1-e648dcdd3640', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_DP0 pid=281) WARNING 09-18 17:27:50 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 128 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-18 17:27:55 [__init__.py:216] Automatically detected platform rocm.
INFO 09-18 17:28:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_3891382a'), local_subscribe_addr='ipc:///tmp/57528fd2-f248-448b-8663-2ed5e57cceb9', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-18 17:28:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_d6b41d99'), local_subscribe_addr='ipc:///tmp/72e938d8-ecef-40fb-98be-b8ec37c6a953', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-18 17:28:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_ac6d0a6a'), local_subscribe_addr='ipc:///tmp/95dea33a-0b6c-45d6-8e14-5e5b92d322b4', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-18 17:28:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8dcfe175'), local_subscribe_addr='ipc:///tmp/dc82cf23-3291-46e2-a3ef-cafef17ee764', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-18 17:28:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_a8c47ca3'), local_subscribe_addr='ipc:///tmp/004a6afc-84b6-4811-846a-0b5f7c99d48b', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-18 17:28:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_c81f9a48'), local_subscribe_addr='ipc:///tmp/08be96b6-4d62-42c8-a48a-40fc1d2c7662', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-18 17:28:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_b34f8ba8'), local_subscribe_addr='ipc:///tmp/aa406cb3-bcd1-4ce9-bcf1-85f84cd0c0c6', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-18 17:28:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_af8f8b2b'), local_subscribe_addr='ipc:///tmp/f2b0edc8-a99e-4315-9b24-c08c8b306d1b', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-18 17:28:18 [__init__.py:1439] Found nccl from library librccl.so.1
INFO 09-18 17:28:18 [pynccl.py:70] vLLM is using nccl==2.22.3
INFO 09-18 17:28:21 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_118bae48'), local_subscribe_addr='ipc:///tmp/0baccec2-b9f5-4316-9236-f9ce1c657ac4', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-18 17:28:21 [parallel_state.py:1206] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 09-18 17:28:21 [parallel_state.py:1206] rank 1 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 09-18 17:28:21 [parallel_state.py:1206] rank 4 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 4, EP rank 4
INFO 09-18 17:28:21 [parallel_state.py:1206] rank 2 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 09-18 17:28:21 [parallel_state.py:1206] rank 3 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
INFO 09-18 17:28:21 [parallel_state.py:1206] rank 5 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 5, EP rank 5
INFO 09-18 17:28:21 [parallel_state.py:1206] rank 6 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 6, EP rank 6
INFO 09-18 17:28:21 [parallel_state.py:1206] rank 7 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 7, EP rank 7
(Worker_TP4 pid=421) INFO 09-18 17:28:21 [gpu_model_runner.py:2450] Starting to load model /app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix...
(Worker_TP0 pid=417) INFO 09-18 17:28:21 [gpu_model_runner.py:2450] Starting to load model /app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix...
(Worker_TP1 pid=418) INFO 09-18 17:28:21 [gpu_model_runner.py:2450] Starting to load model /app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix...
(Worker_TP7 pid=424) INFO 09-18 17:28:21 [gpu_model_runner.py:2450] Starting to load model /app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix...
(Worker_TP3 pid=420) INFO 09-18 17:28:21 [gpu_model_runner.py:2450] Starting to load model /app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix...
(Worker_TP2 pid=419) INFO 09-18 17:28:21 [gpu_model_runner.py:2450] Starting to load model /app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix...
(Worker_TP5 pid=422) INFO 09-18 17:28:21 [gpu_model_runner.py:2450] Starting to load model /app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix...
(Worker_TP6 pid=423) INFO 09-18 17:28:21 [gpu_model_runner.py:2450] Starting to load model /app/models/models/vllm/Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix...
(Worker_TP4 pid=421) INFO 09-18 17:28:21 [gpu_model_runner.py:2482] Loading model from scratch...
(Worker_TP7 pid=424) INFO 09-18 17:28:21 [gpu_model_runner.py:2482] Loading model from scratch...
(Worker_TP3 pid=420) INFO 09-18 17:28:21 [gpu_model_runner.py:2482] Loading model from scratch...
(Worker_TP2 pid=419) INFO 09-18 17:28:21 [gpu_model_runner.py:2482] Loading model from scratch...
(Worker_TP0 pid=417) INFO 09-18 17:28:21 [gpu_model_runner.py:2482] Loading model from scratch...
(Worker_TP1 pid=418) INFO 09-18 17:28:21 [gpu_model_runner.py:2482] Loading model from scratch...
(Worker_TP5 pid=422) INFO 09-18 17:28:21 [gpu_model_runner.py:2482] Loading model from scratch...
(Worker_TP6 pid=423) INFO 09-18 17:28:21 [gpu_model_runner.py:2482] Loading model from scratch...
(Worker_TP4 pid=421) INFO 09-18 17:28:21 [rocm.py:245] Using Triton Attention backend on V1 engine.
(Worker_TP4 pid=421) INFO 09-18 17:28:21 [triton_attn.py:266] Using vllm unified attention for TritonAttentionImpl
(Worker_TP7 pid=424) INFO 09-18 17:28:21 [rocm.py:245] Using Triton Attention backend on V1 engine.
(Worker_TP7 pid=424) INFO 09-18 17:28:21 [triton_attn.py:266] Using vllm unified attention for TritonAttentionImpl
(Worker_TP3 pid=420) INFO 09-18 17:28:21 [rocm.py:245] Using Triton Attention backend on V1 engine.
(Worker_TP3 pid=420) INFO 09-18 17:28:21 [triton_attn.py:266] Using vllm unified attention for TritonAttentionImpl
(Worker_TP2 pid=419) INFO 09-18 17:28:21 [rocm.py:245] Using Triton Attention backend on V1 engine.
(Worker_TP2 pid=419) INFO 09-18 17:28:21 [triton_attn.py:266] Using vllm unified attention for TritonAttentionImpl
(Worker_TP5 pid=422) INFO 09-18 17:28:21 [rocm.py:245] Using Triton Attention backend on V1 engine.
(Worker_TP5 pid=422) INFO 09-18 17:28:21 [triton_attn.py:266] Using vllm unified attention for TritonAttentionImpl
(Worker_TP1 pid=418) INFO 09-18 17:28:21 [rocm.py:245] Using Triton Attention backend on V1 engine.
(Worker_TP1 pid=418) INFO 09-18 17:28:21 [triton_attn.py:266] Using vllm unified attention for TritonAttentionImpl
(Worker_TP6 pid=423) INFO 09-18 17:28:21 [rocm.py:245] Using Triton Attention backend on V1 engine.
(Worker_TP6 pid=423) INFO 09-18 17:28:21 [triton_attn.py:266] Using vllm unified attention for TritonAttentionImpl
(Worker_TP0 pid=417) INFO 09-18 17:28:21 [rocm.py:245] Using Triton Attention backend on V1 engine.
(Worker_TP0 pid=417) INFO 09-18 17:28:21 [triton_attn.py:266] Using vllm unified attention for TritonAttentionImpl
Loading safetensors checkpoint shards:   7% 2/27 [00:22<04:46, 11.48s/it]
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.worker.load_model()
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 214, in load_model
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2483, in load_model
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.model = model_loader.load_model(
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.load_weights(model, model_config)
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 265, in load_weights
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     loaded_weights = model.load_weights(
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                      ^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 702, in load_weights
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     return loader.load_weights(weights)
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 291, in load_weights
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     autoloaded_weights = set(self._load_module("", self.module, weights))
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.worker.load_model()
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 214, in load_model
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 249, in _load_module
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     yield from self._load_module(prefix,
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2483, in load_model
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.model = model_loader.load_model(
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     loaded_params = module_load_weights(weights)
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 538, in load_weights
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.load_weights(model, model_config)
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     success = weight_loader(param,
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_lo
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     loaded_weights = model.load_weights(
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/moe_wna16.py", line 466, in moe_wna16_weight_loader
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                      ^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     return weight_loader(param,
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 702, in load_weights
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     return loader.load_weights(weights)
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1451, in weight_loader
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self._load_model_weight_or_group_weight_scale(
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 291, in load_weights
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1128, in _load_model_weight_or_group_weight_scale
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.worker.load_model()
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.worker.load_model()
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     autoloaded_weights = set(self._load_module("", self.module, weights))
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self._load_w13(shard_id=shard_id,
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 214, in load_model
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 214, in load_model
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1171, in _load_w13
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 249, in _load_module
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     expert_data.copy_(loaded_weight)
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2483, in load_model
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2483, in load_model
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     yield from self._load_module(prefix,
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597] RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.model = model_loader.load_model(
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.model = model_loader.load_model(
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     loaded_params = module_load_weights(weights)
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.load_weights(model, model_config)
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self.load_weights(model, model_config)
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 538, in load_weights
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 265, in load_weights
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 265, in load_weights
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     success = weight_loader(param,
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     loaded_weights = model.load_weights(
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     loaded_weights = model.load_weights(
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]               ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                      ^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                      ^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/moe_wna16.py", line 466, in moe_wna16_weight_loader
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 702, in load_weights
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 702, in load_weights
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     return loader.load_weights(weights)
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     return weight_loader(param,
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     return loader.load_weights(weights)
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 291, in load_weights
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/u
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1451, in weight_loader
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     autoloaded_weights = set(self._load_module("", self.module, weights))
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     autoloaded_weights = set(self._load_module("", self.module, weights))
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self._load_model_weight_or_group_weight_scale(
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1128, in _load_model_weight_or_group_weight_scale
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 249, in _load_module
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 249, in _load_module
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self._load_w13(shard_id=shard_id,
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     yield from self._load_module(prefix,
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     yield from self._load_module(prefix,
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1171, in _load_w13
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     expert_data.copy_(loaded_weight)
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     loaded_params = module_load_weights(weights)
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     loaded_params = module_load_weights(weights)
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597] RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 538, in load_weights
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 538, in load_weights
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     success = weight_loader(param,
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     success = weight_loader(param,
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]               ^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]               ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/moe_wna16.py", line 466, in moe_wna16_weight_loader
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/moe_wna16.py", line 466, in moe_wna16_weight_loader
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     return weight_loader(param,
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     return weight_loader(param,
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1451, in weight_loader
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1451, in weight_loader
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self._load_model_weight_or_group_weight_scale(
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self._load_model_weight_or_group_weight_scale(
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1128, in _load_model_weight_or_group_weight_scale
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1128, in _load_model_weight_or_group_weight_scale
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self._load_w13(shard_id=shard_id,
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     self._load_w13(shard_id=shard_id,
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1171, in _load_w13
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1171, in _load_w13
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     expert_data.copy_(loaded_weight)
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]     expert_data.copy_(loaded_weight)
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597] RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597] RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1
Loading safetensors checkpoint shards:   7% 2/27 [00:24<05:02, 12.10s/it]
(Worker_TP2 pid=419) INFO 09-18 17:28:46 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP1 pid=418) INFO 09-18 17:28:46 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP3 pid=420) INFO 09-18 17:28:46 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP0 pid=417) INFO 09-18 17:28:46 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP6 pid=423) INFO 09-18 17:28:46 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP0 pid=417) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 265, in load_weights
(Worker_TP3 pid=420) ERROR 09-18 17:28:46 [multiproc_executor.py:597]               ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=418) ERROR 09-18 17:28:46 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker_TP2 pid=419) ERROR 09-18 17:28:46 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 291, in load_weights
(Worker_TP4 pid=421) INFO 09-18 17:28:46 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP7 pid=424) INFO 09-18 17:28:46 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP5 pid=422) INFO 09-18 17:28:46 [multiproc_executor.py:558] Parent process exited, terminating worker
[rank0]:[W918 17:28:46.977770380 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712] EngineCore failed to start.
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712] Traceback (most recent call last):
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 703, in run_engine_core
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 502, in __init__
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 81, in __init__
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 55, in __init__
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]     self._init_executor()
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712]     raise e from None
(EngineCore_DP0 pid=281) ERROR 09-18 17:28:50 [core.py:712] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=281) Process EngineCore_DP0:
(EngineCore_DP0 pid=281) Traceback (most recent call last):
(EngineCore_DP0 pid=281)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=281)     self.run()
(EngineCore_DP0 pid=281)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=281)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=281)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 716, in run_engine_core
(EngineCore_DP0 pid=281)     raise e
(EngineCore_DP0 pid=281)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 703, in run_engine_core
(EngineCore_DP0 pid=281)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=281)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=281)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 502, in __init__
(EngineCore_DP0 pid=281)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=281)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 81, in __init__
(EngineCore_DP0 pid=281)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=281)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=281)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 55, in __init__
(EngineCore_DP0 pid=281)     self._init_executor()
(EngineCore_DP0 pid=281)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=281)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=281)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=281)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=281)     raise e from None
(EngineCore_DP0 pid=281) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=7) Traceback (most recent call last):
(APIServer pid=7)   File "/usr/local/bin/vllm", line 7, in <module>
(APIServer pid=7)     sys.exit(main())
(APIServer pid=7)              ^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=7)     args.dispatch_function(args)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=7)     uvloop.run(run_server(args))
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=7)     return __asyncio.run(
(APIServer pid=7)            ^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=7)     return runner.run(main)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=7)     return self._loop.run_until_complete(task)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=7)     return await main
(APIServer pid=7)            ^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1859, in run_server
(APIServer pid=7)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1879, in run_server_worker
(APIServer pid=7)     async with build_async_engine_client(
(APIServer pid=7)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=7)     return await anext(self.gen)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 174, in build_async_engine_client
(APIServer pid=7)     async with build_async_engine_client_from_engine_args(
(APIServer pid=7)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 216, in build_async_engine_client_from_engine_args
(APIServer pid=7)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=7)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1595, in inner
(APIServer pid=7)     return fn(*args, **kwargs)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=7)     return cls(
(APIServer pid=7)            ^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=7)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=7)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=7)     return AsyncMPClient(*client_args)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=7)     super().__init__(
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=7)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=7)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=7)     next(self.gen)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=7)     wait_for_engine_startup(
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=7)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=7) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
vllm-1 exited with code 1
