[Bug] FP8 Model Loading Fails with "Expected torch::kInt8" #20052

@celsowm

Description

Your current environment

RTX 5090
vLLM API server version 0.9.2.dev209+g2dd24ebe1
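
A quick torch-only check inside the container can confirm the compute capability the FP8 kernels are dispatched for (a minimal sketch using standard torch APIs; the (12, 0) value for the RTX 5090 is my assumption about Blackwell consumer cards):

```python
# Minimal environment probe: the FP8 cutlass path is selected per compute
# capability, so this is the number that matters for the failure below.
import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))  # expected (12, 0) on an RTX 5090
```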

🐛 Describe the bug

docker run --gpus all --rm -it -p 8000:8000 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  vllm:openai \
  --model RedHatAI/Mistral-Nemo-Instruct-2407-FP8 \
  --max-model-len 60000 \
  --gpu-memory-utilization 0.85
INFO 06-24 20:25:11 [__init__.py:244] Automatically detected platform cuda.
INFO 06-24 20:25:15 [api_server.py:1287] vLLM API server version 0.9.2.dev209+g2dd24ebe1
INFO 06-24 20:25:15 [cli_args.py:309] non-default args: {'model': 'RedHatAI/Mistral-Nemo-Instruct-2407-FP8', 'max_model_len': 60000, 'gpu_memory_utilization': 0.85}
config.json: 100%|█████████████████████████████████████████████████████████████████████| 822/822 [00:00<00:00, 3.01MB/s]
INFO 06-24 20:25:26 [config.py:831] This model supports multiple tasks: {'classify', 'score', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 178k/178k [00:00<00:00, 1.49MB/s]
INFO 06-24 20:25:28 [config.py:1444] Using max model len 60000
INFO 06-24 20:25:30 [config.py:2188] Chunked prefill is enabled with max_num_batched_tokens=2048.
vocab.json: 100%|██████████████████████████████████████████████████████████████████| 2.47M/2.47M [00:00<00:00, 3.49MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████| 3.13M/3.13M [00:00<00:00, 3.76MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████| 9.26M/9.26M [00:00<00:00, 12.3MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████| 414/414 [00:00<00:00, 1.94MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 563kB/s]
INFO 06-24 20:25:36 [__init__.py:244] Automatically detected platform cuda.
INFO 06-24 20:25:37 [core.py:459] Waiting for init message from front-end.
INFO 06-24 20:25:37 [core.py:69] Initializing a V1 LLM engine (v0.9.2.dev209+g2dd24ebe1) with config: model='RedHatAI/Mistral-Nemo-Instruct-2407-FP8', speculative_config=None, tokenizer='RedHatAI/Mistral-Nemo-Instruct-2407-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=60000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=RedHatAI/Mistral-Nemo-Instruct-2407-FP8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-24 20:25:38 [utils.py:2753] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7eff6dea0d70>
INFO 06-24 20:25:38 [parallel_state.py:1072] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-24 20:25:38 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 06-24 20:25:38 [gpu_model_runner.py:1696] Starting to load model RedHatAI/Mistral-Nemo-Instruct-2407-FP8...
INFO 06-24 20:25:38 [gpu_model_runner.py:1701] Loading model from scratch...
INFO 06-24 20:25:39 [cuda.py:270] Using Flash Attention backend on V1 engine.
INFO 06-24 20:25:39 [weight_utils.py:292] Using model weights format ['*.safetensors']
model-00001-of-00003.safetensors: 100%|████████████████████████████████████████████| 4.94G/4.94G [00:51<00:00, 95.2MB/s]
model-00002-of-00003.safetensors: 100%|████████████████████████████████████████████| 4.98G/4.98G [00:55<00:00, 90.0MB/s]
model-00003-of-00003.safetensors: 100%|████████████████████████████████████████████| 3.67G/3.67G [00:39<00:00, 93.6MB/s]
INFO 06-24 20:28:07 [weight_utils.py:308] Time spent downloading weights for RedHatAI/Mistral-Nemo-Instruct-2407-FP8: 147.655946 seconds
model.safetensors.index.json: 100%|█████████████████████████████████████████████████| 78.4k/78.4k [00:00<00:00, 138MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:06<00:13,  6.96s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:07<00:03,  3.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:07<00:00,  1.91s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:07<00:00,  2.63s/it]

INFO 06-24 20:28:15 [default_loader.py:272] Loading weights took 7.91 seconds
INFO 06-24 20:28:15 [gpu_model_runner.py:1725] Model loading took 12.9014 GiB and 156.421439 seconds
INFO 06-24 20:28:24 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/66d243d46e/rank_0_0/backbone for vLLM's torch.compile
INFO 06-24 20:28:24 [backends.py:519] Dynamo bytecode transform time: 8.63 s
INFO 06-24 20:28:26 [backends.py:181] Cache the graph of shape None for later use
INFO 06-24 20:28:42 [backends.py:193] Compiling a graph for general shape takes 17.64 s
ERROR 06-24 20:28:47 [core.py:519] EngineCore failed to start.
ERROR 06-24 20:28:47 [core.py:519] Traceback (most recent call last):
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in run_engine_core
ERROR 06-24 20:28:47 [core.py:519]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 394, in __init__
ERROR 06-24 20:28:47 [core.py:519]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
ERROR 06-24 20:28:47 [core.py:519]     self._initialize_kv_caches(vllm_config)
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 142, in _initialize_kv_caches
ERROR 06-24 20:28:47 [core.py:519]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-24 20:28:47 [core.py:519]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 06-24 20:28:47 [core.py:519]     output = self.collective_rpc("determine_available_memory")
ERROR 06-24 20:28:47 [core.py:519]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-24 20:28:47 [core.py:519]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-24 20:28:47 [core.py:519]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2687, in run_method
ERROR 06-24 20:28:47 [core.py:519]     return func(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-24 20:28:47 [core.py:519]     return func(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 210, in determine_available_memory
ERROR 06-24 20:28:47 [core.py:519]     self.model_runner.profile_run()
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2177, in profile_run
ERROR 06-24 20:28:47 [core.py:519]     = self._dummy_run(self.max_num_tokens)
ERROR 06-24 20:28:47 [core.py:519]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-24 20:28:47 [core.py:519]     return func(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1969, in _dummy_run
ERROR 06-24 20:28:47 [core.py:519]     outputs = model(
ERROR 06-24 20:28:47 [core.py:519]               ^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-24 20:28:47 [core.py:519]     return self._call_impl(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-24 20:28:47 [core.py:519]     return forward_call(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 581, in forward
ERROR 06-24 20:28:47 [core.py:519]     model_output = self.model(input_ids, positions, intermediate_tensors,
ERROR 06-24 20:28:47 [core.py:519]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 239, in __call__
ERROR 06-24 20:28:47 [core.py:519]     output = self.compiled_callable(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
ERROR 06-24 20:28:47 [core.py:519]     return fn(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 368, in forward
ERROR 06-24 20:28:47 [core.py:519]     def forward(
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-24 20:28:47 [core.py:519]     return self._call_impl(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-24 20:28:47 [core.py:519]     return forward_call(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 06-24 20:28:47 [core.py:519]     return fn(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
ERROR 06-24 20:28:47 [core.py:519]     return self._wrapped_call(self, *args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
ERROR 06-24 20:28:47 [core.py:519]     raise e
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
ERROR 06-24 20:28:47 [core.py:519]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-24 20:28:47 [core.py:519]     return self._call_impl(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-24 20:28:47 [core.py:519]     return forward_call(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "<eval_with_key>.82", line 570, in forward
ERROR 06-24 20:28:47 [core.py:519]     submod_0 = self.submod_0(l_input_ids_, s0, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_ = None
ERROR 06-24 20:28:47 [core.py:519]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 112, in __call__
ERROR 06-24 20:28:47 [core.py:519]     return self.compiled_graph_for_general_shape(*args)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 06-24 20:28:47 [core.py:519]     return fn(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1201, in forward
ERROR 06-24 20:28:47 [core.py:519]     return compiled_fn(full_args)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
ERROR 06-24 20:28:47 [core.py:519]     all_outs = call_func_at_runtime_with_args(
ERROR 06-24 20:28:47 [core.py:519]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
ERROR 06-24 20:28:47 [core.py:519]     out = normalize_as_list(f(args))
ERROR 06-24 20:28:47 [core.py:519]                             ^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
ERROR 06-24 20:28:47 [core.py:519]     outs = compiled_fn(args)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
ERROR 06-24 20:28:47 [core.py:519]     return compiled_fn(runtime_args)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 460, in __call__
ERROR 06-24 20:28:47 [core.py:519]     return self.current_callable(inputs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2404, in run
ERROR 06-24 20:28:47 [core.py:519]     return model(new_inputs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/root/.cache/vllm/torch_compile_cache/66d243d46e/rank_0_0/inductor_cache/p5/cp5v2rsui6pemfcrv4y6tvjyqvla754oioaiqjrzcnhevfmigmxe.py", line 342, in call
ERROR 06-24 20:28:47 [core.py:519]     torch.ops._C.cutlass_scaled_mm.default(buf6, buf0, arg5_1, arg4_1, arg6_1, None)
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
ERROR 06-24 20:28:47 [core.py:519]     return self._op(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 523, in run_engine_core
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 394, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 142, in _initialize_kv_caches
    available_gpu_memory = self.model_executor.determine_available_memory()
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
    output = self.collective_rpc("determine_available_memory")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2687, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 210, in determine_available_memory
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2177, in profile_run
    = self._dummy_run(self.max_num_tokens)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1969, in _dummy_run
    outputs = model(
              ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 581, in forward
    model_output = self.model(input_ids, positions, intermediate_tensors,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 239, in __call__
    output = self.compiled_callable(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 368, in forward
    def forward(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
    raise e
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<eval_with_key>.82", line 570, in forward
    submod_0 = self.submod_0(l_input_ids_, s0, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_ = None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 112, in __call__
    return self.compiled_graph_for_general_shape(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1201, in forward
    return compiled_fn(full_args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
    outs = compiled_fn(args)
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
    return compiled_fn(runtime_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 460, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2404, in run
    return model(new_inputs)
           ^^^^^^^^^^^^^^^^^
  File "/root/.cache/vllm/torch_compile_cache/66d243d46e/rank_0_0/inductor_cache/p5/cp5v2rsui6pemfcrv4y6tvjyqvla754oioaiqjrzcnhevfmigmxe.py", line 342, in call
    torch.ops._C.cutlass_scaled_mm.default(buf6, buf0, arg5_1, arg4_1, arg6_1, None)
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
[rank0]:[W624 20:28:47.081915375 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1387, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1323, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1343, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 155, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 191, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 93, in make_async_mp_client
    return AsyncMPClient(vllm_config, executor_class, log_stats,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 716, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 422, in __init__
    self._init_engines_direct(vllm_config, local_only,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 491, in _init_engines_direct
    self._wait_for_engine_startup(handshake_socket, input_address,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 511, in _wait_for_engine_startup
    wait_for_engine_startup(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/utils.py", line 494, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
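
To narrow this down outside the engine, a standalone call to the same kernel should hit the same dtype check. A minimal sketch, assuming this build's `vllm._custom_ops` still exposes `cutlass_scaled_mm` and `cutlass_scaled_mm_supports_fp8` with the usual signatures:

```python
# Hedged repro sketch: exercises the same cutlass_scaled_mm custom op that fails
# in the traceback, with per-tensor-scaled FP8 inputs. If the FP8 cutlass kernels
# are not available for this architecture, the support query should return False
# and the GEMM call should raise the same "Expected a.dtype() == torch::kInt8" error.
import torch
from vllm import _custom_ops as ops  # assumed import path for the compiled ops

major, minor = torch.cuda.get_device_capability()
capability = major * 10 + minor
print("compute capability:", capability)
print("cutlass FP8 supported:", ops.cutlass_scaled_mm_supports_fp8(capability))

# Tiny FP8 GEMM in the layout the kernel expects: row-major a, column-major b,
# float32 per-tensor scales, bf16 output.
a = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn).t()  # (32, 16), column-major
scale_a = torch.ones(1, device="cuda", dtype=torch.float32)
scale_b = torch.ones(1, device="cuda", dtype=torch.float32)

out = ops.cutlass_scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
print(out.shape, out.dtype)
```

If the support query already returns False for capability 120, that would point at the FP8 cutlass kernels not being built or dispatched for sm_120 in this wheel, so the op falls through to the int8 check seen in the traceback.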

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
