[Bug] FP8 Model Loading Fails with "Expected torch::kInt8" #20052

@celsowm

Description

Your current environment

RTX 5090
vLLM API server version 0.9.2.dev209+g2dd24ebe1
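
A quick torch-only check inside the container can confirm the compute capability the FP8 kernels are dispatched for (a minimal sketch using standard torch APIs; the (12, 0) value for the RTX 5090 is my assumption about Blackwell consumer cards):

```python
# Minimal environment probe: the FP8 cutlass path is selected per compute
# capability, so this is the number that matters for the failure below.
import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))  # expected (12, 0) on an RTX 5090
```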

🐛 Describe the bug

docker run --gpus all --rm -it -p 8000:8000 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  vllm:openai \
  --model RedHatAI/Mistral-Nemo-Instruct-2407-FP8 \
  --max-model-len 60000 \
  --gpu-memory-utilization 0.85
INFO 06-24 20:25:11 [__init__.py:244] Automatically detected platform cuda.
INFO 06-24 20:25:15 [api_server.py:1287] vLLM API server version 0.9.2.dev209+g2dd24ebe1
INFO 06-24 20:25:15 [cli_args.py:309] non-default args: {'model': 'RedHatAI/Mistral-Nemo-Instruct-2407-FP8', 'max_model_len': 60000, 'gpu_memory_utilization': 0.85}
config.json: 100%|█████████████████████████████████████████████████████████████████████| 822/822 [00:00<00:00, 3.01MB/s]
INFO 06-24 20:25:26 [config.py:831] This model supports multiple tasks: {'classify', 'score', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 178k/178k [00:00<00:00, 1.49MB/s]
INFO 06-24 20:25:28 [config.py:1444] Using max model len 60000
INFO 06-24 20:25:30 [config.py:2188] Chunked prefill is enabled with max_num_batched_tokens=2048.
vocab.json: 100%|██████████████████████████████████████████████████████████████████| 2.47M/2.47M [00:00<00:00, 3.49MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████| 3.13M/3.13M [00:00<00:00, 3.76MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████| 9.26M/9.26M [00:00<00:00, 12.3MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████| 414/414 [00:00<00:00, 1.94MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 563kB/s]
INFO 06-24 20:25:36 [__init__.py:244] Automatically detected platform cuda.
INFO 06-24 20:25:37 [core.py:459] Waiting for init message from front-end.
INFO 06-24 20:25:37 [core.py:69] Initializing a V1 LLM engine (v0.9.2.dev209+g2dd24ebe1) with config: model='RedHatAI/Mistral-Nemo-Instruct-2407-FP8', speculative_config=None, tokenizer='RedHatAI/Mistral-Nemo-Instruct-2407-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=60000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=RedHatAI/Mistral-Nemo-Instruct-2407-FP8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-24 20:25:38 [utils.py:2753] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7eff6dea0d70>
INFO 06-24 20:25:38 [parallel_state.py:1072] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-24 20:25:38 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 06-24 20:25:38 [gpu_model_runner.py:1696] Starting to load model RedHatAI/Mistral-Nemo-Instruct-2407-FP8...
INFO 06-24 20:25:38 [gpu_model_runner.py:1701] Loading model from scratch...
INFO 06-24 20:25:39 [cuda.py:270] Using Flash Attention backend on V1 engine.
INFO 06-24 20:25:39 [weight_utils.py:292] Using model weights format ['*.safetensors']
model-00001-of-00003.safetensors: 100%|████████████████████████████████████████████| 4.94G/4.94G [00:51<00:00, 95.2MB/s]
model-00002-of-00003.safetensors: 100%|████████████████████████████████████████████| 4.98G/4.98G [00:55<00:00, 90.0MB/s]
model-00003-of-00003.safetensors: 100%|████████████████████████████████████████████| 3.67G/3.67G [00:39<00:00, 93.6MB/s]
INFO 06-24 20:28:07 [weight_utils.py:308] Time spent downloading weights for RedHatAI/Mistral-Nemo-Instruct-2407-FP8: 147.655946 seconds
model.safetensors.index.json: 100%|█████████████████████████████████████████████████| 78.4k/78.4k [00:00<00:00, 138MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:06<00:13,  6.96s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:07<00:03,  3.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:07<00:00,  1.91s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:07<00:00,  2.63s/it]

INFO 06-24 20:28:15 [default_loader.py:272] Loading weights took 7.91 seconds
INFO 06-24 20:28:15 [gpu_model_runner.py:1725] Model loading took 12.9014 GiB and 156.421439 seconds
INFO 06-24 20:28:24 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/66d243d46e/rank_0_0/backbone for vLLM's torch.compile
INFO 06-24 20:28:24 [backends.py:519] Dynamo bytecode transform time: 8.63 s
INFO 06-24 20:28:26 [backends.py:181] Cache the graph of shape None for later use
INFO 06-24 20:28:42 [backends.py:193] Compiling a graph for general shape takes 17.64 s
ERROR 06-24 20:28:47 [core.py:519] EngineCore failed to start.
ERROR 06-24 20:28:47 [core.py:519] Traceback (most recent call last):
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in run_engine_core
ERROR 06-24 20:28:47 [core.py:519]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 394, in __init__
ERROR 06-24 20:28:47 [core.py:519]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
ERROR 06-24 20:28:47 [core.py:519]     self._initialize_kv_caches(vllm_config)
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 142, in _initialize_kv_caches
ERROR 06-24 20:28:47 [core.py:519]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-24 20:28:47 [core.py:519]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 06-24 20:28:47 [core.py:519]     output = self.collective_rpc("determine_available_memory")
ERROR 06-24 20:28:47 [core.py:519]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-24 20:28:47 [core.py:519]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-24 20:28:47 [core.py:519]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2687, in run_method
ERROR 06-24 20:28:47 [core.py:519]     return func(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-24 20:28:47 [core.py:519]     return func(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 210, in determine_available_memory
ERROR 06-24 20:28:47 [core.py:519]     self.model_runner.profile_run()
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2177, in profile_run
ERROR 06-24 20:28:47 [core.py:519]     = self._dummy_run(self.max_num_tokens)
ERROR 06-24 20:28:47 [core.py:519]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-24 20:28:47 [core.py:519]     return func(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1969, in _dummy_run
ERROR 06-24 20:28:47 [core.py:519]     outputs = model(
ERROR 06-24 20:28:47 [core.py:519]               ^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-24 20:28:47 [core.py:519]     return self._call_impl(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-24 20:28:47 [core.py:519]     return forward_call(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 581, in forward
ERROR 06-24 20:28:47 [core.py:519]     model_output = self.model(input_ids, positions, intermediate_tensors,
ERROR 06-24 20:28:47 [core.py:519]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 239, in __call__
ERROR 06-24 20:28:47 [core.py:519]     output = self.compiled_callable(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
ERROR 06-24 20:28:47 [core.py:519]     return fn(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 368, in forward
ERROR 06-24 20:28:47 [core.py:519]     def forward(
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-24 20:28:47 [core.py:519]     return self._call_impl(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-24 20:28:47 [core.py:519]     return forward_call(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 06-24 20:28:47 [core.py:519]     return fn(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
ERROR 06-24 20:28:47 [core.py:519]     return self._wrapped_call(self, *args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
ERROR 06-24 20:28:47 [core.py:519]     raise e
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
ERROR 06-24 20:28:47 [core.py:519]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-24 20:28:47 [core.py:519]     return self._call_impl(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-24 20:28:47 [core.py:519]     return forward_call(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "<eval_with_key>.82", line 570, in forward
ERROR 06-24 20:28:47 [core.py:519]     submod_0 = self.submod_0(l_input_ids_, s0, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_ = None
ERROR 06-24 20:28:47 [core.py:519]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 112, in __call__
ERROR 06-24 20:28:47 [core.py:519]     return self.compiled_graph_for_general_shape(*args)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 06-24 20:28:47 [core.py:519]     return fn(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1201, in forward
ERROR 06-24 20:28:47 [core.py:519]     return compiled_fn(full_args)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
ERROR 06-24 20:28:47 [core.py:519]     all_outs = call_func_at_runtime_with_args(
ERROR 06-24 20:28:47 [core.py:519]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
ERROR 06-24 20:28:47 [core.py:519]     out = normalize_as_list(f(args))
ERROR 06-24 20:28:47 [core.py:519]                             ^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
ERROR 06-24 20:28:47 [core.py:519]     outs = compiled_fn(args)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
ERROR 06-24 20:28:47 [core.py:519]     return compiled_fn(runtime_args)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 460, in __call__
ERROR 06-24 20:28:47 [core.py:519]     return self.current_callable(inputs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2404, in run
ERROR 06-24 20:28:47 [core.py:519]     return model(new_inputs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519]   File "/root/.cache/vllm/torch_compile_cache/66d243d46e/rank_0_0/inductor_cache/p5/cp5v2rsui6pemfcrv4y6tvjyqvla754oioaiqjrzcnhevfmigmxe.py", line 342, in call
ERROR 06-24 20:28:47 [core.py:519]     torch.ops._C.cutlass_scaled_mm.default(buf6, buf0, arg5_1, arg4_1, arg6_1, None)
ERROR 06-24 20:28:47 [core.py:519]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
ERROR 06-24 20:28:47 [core.py:519]     return self._op(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 523, in run_engine_core
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 394, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 142, in _initialize_kv_caches
    available_gpu_memory = self.model_executor.determine_available_memory()
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
    output = self.collective_rpc("determine_available_memory")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2687, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 210, in determine_available_memory
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2177, in profile_run
    = self._dummy_run(self.max_num_tokens)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1969, in _dummy_run
    outputs = model(
              ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 581, in forward
    model_output = self.model(input_ids, positions, intermediate_tensors,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 239, in __call__
    output = self.compiled_callable(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 368, in forward
    def forward(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
    raise e
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<eval_with_key>.82", line 570, in forward
    submod_0 = self.submod_0(l_input_ids_, s0, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_ = None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 112, in __call__
    return self.compiled_graph_for_general_shape(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1201, in forward
    return compiled_fn(full_args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
    outs = compiled_fn(args)
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
    return compiled_fn(runtime_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 460, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2404, in run
    return model(new_inputs)
           ^^^^^^^^^^^^^^^^^
  File "/root/.cache/vllm/torch_compile_cache/66d243d46e/rank_0_0/inductor_cache/p5/cp5v2rsui6pemfcrv4y6tvjyqvla754oioaiqjrzcnhevfmigmxe.py", line 342, in call
    torch.ops._C.cutlass_scaled_mm.default(buf6, buf0, arg5_1, arg4_1, arg6_1, None)
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
[rank0]:[W624 20:28:47.081915375 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1387, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1323, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1343, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 155, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 191, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 93, in make_async_mp_client
    return AsyncMPClient(vllm_config, executor_class, log_stats,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 716, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 422, in __init__
    self._init_engines_direct(vllm_config, local_only,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 491, in _init_engines_direct
    self._wait_for_engine_startup(handshake_socket, input_address,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 511, in _wait_for_engine_startup
    wait_for_engine_startup(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/utils.py", line 494, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
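
To narrow this down outside the engine, a standalone call to the same kernel should hit the same dtype check. A minimal sketch, assuming this build's `vllm._custom_ops` still exposes `cutlass_scaled_mm` and `cutlass_scaled_mm_supports_fp8` with the usual signatures:

```python
# Hedged repro sketch: exercises the same cutlass_scaled_mm custom op that fails
# in the traceback, with per-tensor-scaled FP8 inputs. If the FP8 cutlass kernels
# are not available for this architecture, the support query should return False
# and the GEMM call should raise the same "Expected a.dtype() == torch::kInt8" error.
import torch
from vllm import _custom_ops as ops  # assumed import path for the compiled ops

major, minor = torch.cuda.get_device_capability()
capability = major * 10 + minor
print("compute capability:", capability)
print("cutlass FP8 supported:", ops.cutlass_scaled_mm_supports_fp8(capability))

# Tiny FP8 GEMM in the layout the kernel expects: row-major a, column-major b,
# float32 per-tensor scales, bf16 output.
a = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn).t()  # (32, 16), column-major
scale_a = torch.ones(1, device="cuda", dtype=torch.float32)
scale_b = torch.ones(1, device="cuda", dtype=torch.float32)

out = ops.cutlass_scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
print(out.shape, out.dtype)
```

If the support query already returns False for capability 120, that would point at the FP8 cutlass kernels not being built or dispatched for sm_120 in this wheel, so the op falls through to the int8 check seen in the traceback.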

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
