Description
Your current environment
RTX 5090
vLLM API server version 0.9.2.dev209+g2dd24ebe1
🐛 Describe the bug
docker run --gpus all --rm -it -p 8000:8000 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  vllm:openai \
  --model RedHatAI/Mistral-Nemo-Instruct-2407-FP8 \
  --max-model-len 60000 \
--gpu-memory-utilization 0.85
INFO 06-24 20:25:11 [__init__.py:244] Automatically detected platform cuda.
INFO 06-24 20:25:15 [api_server.py:1287] vLLM API server version 0.9.2.dev209+g2dd24ebe1
INFO 06-24 20:25:15 [cli_args.py:309] non-default args: {'model': 'RedHatAI/Mistral-Nemo-Instruct-2407-FP8', 'max_model_len': 60000, 'gpu_memory_utilization': 0.85}
config.json: 100%|█████████████████████████████████████████████████████████████████████| 822/822 [00:00<00:00, 3.01MB/s]
INFO 06-24 20:25:26 [config.py:831] This model supports multiple tasks: {'classify', 'score', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 178k/178k [00:00<00:00, 1.49MB/s]
INFO 06-24 20:25:28 [config.py:1444] Using max model len 60000
INFO 06-24 20:25:30 [config.py:2188] Chunked prefill is enabled with max_num_batched_tokens=2048.
vocab.json: 100%|██████████████████████████████████████████████████████████████████| 2.47M/2.47M [00:00<00:00, 3.49MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████| 3.13M/3.13M [00:00<00:00, 3.76MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████| 9.26M/9.26M [00:00<00:00, 12.3MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████| 414/414 [00:00<00:00, 1.94MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 563kB/s]
INFO 06-24 20:25:36 [__init__.py:244] Automatically detected platform cuda.
INFO 06-24 20:25:37 [core.py:459] Waiting for init message from front-end.
INFO 06-24 20:25:37 [core.py:69] Initializing a V1 LLM engine (v0.9.2.dev209+g2dd24ebe1) with config: model='RedHatAI/Mistral-Nemo-Instruct-2407-FP8', speculative_config=None, tokenizer='RedHatAI/Mistral-Nemo-Instruct-2407-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=60000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=RedHatAI/Mistral-Nemo-Instruct-2407-FP8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-24 20:25:38 [utils.py:2753] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7eff6dea0d70>
INFO 06-24 20:25:38 [parallel_state.py:1072] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-24 20:25:38 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 06-24 20:25:38 [gpu_model_runner.py:1696] Starting to load model RedHatAI/Mistral-Nemo-Instruct-2407-FP8...
INFO 06-24 20:25:38 [gpu_model_runner.py:1701] Loading model from scratch...
INFO 06-24 20:25:39 [cuda.py:270] Using Flash Attention backend on V1 engine.
INFO 06-24 20:25:39 [weight_utils.py:292] Using model weights format ['*.safetensors']
model-00001-of-00003.safetensors: 100%|████████████████████████████████████████████| 4.94G/4.94G [00:51<00:00, 95.2MB/s]
model-00002-of-00003.safetensors: 100%|████████████████████████████████████████████| 4.98G/4.98G [00:55<00:00, 90.0MB/s]
model-00003-of-00003.safetensors: 100%|████████████████████████████████████████████| 3.67G/3.67G [00:39<00:00, 93.6MB/s]
INFO 06-24 20:28:07 [weight_utils.py:308] Time spent downloading weights for RedHatAI/Mistral-Nemo-Instruct-2407-FP8: 147.655946 seconds
model.safetensors.index.json: 100%|█████████████████████████████████████████████████| 78.4k/78.4k [00:00<00:00, 138MB/s]
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:06<00:13, 6.96s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:07<00:03, 3.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:07<00:00, 1.91s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:07<00:00, 2.63s/it]
INFO 06-24 20:28:15 [default_loader.py:272] Loading weights took 7.91 seconds
INFO 06-24 20:28:15 [gpu_model_runner.py:1725] Model loading took 12.9014 GiB and 156.421439 seconds
INFO 06-24 20:28:24 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/66d243d46e/rank_0_0/backbone for vLLM's torch.compile
INFO 06-24 20:28:24 [backends.py:519] Dynamo bytecode transform time: 8.63 s
INFO 06-24 20:28:26 [backends.py:181] Cache the graph of shape None for later use
INFO 06-24 20:28:42 [backends.py:193] Compiling a graph for general shape takes 17.64 s
ERROR 06-24 20:28:47 [core.py:519] EngineCore failed to start.
ERROR 06-24 20:28:47 [core.py:519] Traceback (most recent call last):
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in run_engine_core
ERROR 06-24 20:28:47 [core.py:519] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 394, in __init__
ERROR 06-24 20:28:47 [core.py:519] super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
ERROR 06-24 20:28:47 [core.py:519] self._initialize_kv_caches(vllm_config)
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 142, in _initialize_kv_caches
ERROR 06-24 20:28:47 [core.py:519] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 06-24 20:28:47 [core.py:519] output = self.collective_rpc("determine_available_memory")
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-24 20:28:47 [core.py:519] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2687, in run_method
ERROR 06-24 20:28:47 [core.py:519] return func(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-24 20:28:47 [core.py:519] return func(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 210, in determine_available_memory
ERROR 06-24 20:28:47 [core.py:519] self.model_runner.profile_run()
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2177, in profile_run
ERROR 06-24 20:28:47 [core.py:519] = self._dummy_run(self.max_num_tokens)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-24 20:28:47 [core.py:519] return func(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1969, in _dummy_run
ERROR 06-24 20:28:47 [core.py:519] outputs = model(
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-24 20:28:47 [core.py:519] return self._call_impl(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-24 20:28:47 [core.py:519] return forward_call(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 581, in forward
ERROR 06-24 20:28:47 [core.py:519] model_output = self.model(input_ids, positions, intermediate_tensors,
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 239, in __call__
ERROR 06-24 20:28:47 [core.py:519] output = self.compiled_callable(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
ERROR 06-24 20:28:47 [core.py:519] return fn(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 368, in forward
ERROR 06-24 20:28:47 [core.py:519] def forward(
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-24 20:28:47 [core.py:519] return self._call_impl(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-24 20:28:47 [core.py:519] return forward_call(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 06-24 20:28:47 [core.py:519] return fn(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
ERROR 06-24 20:28:47 [core.py:519] return self._wrapped_call(self, *args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
ERROR 06-24 20:28:47 [core.py:519] raise e
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
ERROR 06-24 20:28:47 [core.py:519] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-24 20:28:47 [core.py:519] return self._call_impl(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-24 20:28:47 [core.py:519] return forward_call(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "<eval_with_key>.82", line 570, in forward
ERROR 06-24 20:28:47 [core.py:519] submod_0 = self.submod_0(l_input_ids_, s0, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_ = None
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 112, in __call__
ERROR 06-24 20:28:47 [core.py:519] return self.compiled_graph_for_general_shape(*args)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 06-24 20:28:47 [core.py:519] return fn(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1201, in forward
ERROR 06-24 20:28:47 [core.py:519] return compiled_fn(full_args)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
ERROR 06-24 20:28:47 [core.py:519] all_outs = call_func_at_runtime_with_args(
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
ERROR 06-24 20:28:47 [core.py:519] out = normalize_as_list(f(args))
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
ERROR 06-24 20:28:47 [core.py:519] outs = compiled_fn(args)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
ERROR 06-24 20:28:47 [core.py:519] return compiled_fn(runtime_args)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 460, in __call__
ERROR 06-24 20:28:47 [core.py:519] return self.current_callable(inputs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2404, in run
ERROR 06-24 20:28:47 [core.py:519] return model(new_inputs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] File "/root/.cache/vllm/torch_compile_cache/66d243d46e/rank_0_0/inductor_cache/p5/cp5v2rsui6pemfcrv4y6tvjyqvla754oioaiqjrzcnhevfmigmxe.py", line 342, in call
ERROR 06-24 20:28:47 [core.py:519] torch.ops._C.cutlass_scaled_mm.default(buf6, buf0, arg5_1, arg4_1, arg6_1, None)
ERROR 06-24 20:28:47 [core.py:519] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
ERROR 06-24 20:28:47 [core.py:519] return self._op(*args, **kwargs)
ERROR 06-24 20:28:47 [core.py:519] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-24 20:28:47 [core.py:519] RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 523, in run_engine_core
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 394, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
self._initialize_kv_caches(vllm_config)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 142, in _initialize_kv_caches
available_gpu_memory = self.model_executor.determine_available_memory()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
output = self.collective_rpc("determine_available_memory")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2687, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 210, in determine_available_memory
self.model_runner.profile_run()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2177, in profile_run
= self._dummy_run(self.max_num_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1969, in _dummy_run
outputs = model(
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 581, in forward
model_output = self.model(input_ids, positions, intermediate_tensors,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 239, in __call__
output = self.compiled_callable(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 368, in forward
def forward(
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
raise e
File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<eval_with_key>.82", line 570, in forward
submod_0 = self.submod_0(l_input_ids_, s0, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_input_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_ = None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 112, in __call__
return self.compiled_graph_for_general_shape(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1201, in forward
return compiled_fn(full_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
all_outs = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
outs = compiled_fn(args)
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
return compiled_fn(runtime_args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 460, in __call__
return self.current_callable(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2404, in run
return model(new_inputs)
^^^^^^^^^^^^^^^^^
File "/root/.cache/vllm/torch_compile_cache/66d243d46e/rank_0_0/inductor_cache/p5/cp5v2rsui6pemfcrv4y6tvjyqvla754oioaiqjrzcnhevfmigmxe.py", line 342, in call
torch.ops._C.cutlass_scaled_mm.default(buf6, buf0, arg5_1, arg4_1, arg6_1, None)
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
[rank0]:[W624 20:28:47.081915375 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1387, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1323, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1343, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 155, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 191, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
self.engine_core = EngineCoreClient.make_async_mp_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 93, in make_async_mp_client
return AsyncMPClient(vllm_config, executor_class, log_stats,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 716, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 422, in __init__
self._init_engines_direct(vllm_config, local_only,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 491, in _init_engines_direct
self._wait_for_engine_startup(handshake_socket, input_address,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 511, in _wait_for_engine_startup
wait_for_engine_startup(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/utils.py", line 494, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
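For whoever triages this: the failure surfaces inside vLLM's torch.ops._C.cutlass_scaled_mm custom op, which is being handed a non-int8 activation tensor during the profiling run, so it looks as if the FP8 CUTLASS path is not being dispatched for this GPU and the call falls through to the int8 assertion. As a quick sanity check, here is a minimal diagnostic sketch (assumptions: it is run inside the same vllm:openai container with the bundled PyTorch, and torch.float8_e4m3fn is available in that build) that prints the detected compute capability and confirms FP8 tensors can be allocated at all:

import torch

# Report the device and its compute capability; the RTX 5090 is a Blackwell
# part, so something like (12, 0) is expected here.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))

# Allocate a small FP8 tensor to confirm the dtype itself is usable on this GPU.
x = torch.zeros(8, 8, device="cuda", dtype=torch.float8_e4m3fn)
print(x.dtype)

If the capability reported is (12, 0), the dtype error most likely means the FP8 kernels for this architecture were not selected or not built into this image, and the op then expects an int8 (w8a8) input instead. That is a guess based on the trace above, not a confirmed root cause.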
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.