I built the latest master branch (which includes the #2279 commit) and ran the following command:
python -m vllm.entrypoints.openai.api_server --model ./Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host 0.0.0.0 --port 8081 --tensor-parallel-size 2
and I hit this error:
INFO 01-29 09:41:47 api_server.py:209] args: Namespace(host='0.0.0.0', port=8081, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='./Mistral-7B-Instruct-v0.2-AWQ', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 01-29 09:41:47 config.py:177] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-29 09:41:49,090 INFO worker.py:1724 -- Started a local Ray instance.
INFO 01-29 09:41:50 llm_engine.py:72] Initializing an LLM engine with config: model='./Mistral-7B-Instruct-v0.2-AWQ', tokenizer='./Mistral-7B-Instruct-v0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, seed=0)
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/my/vllm/vllm/entrypoints/openai/api_server.py", line 217, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/my/vllm/vllm/engine/async_llm_engine.py", line 615, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/my/vllm/vllm/engine/async_llm_engine.py", line 319, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/my/vllm/vllm/engine/async_llm_engine.py", line 364, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/my/vllm/vllm/engine/llm_engine.py", line 109, in __init__
self._init_workers_ray(placement_group)
File "/home/my/vllm/vllm/engine/llm_engine.py", line 260, in _init_workers_ray
self.driver_worker = Worker(
^^^^^^^
TypeError: Worker.__init__() got an unexpected keyword argument 'cache_config'
2024-01-29 09:41:54,676 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerVllm.init_worker() (pid=3160378, ip=10.20.4.57, actor_id=ca7bf2aa56e3f1a0c1a7678201000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f043133e7d0>)
File "/home/my/vllm/vllm/engine/ray_utils.py", line 23, in init_worker
self.worker = worker_init_fn()
^^^^^^^^^^^^^^^^
File "/home/my/vllm/vllm/engine/llm_engine.py", line 247, in <lambda>
lambda rank=rank, local_rank=local_rank: Worker(
^^^^^^^
TypeError: Worker.__init__() got an unexpected keyword argument 'cache_config'
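The TypeError itself is just Python rejecting an unexpected keyword argument: the call in llm_engine.py line 247 passes cache_config to Worker(), but the Worker.__init__ that actually gets imported does not take that parameter. Here is a tiny self-contained snippet (a stand-in Worker class, not the real vllm.worker.worker.Worker) that produces the same kind of error, just to illustrate the mismatch I think is happening:

```python
# Stand-in Worker with an "older" signature that lacks cache_config.
class Worker:
    def __init__(self, model_config=None, parallel_config=None,
                 local_rank=0, rank=0):
        self.rank = rank

try:
    # Roughly what llm_engine._init_workers_ray does on current master:
    # it forwards cache_config as a keyword argument to Worker().
    w = Worker(model_config=None, parallel_config=None,
               local_rank=0, rank=0, cache_config=None)
except TypeError as e:
    # Prints: Worker.__init__() got an unexpected keyword argument 'cache_config'
    print(e)
```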
I am running Python 3.11, CUDA 12.1, and driver 530 on 2x RTX 3090 with NVLink.
I noticed there is a discussion (#2279 (comment)) about cache_config, but I am not sure whether it is related.
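In case this is a stale-install problem on my side (an old installed copy of vllm shadowing the freshly built source), I also plan to check which Worker module Python actually imports and whether its signature accepts cache_config, roughly like this (my own diagnostic sketch, assuming the Worker class lives in vllm.worker.worker, which is where my checkout's llm_engine imports it from):

```python
import inspect
import vllm.worker.worker as worker_mod

# Which worker.py file is really being imported?
print(worker_mod.__file__)

# Does its __init__ accept cache_config? If not, the import is resolving
# to an older copy of vllm than the one I just built.
print(inspect.signature(worker_mod.Worker.__init__))
```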