Commit bcb8076

vllm worker awq quantization update (#2463)
Co-authored-by: 董晓龙 <[email protected]>
1 parent a040cdc commit bcb8076

2 files changed: +7 -0 lines changed


docs/vllm_integration.md

Lines changed: 5 additions & 0 deletions
@@ -18,3 +18,8 @@ See the supported models [here](https://vllm.readthedocs.io/en/latest/models/sup
 ```
 python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3 --tokenizer hf-internal-testing/llama-tokenizer
 ```
+
+If you use an AWQ model, try
+```
+python3 -m fastchat.serve.vllm_worker --model-path TheBloke/vicuna-7B-v1.5-AWQ --quantization awq
+```

fastchat/serve/vllm_worker.py

Lines changed: 2 additions & 0 deletions
@@ -210,6 +210,8 @@ async def api_model_details(request: Request):
     args.model = args.model_path
     if args.num_gpus > 1:
         args.tensor_parallel_size = args.num_gpus
+    if args.quantization:
+        args.quantization = args.quantization
 
     engine_args = AsyncEngineArgs.from_cli_args(args)
     engine = AsyncLLMEngine.from_engine_args(engine_args)
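
For context, here is a minimal sketch of how these flags reach vLLM. It assumes the worker's usual setup, in which vLLM's `AsyncEngineArgs.add_cli_args` registers the engine flags (including `--quantization`) on the same parser; the defaults given for `--model-path` and `--num-gpus` are illustrative, not the worker's exact definitions.

```
# Minimal sketch, not the full worker: mapping FastChat-style flags onto
# vLLM engine arguments. Assumes vLLM's AsyncEngineArgs helpers; the
# defaults below are illustrative.
import argparse

from vllm import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs

parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, default="lmsys/vicuna-7b-v1.3")
parser.add_argument("--num-gpus", type=int, default=1)
# Registers vLLM's own engine flags, including --quantization, on this parser.
parser = AsyncEngineArgs.add_cli_args(parser)
args = parser.parse_args()

# Map FastChat's flag names onto the attribute names vLLM expects.
args.model = args.model_path
if args.num_gpus > 1:
    args.tensor_parallel_size = args.num_gpus

# from_cli_args reads args.quantization (e.g. "awq") along with the other
# engine flags, so the value only needs to be present on the namespace.
engine_args = AsyncEngineArgs.from_cli_args(args)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

If `--quantization` is omitted, `from_cli_args` leaves it at vLLM's default (`None`) and the model is loaded unquantized.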

0 commit comments