[UPDATE] update mddocs/DockerGuides/vllm_docker_quickstart.md #12166
Conversation
| Parameter | Description |
|:---|:---|
| `model="YOUR_MODEL"` | the model path in docker, for example "/llm/models/Llama-2-7b-chat-hf" |
| `load_in_low_bit="fp8"` | model quantization accuracy, acceptable `fp8`, `fp6`, `sym_int4`, default is `fp8` |
| `tensor_parallel_size=1` | number of graphics cards used by the model, default is `1` |
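For context, here is a minimal sketch of how these parameters could be passed when constructing the offline engine. The `IPEXLLMClass` import path is an assumption about the ipex-llm vLLM wrapper and may not match the shipped `vllm_offline_inference.py` exactly.

```python
# Minimal sketch, not the shipped script. The ipex-llm wrapper import below is
# an assumption; `load_in_low_bit` is the ipex-llm-specific extension, the rest
# are standard vLLM engine arguments.
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM  # assumed import path

llm = LLM(
    model="/llm/models/Llama-2-7b-chat-hf",  # model path inside the container
    load_in_low_bit="fp8",                   # quantization precision: fp8 / fp6 / sym_int4
    tensor_parallel_size=1,                  # number of GPUs the model is sharded across
    enforce_eager=True,
)
```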
Should we also mention `pp` here? The explanation may not be entirely accurate; the official one is "Number of tensor parallel replicas". You could refer to the docs here.
-pp 2
```
3. **TP+PP Serving**: using tensor-parallel and pipeline-parallel together, for example, using 2 cards for tp and 2 cards for pp serving, add the following parameter:
This may be confusing. It's not a 2 tp + 2 tp architecture, but more like a 2 pp * (2 tp) architecture. Could refer to the description here: https://docs.vllm.ai/en/stable/serving/distributed_serving.html
fixed
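To make the 2 pp * (2 tp) layout concrete, here is a hedged sketch of launching the OpenAI-compatible server across 4 cards. The entrypoint module and model path are assumptions; the docker image's own start script may wrap this differently.

```python
# Sketch only: 2 pipeline-parallel stages, each sharded over 2 tensor-parallel
# GPUs, i.e. 2 * 2 = 4 cards in total. The model path is hypothetical.
import subprocess

subprocess.run(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "/llm/models/Llama-2-13b-chat-hf",  # hypothetical path
        "--tensor-parallel-size", "2",                 # -tp: GPUs per pipeline stage
        "--pipeline-parallel-size", "2",               # -pp: number of pipeline stages
        "--port", "8000",
    ],
    check=True,
)
```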
### Quantization
Quantization reduces the model precision from FP16 to INT4, which effectively reduces the file size by about 70%. The main advantages are lower latency and memory usage.
> reduces the file size by about 70%

Is this about the model file size? In the vLLM entrypoints (`LLM`, `openai.api_server`, etc.) we still use the original model file rather than an ipex_llm-converted file; the quantization conversion happens in CPU memory.
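As a rough back-of-the-envelope check on the ~70% figure (which, per the comment above, is really about the in-memory weights rather than the file on disk), assuming 2 bytes per FP16 weight and roughly 0.5 bytes per INT4 weight plus a small overhead for scales:

```python
# Illustrative arithmetic only; real savings depend on group size, scale
# storage, and which layers are kept in higher precision.
params = 7e9                       # e.g. a 7B-parameter model
fp16_bytes = params * 2            # 2 bytes per FP16 weight
int4_bytes = params * 0.5 * 1.1    # ~0.5 bytes per INT4 weight + ~10% for scales

print(f"FP16 weights: {fp16_bytes / 1e9:.1f} GB")          # ~14 GB
print(f"INT4 weights: {int4_bytes / 1e9:.1f} GB")          # ~3.9 GB
print(f"Reduction:    {1 - int4_bytes / fp16_bytes:.0%}")  # ~72%
```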
Two scripts are provided in the docker image for model inference.

1. vLLM offline inference: `vllm_offline_inference.py`
What is the difference between this and line 63?
Line 211 focuses on quantization, while line 63 introduces the entire process of ipex-llm offline inference.
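For reference, a rough sketch of the flow such an offline-inference script typically follows, with the quantization option highlighted; the import path is an assumption and the actual `vllm_offline_inference.py` may differ.

```python
# Rough sketch of a quantized offline-inference run; not the shipped script.
from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM  # assumed import path

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(
    model="/llm/models/Llama-2-7b-chat-hf",  # model path inside the container
    load_in_low_bit="sym_int4",              # the quantization knob this section focuses on
    tensor_parallel_size=1,
)

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```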
| Model | Number of cards |
|:---|:---|
| Codegeex4-all-9b | 1 |
| Llama-2-13B | 2 |
| Qwen1.5-14b | 2 |
| Baichuan2-13B | 4 |
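A rough, illustrative way to reason about these card counts is the FP16 weight footprint per card (ignoring KV cache and activations); the parameter counts below are approximate.

```python
# Illustrative only: approximate FP16 weight memory per card, ignoring
# KV cache, activations, and low-bit quantization.
models = {
    "Codegeex4-all-9b": (9e9, 1),
    "Llama-2-13B": (13e9, 2),
    "Qwen1.5-14b": (14e9, 2),
    "Baichuan2-13B": (13e9, 4),
}
for name, (params, cards) in models.items():
    per_card_gb = params * 2 / cards / 1e9
    print(f"{name}: ~{per_card_gb:.1f} GB of FP16 weights per card on {cards} card(s)")
```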
Why does Baichuan2-13B need 4 GPUs?
BTW, there is still a known issue for Baichuan2-13B: the output is not correct.
LGTM
Description
Update the vLLM docker quick start document for vLLM 0.5.4.