[UPDATE] update mddocs/DockerGuides/vllm_docker_quickstart.md #12166
Conversation
| Parameter | Description |
|:---|:---|
| `model="YOUR_MODEL"` | the model path in docker, for example "/llm/models/Llama-2-7b-chat-hf" |
| `load_in_low_bit="fp8"` | model quantization accuracy, acceptable `fp8`, `fp6`, `sym_int4`, default is `fp8` |
| `tensor_parallel_size=1` | number of graphics cards used by the model, default is `1` |
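For context, here is a minimal sketch of how these parameters could be passed when constructing the offline engine. The `IPEXLLMClass` import path is an assumption about the ipex-llm vLLM wrapper and may not match the shipped `vllm_offline_inference.py` exactly.

```python
# Minimal sketch, not the shipped script. The ipex-llm wrapper import below is
# an assumption; `load_in_low_bit` is the ipex-llm-specific extension, the rest
# are standard vLLM engine arguments.
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM  # assumed import path

llm = LLM(
    model="/llm/models/Llama-2-7b-chat-hf",  # model path inside the container
    load_in_low_bit="fp8",                   # quantization precision: fp8 / fp6 / sym_int4
    tensor_parallel_size=1,                  # number of GPUs the model is sharded across
    enforce_eager=True,
)
```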
Should we also mention `pp` here? The explanation may not be entirely accurate; the official one is "Number of tensor parallel replicas". You could refer to the docs here.
-pp 2
```
3. **TP+PP Serving**: using tensor-parallel and pipeline-parallel together, for example, using 2 cards for tp and 2 cards for pp serving, add the following parameter:
This may be confusing. It's not a 2 tp + 2 tp architecture, but more like a 2 pp * (2 tp) architecture. Could refer to the description here: https://docs.vllm.ai/en/stable/serving/distributed_serving.html
fixed
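To make the 2 pp * (2 tp) layout concrete, here is a hedged sketch of launching the OpenAI-compatible server across 4 cards. The entrypoint module and model path are assumptions; the docker image's own start script may wrap this differently.

```python
# Sketch only: 2 pipeline-parallel stages, each sharded over 2 tensor-parallel
# GPUs, i.e. 2 * 2 = 4 cards in total. The model path is hypothetical.
import subprocess

subprocess.run(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "/llm/models/Llama-2-13b-chat-hf",  # hypothetical path
        "--tensor-parallel-size", "2",                 # -tp: GPUs per pipeline stage
        "--pipeline-parallel-size", "2",               # -pp: number of pipeline stages
        "--port", "8000",
    ],
    check=True,
)
```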
### Quantization
Quantization reduces the model precision from FP16 to INT4, which effectively reduces the file size by about 70%. The main advantages are lower latency and memory usage.
> reduces the file size by about 70%

Is this about the model file size? In the vLLM entrypoints (`LLM`, `openai.api_server`, etc.) we still use the original model file rather than an ipex_llm-converted file; the quantization conversion happens in CPU memory.
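As a rough back-of-the-envelope check on the ~70% figure (which, per the comment above, is really about the in-memory weights rather than the file on disk), assuming 2 bytes per FP16 weight and roughly 0.5 bytes per INT4 weight plus a small overhead for scales:

```python
# Illustrative arithmetic only; real savings depend on group size, scale
# storage, and which layers are kept in higher precision.
params = 7e9                       # e.g. a 7B-parameter model
fp16_bytes = params * 2            # 2 bytes per FP16 weight
int4_bytes = params * 0.5 * 1.1    # ~0.5 bytes per INT4 weight + ~10% for scales

print(f"FP16 weights: {fp16_bytes / 1e9:.1f} GB")          # ~14 GB
print(f"INT4 weights: {int4_bytes / 1e9:.1f} GB")          # ~3.9 GB
print(f"Reduction:    {1 - int4_bytes / fp16_bytes:.0%}")  # ~72%
```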
Two scripts are provided in the docker image for model inference.

1. vLLM offline inference: `vllm_offline_inference.py`
What is the difference between this and line 63?
Line 211 focuses on quantization, while line 63 introduces the entire process of ipex-llm offline inference.
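For reference, a rough sketch of the flow such an offline-inference script typically follows, with the quantization option highlighted; the import path is an assumption and the actual `vllm_offline_inference.py` may differ.

```python
# Rough sketch of a quantized offline-inference run; not the shipped script.
from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM  # assumed import path

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(
    model="/llm/models/Llama-2-7b-chat-hf",  # model path inside the container
    load_in_low_bit="sym_int4",              # the quantization knob this section focuses on
    tensor_parallel_size=1,
)

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```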
| Model | Number of cards |
|:---|:---|
| Codegeex4-all-9b | 1 |
| Llama-2-13B | 2 |
| Qwen1.5-14b | 2 |
| Baichuan2-13B | 4 |
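A rough, illustrative way to reason about these card counts is the FP16 weight footprint per card (ignoring KV cache and activations); the parameter counts below are approximate.

```python
# Illustrative only: approximate FP16 weight memory per card, ignoring
# KV cache, activations, and low-bit quantization.
models = {
    "Codegeex4-all-9b": (9e9, 1),
    "Llama-2-13B": (13e9, 2),
    "Qwen1.5-14b": (14e9, 2),
    "Baichuan2-13B": (13e9, 4),
}
for name, (params, cards) in models.items():
    per_card_gb = params * 2 / cards / 1e9
    print(f"{name}: ~{per_card_gb:.1f} GB of FP16 weights per card on {cards} card(s)")
```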
Why does Baichuan2-13B need 4 GPUs?
BTW, there is still a known issue for Baichuan2-13B: the output is not correct.
LGTM
Description
Update the vLLM docker quick start document for vLLM 0.5.4.