
Conversation

ACupofAir (Collaborator)

Description

Update the vLLM Docker quickstart document for vLLM 0.5.4.

| Parameter | Meaning |
|:---|:---|
|`model="YOUR_MODEL"`| the model path inside the container, for example "/llm/models/Llama-2-7b-chat-hf"|
|`load_in_low_bit="fp8"`| model quantization precision; accepted values are `fp8`, `fp6`, `sym_int4`; default is `fp8`|
|`tensor_parallel_size=1`| number of graphics cards used by the model; default is `1`|
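
For a concrete picture of where these parameters go, here is a minimal sketch of an offline-inference call. The import path and the extra engine arguments below are assumptions (the authoritative version is `vllm_offline_inference.py` inside the image); only the three parameters from the table are taken from the document.

```python
# Minimal sketch, not the exact script shipped in the image.
from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM  # assumed import path

llm = LLM(
    model="/llm/models/Llama-2-7b-chat-hf",  # model path inside the container
    device="xpu",                            # assumed: run on an Intel GPU
    load_in_low_bit="fp8",                   # quantization precision: fp8 / fp6 / sym_int4
    tensor_parallel_size=1,                  # number of cards used by the model
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["What is AI?"], sampling_params)
print(outputs[0].outputs[0].text)
```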
xiangyuT (Contributor) commented on Oct 9, 2024


Should we also mention pp here? The explanation may not be entirely accurate; the official one is "Number of tensor parallel replicas". You could refer to the docs here.

-pp 2
```

3. **TP+PP Serving**: use tensor parallel and pipeline parallel together; for example, to use 2 cards for TP and 2 cards for PP serving, add the following parameters:
Contributor

This may be confusing. It's not a 2 TP + 2 PP architecture but more like a 2 PP × (2 TP) architecture. You could refer to the description here: https://docs.vllm.ai/en/stable/serving/distributed_serving.html

ACupofAir (Collaborator, Author)

fixed


### Quantization

Quantizing the model reduces its precision from FP16 to INT4, which effectively reduces the file size by about 70%. The main advantages are lower latency and memory usage.
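
As a rough back-of-envelope check of that figure (an estimate added here, not from the document): FP16 stores 2 bytes per weight, while 4-bit quantization stores about 0.5 bytes per weight plus a small overhead for scales, so for a 7B-parameter model

$$
7\times10^{9}\ \text{params}\times 2\ \text{bytes}\approx 14\ \text{GB}
\quad\longrightarrow\quad
7\times10^{9}\ \text{params}\times 0.5\ \text{bytes}\approx 3.5\ \text{GB},
$$

i.e. roughly a 70-75% reduction in weight storage.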
Contributor

Is "reduces the file size by about 70%" referring to the model file size? In the vLLM entrypoints (LLM, openai.api_server, etc.) we are still using the original model file instead of the ipex_llm-converted file; the quantization conversion occurs in CPU memory.


Two scripts are provided in the docker image for model inference.

1. vllm offline inference: `vllm_offline_inference.py`
Contributor

What is the difference between this and line 63?

ACupofAir (Collaborator, Author)

Line 211 focuses on quantization; line 63 introduces the entire process of IPEX-LLM offline inference.

| Model | Number of cards |
|:---|:---|
| Codegeex4-all-9b | 1 |
| Llama-2-13B | 2 |
| Qwen1.5-14b | 2 |
| Baichuan2-13B | 4 |
Contributor

Why does Baichuan2-13B need 4 GPUs?

Contributor

BTW, there is still a known issue with Baichuan2-13B: the output is not correct.

xiangyuT (Contributor) left a comment

LGTM

xiangyuT merged commit 412cf8e into intel:main on Oct 9, 2024
