
Commit 66d33ca: "update"

1 parent 34628ba

File tree

3 files changed: +13 additions, -11 deletions

llm/docs/dcu_install.md

Lines changed: 1 addition & 1 deletion

@@ -64,4 +64,4 @@ cd -
 ```
 
 ### High-performance inference:
-The inference commands for Hygon are identical to the GPU inference commands; see the [LLM inference tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/inference.md).
+The inference commands for Hygon are identical to the GPU inference commands; see the [LLM inference tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md).

llm/docs/predict/inference.md

Lines changed: 11 additions & 9 deletions

@@ -17,7 +17,7 @@ PaddleNLP LLM inference provides a full compression, inference, and serving pipeline:
 
 - Provides multiple PTQ techniques, with flexibly configurable WAC (weight/activation/cache) quantization supporting INT8, FP8, and 4-bit
 
-- Supports LLM inference on multiple hardware backends, including [Kunlun XPU](../../../csrc/xpu/README.md), [Ascend NPU](../../npu/llama/README.md), [Hygon DCU](../dcu_install.md), [Hygon K100](), [Enflame GCU](), [X86 CPU](../../../csrc/cpu/README.md)
+- Supports LLM inference on multiple hardware backends, including [Kunlun XPU](../../xpu/llama/README.md), [Ascend NPU](../../npu/llama/README.md), [Hygon K100](../dcu_install.md), [Enflame GCU](../../gcu/llama/README.md), [X86 CPU](../../../csrc/cpu/README.md)
 
 - Provides server-oriented deployment, supporting continuous batching and streaming output, with HTTP, RPC, and RESTful client interfaces

@@ -47,13 +47,13 @@ High-performance inference model implementations have been added to PaddleNLP, supporting:
 
 PaddleNLP supports multiple hardware platforms and precisions, including:
 
-| Precision | Ada | Ampere | Turing | Volta | GCU | XPU | NPU | DCU | K100 | x86 CPU |
-|----------------|-----|--------|--------|-------|-----|-----|-----|-----|------|----------|
-| FP32 |||||||||| |
-| FP16 |||||||||| |
-| BF16 |||||||||| |
-| INT8 |||||||| |||
-| FP8 |||||||||| |
+| Precision | Ada | Ampere | Turing | Volta | Kunlun XPU | Ascend NPU | Hygon K100 | Enflame GCU | x86 CPU |
+|----------------|-----|--------|--------|-------|------------|------------|------------|-------------|---------|
+| FP32 || | | | | | | | |
+| FP16 || | | | | | | | |
+| BF16 || | | | | | | | |
+| INT8 || | || | || | |
+| FP8 || | | | | | | | |
 
 ## 3. Inference parameters

@@ -186,7 +186,9 @@ python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --
 
 For more LLM inference tutorials, see:
 
-- Examples under [llm/docs/predict](./)
+- [llama](./llama.md)
+- [qwen](./qwen.md)
+- [mixtral](./mixtral.md)
 
 For environment setup, see:
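The edits above are mostly relative-link corrections inside Markdown docs. A minimal link-check sketch like the following can catch such stale relative targets before they ship; the `stale_links` helper is hypothetical and not part of PaddleNLP:

```python
import os
import re

# Match Markdown links of the form [text](target); empty targets are skipped.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")

def stale_links(md_path):
    """Return relative link targets in md_path that do not resolve on disk."""
    base = os.path.dirname(md_path)
    with open(md_path, encoding="utf-8") as f:
        text = f.read()
    bad = []
    for target in LINK_RE.findall(text):
        if target.startswith(("http://", "https://", "#")):
            continue  # external URLs and in-page anchors are out of scope
        if not os.path.exists(os.path.join(base, target)):
            bad.append(target)
    return bad
```

Running such a check over `llm/docs/` would have flagged the old `../dcu_install.md`-style paths this commit repairs.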

llm/docs/predict/llama.md

Lines changed: 1 addition & 1 deletion

@@ -115,6 +115,6 @@ generation_config = GenerationConfig.from_pretrained("meta-llama/Meta-Llama-3.1-
 python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3.1-405B-Instruct --output_path /path/to/a8w8c8_tp8 --inference_model 1 --block_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static --use_fake_parameter 1
 
 # Inference
-python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/predictor.py --model_name_or_path /path/to/a8w8c8_tp8 --mode static --inference_model 1 --block_attn 1 --dtype bfloat16
+python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/predictor.py --model_name_or_path /path/to/a8w8c8_tp8 --mode static --inference_model 1 --block_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static
 ```
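The llama.md fix makes the static-mode predictor command repeat the quantization flags used at export time. A small sketch of that invariant follows; the helper names and flag list are illustrative assumptions, not PaddleNLP API:

```python
import shlex

# Flags that should agree between export_model.py and predictor.py in static
# mode (per the fix above: quantization settings matter at load time too).
QUANT_FLAGS = ("--quant_type", "--cachekv_int8_type")

def flag_value(cmd, flag):
    """Return the value following `flag` in a shell command, or None."""
    tokens = shlex.split(cmd)
    return tokens[tokens.index(flag) + 1] if flag in tokens else None

def quant_flags_match(export_cmd, infer_cmd):
    """True when every quantization flag set at export is repeated at inference."""
    return all(
        flag_value(export_cmd, f) == flag_value(infer_cmd, f) for f in QUANT_FLAGS
    )
```

Under this check, the pre-fix predictor command (no `--quant_type`, no `--cachekv_int8_type`) fails against the export command, while the corrected one passes.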
