
Commit 66d33ca: "update"

1 parent 34628ba

File tree

3 files changed: +13 additions, -11 deletions

llm/docs/dcu_install.md

Lines changed: 1 addition & 1 deletion

@@ -64,4 +64,4 @@ cd -
 ```
 
 ### High-performance inference:
-The inference commands for Hygon are identical to the GPU inference commands; see the [LLM inference tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/inference.md).
+The inference commands for Hygon are identical to the GPU inference commands; see the [LLM inference tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md).

llm/docs/predict/inference.md

Lines changed: 11 additions & 9 deletions

@@ -17,7 +17,7 @@ PaddleNLP LLM inference provides a full compression, inference, and serving pipeline:
 
 - Provides multiple PTQ techniques, with flexibly configurable WAC (weight/activation/cache) quantization supporting INT8, FP8, and 4-bit
 
-- Supports LLM inference on multiple hardware backends, including [Kunlun XPU](../../../csrc/xpu/README.md), [Ascend NPU](../../npu/llama/README.md), [Hygon DCU](../dcu_install.md), [Hygon K100](), [Enflame GCU](), [X86 CPU](../../../csrc/cpu/README.md)
+- Supports LLM inference on multiple hardware backends, including [Kunlun XPU](../../xpu/llama/README.md), [Ascend NPU](../../npu/llama/README.md), [Hygon K100](../dcu_install.md), [Enflame GCU](../../gcu/llama/README.md), [X86 CPU](../../../csrc/cpu/README.md)
 
 - Provides server-oriented deployment, supporting continuous batching and streaming output, with HTTP, RPC, and RESTful client interfaces

@@ -47,13 +47,13 @@ High-performance inference model implementations have been added to PaddleNLP, supporting:
 
 PaddleNLP supports multiple hardware platforms and precisions, including:
 
-| Precision | Ada | Ampere | Turing | Volta | GCU | XPU | NPU | DCU | K100 | x86 CPU |
-|----------------|-----|--------|--------|-------|-----|-----|-----|-----|------|----------|
-| FP32 |||||||||| |
-| FP16 |||||||||| |
-| BF16 |||||||||| |
-| INT8 |||||||| |||
-| FP8 |||||||||| |
+| Precision | Ada | Ampere | Turing | Volta | Kunlun XPU | Ascend NPU | Hygon K100 | Enflame GCU | x86 CPU |
+|----------------|-----|--------|--------|-------|------------|------------|------------|-------------|---------|
+| FP32 || | | | | | | | |
+| FP16 || | | | | | | | |
+| BF16 || | | | | | | | |
+| INT8 || | || | || | |
+| FP8 || | | | | | | | |
 
 ## 3. Inference parameters

@@ -186,7 +186,9 @@ python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --
 
 For more LLM inference tutorials, see:
 
-- Examples under [llm/docs/predict](./)
+- [llama](./llama.md)
+- [qwen](./qwen.md)
+- [mixtral](./mixtral.md)
 
 For environment setup, see:
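The edits above are mostly relative-link corrections inside Markdown docs. A minimal link-check sketch like the following can catch such stale relative targets before they ship; the `stale_links` helper is hypothetical and not part of PaddleNLP:

```python
import os
import re

# Match Markdown links of the form [text](target); empty targets are skipped.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")

def stale_links(md_path):
    """Return relative link targets in md_path that do not resolve on disk."""
    base = os.path.dirname(md_path)
    with open(md_path, encoding="utf-8") as f:
        text = f.read()
    bad = []
    for target in LINK_RE.findall(text):
        if target.startswith(("http://", "https://", "#")):
            continue  # external URLs and in-page anchors are out of scope
        if not os.path.exists(os.path.join(base, target)):
            bad.append(target)
    return bad
```

Running such a check over `llm/docs/` would have flagged the old `../dcu_install.md`-style paths this commit repairs.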

llm/docs/predict/llama.md

Lines changed: 1 addition & 1 deletion

@@ -115,6 +115,6 @@ generation_config = GenerationConfig.from_pretrained("meta-llama/Meta-Llama-3.1-
 python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3.1-405B-Instruct --output_path /path/to/a8w8c8_tp8 --inference_model 1 --block_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static --use_fake_parameter 1
 
 # Inference
-python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/predictor.py --model_name_or_path /path/to/a8w8c8_tp8 --mode static --inference_model 1 --block_attn 1 --dtype bfloat16
+python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/predictor.py --model_name_or_path /path/to/a8w8c8_tp8 --mode static --inference_model 1 --block_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static
 ```
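The llama.md fix makes the static-mode predictor command repeat the quantization flags used at export time. A small sketch of that invariant follows; the helper names and flag list are illustrative assumptions, not PaddleNLP API:

```python
import shlex

# Flags that should agree between export_model.py and predictor.py in static
# mode (per the fix above: quantization settings matter at load time too).
QUANT_FLAGS = ("--quant_type", "--cachekv_int8_type")

def flag_value(cmd, flag):
    """Return the value following `flag` in a shell command, or None."""
    tokens = shlex.split(cmd)
    return tokens[tokens.index(flag) + 1] if flag in tokens else None

def quant_flags_match(export_cmd, infer_cmd):
    """True when every quantization flag set at export is repeated at inference."""
    return all(
        flag_value(export_cmd, f) == flag_value(infer_cmd, f) for f in QUANT_FLAGS
    )
```

Under this check, the pre-fix predictor command (no `--quant_type`, no `--cachekv_int8_type`) fails against the export command, while the corrected one passes.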
