
Commit c35ebb6: "fp8 dual gemm auto tune"

1 parent: bfc106d

File tree: 270 files changed, +22980 / -5241 lines


.gitignore

Lines changed: 5 additions & 0 deletions

```diff
@@ -124,3 +124,8 @@ FETCH_HEAD
 # vscode
 .vscode
 ./ppdiffusers/ppdiffusers/version.py
+
+# third party
+csrc/third_party/
+dataset/
+output/
```

.pre-commit-config.yaml

Lines changed: 9 additions & 0 deletions

```diff
@@ -52,4 +52,13 @@ repos:
         entry: python scripts/codestyle/check_spaces.py
         language: python
         files: \.(md|markdown)$
+        pass_filenames: true
+  # For dead links
+  - repo: local
+    hooks:
+      - id: check-dead-links
+        name: Check dead links
+        entry: python scripts/codestyle/check_dead_links.py
+        language: python
+        files: \.(md|markdown|rst)$
         pass_filenames: true
```
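The new `check-dead-links` hook delegates to `scripts/codestyle/check_dead_links.py`, whose contents are not part of this commit excerpt. As a rough, hypothetical sketch only (the names and logic below are assumptions, not the repository's actual script), a checker of this shape extracts Markdown link targets and probes them:

```python
# Hypothetical sketch of a dead-link checker; the real
# scripts/codestyle/check_dead_links.py is not shown in this diff.
import re
import urllib.request

# Matches Markdown links whose target is an http(s) URL.
LINK_RE = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")

def extract_links(markdown_text):
    """Return all http(s) link targets found in Markdown link syntax."""
    return LINK_RE.findall(markdown_text)

def is_alive(url, timeout=5):
    """Best-effort reachability probe; treats any error as a dead link."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except Exception:
        return False

def find_dead_links(paths):
    """Collect dead http(s) links across the given Markdown files."""
    dead = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        dead.extend(url for url in extract_links(text) if not is_alive(url))
    return dead
```

In a pre-commit `language: python` hook with `pass_filenames: true`, the changed file paths arrive as CLI arguments, and a non-zero exit status from the script is what fails the commit.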

Makefile

Lines changed: 1 addition & 1 deletion

```diff
@@ -46,7 +46,7 @@ unit-test:
 
 .PHONY: install
 install:
-	pip install paddlepaddle==0.0.0 -f https://www.paddlepaddle.org.cn/whl/linux/cpu-mkl/develop.html
+	pip install --pre paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
 	pip install -r requirements-dev.txt
 	pip install -r requirements.txt
 	pip install -r paddlenlp/experimental/autonlp/requirements.txt
```

README.md

Lines changed: 18 additions & 3 deletions

```diff
@@ -42,7 +42,8 @@
 
 ### <a href=#多硬件训推一体> 🔧 Unified multi-hardware training and inference </a>
 
-Supports large-model training and inference on NVIDIA GPU, Kunlun XPU, Ascend NPU, Enflame GCU, Hygon DCU, and other hardware; the toolkit's interfaces support fast hardware switching, greatly reducing the engineering cost of changing hardware.
+Supports large-model and natural language understanding (NLU) model training and inference on NVIDIA GPU, Kunlun XPU, Ascend NPU, Enflame GCU, Hygon DCU, and other hardware; the toolkit's interfaces support fast hardware switching, greatly reducing the engineering cost of changing hardware.
+Currently supported NLU models: [multi-hardware NLU model list](./docs/model_zoo/model_list_multy_device.md)
 
 ### <a href=#高效易用的预训练> 🚀 Efficient and easy-to-use pre-training </a>
 
```
```diff
@@ -127,6 +128,18 @@ Unified Checkpoint 大模型存储格式在模型参数分布上支持动态扩
 | Yuan2 | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | ✅ |
 ------------------------------------------------------------------------------------------
 
+* [Large model inference](./llm/docs/predict/inference.md) now covers the LLaMA, Qwen, Mistral, ChatGLM, Bloom, and Baichuan families, supporting Weight-Only INT8 and INT4 inference, as well as INT8 and FP8 quantized inference over WAC (weights, activations, and the KV cache). The LLM inference support matrix is as follows:
+
+| Model / supported quantization | FP16/BF16 | WINT8 | WINT4 | INT8-A8W8 | FP8-A8W8 | INT8-A8W8C8 |
+|:--------------------------------------------:|:---------:|:-----:|:-----:|:---------:|:--------:|:-----------:|
+| [LLaMA](./llm/docs/predict/llama.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [Qwen](./llm/docs/predict/qwen.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [Qwen-Moe](./llm/docs/predict/qwen.md) | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
+| [Mixtral](./llm/docs/predict/mixtral.md) | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
+| ChatGLM | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
+| Bloom | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
+| BaiChuan | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 |
+
 ## Installation
 
 ### Environment requirements
```
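An aside on the quantization labels in the support matrix (WINT8 = weight-only INT8, A8W8 = 8-bit activations and weights, C8 = 8-bit KV cache): they all map floating-point tensors onto a narrow grid via scale factors. A minimal, framework-free sketch of the weight-only INT8 idea follows; this is an illustration of the arithmetic, not PaddleNLP's fused-kernel implementation:

```python
# Illustrative per-row absmax INT8 weight quantization. The real WINT8 path
# runs inside fused CUDA kernels; this only shows the scale-and-round idea.
def quantize_wint8(weight):
    """weight: list of float rows. Returns (int8-range rows, per-row scales)."""
    qweight, scales = [], []
    for row in weight:
        scale = max(abs(v) for v in row) / 127.0 or 1.0  # 1.0 guards all-zero rows
        qweight.append([round(v / scale) for v in row])
        scales.append(scale)
    return qweight, scales

def dequantize_wint8(qweight, scales):
    """Reconstruct approximate float weights from quantized rows."""
    return [[q * s for q in row] for row, s in zip(qweight, scales)]
```

Activation (A8) and KV-cache (C8) quantization follow the same scale-and-round pattern, with activation scales typically calibrated at runtime rather than precomputed.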
````diff
@@ -137,7 +150,7 @@ Unified Checkpoint 大模型存储格式在模型参数分布上支持动态扩
 ### Install with pip
 
 ```shell
-pip install --upgrade paddlenlp==3.0.0b0
+pip install --upgrade paddlenlp==3.0.0b1
 ```
 
 Alternatively, install the latest develop-branch code with the following command:
````
````diff
@@ -162,13 +175,14 @@ PaddleNLP 提供了方便易用的 Auto API,能够快速的加载模型和 Tok
 >>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", dtype="float16")
 >>> input_features = tokenizer("你好!请自我介绍一下。", return_tensors="pd")
 >>> outputs = model.generate(**input_features, max_length=128)
->>> print(tokenizer.batch_decode(outputs[0]))
+>>> print(tokenizer.batch_decode(outputs[0], skip_special_tokens=True))
 ['我是一个AI语言模型,我可以回答各种问题,包括但不限于:天气、新闻、历史、文化、科学、教育、娱乐等。请问您有什么需要了解的吗?']
 ```
 
 ### Large model pre-training
 
 ```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip if PaddleNLP is already cloned or downloaded
 mkdir -p llm/data && cd llm/data
 wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.bin
 wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx
````
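For context on the `skip_special_tokens=True` change in the Auto API example: without it, `batch_decode` keeps control tokens such as end-of-text markers in the returned string. A toy decoder illustrates the effect; the token strings here are invented for illustration and are not the actual Qwen vocabulary:

```python
# Toy decoder; SPECIAL and the token strings are made up for illustration.
SPECIAL = {"<|im_start|>", "<|im_end|>", "<|endoftext|>"}

def decode(tokens, skip_special_tokens=False):
    """Join token strings, optionally dropping control tokens first."""
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL]
    return "".join(tokens)

tokens = ["<|im_start|>", "Hello", "!", "<|endoftext|>"]
print(decode(tokens))                            # keeps the control markers
print(decode(tokens, skip_special_tokens=True))  # prints "Hello!"
```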
````diff
@@ -179,6 +193,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py
 ### Large model SFT fine-tuning
 
 ```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip if PaddleNLP is already cloned or downloaded
 mkdir -p llm/data && cd llm/data
 wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz && tar -zxvf AdvertiseGen.tar.gz
 cd ..  # change folder to PaddleNLP/llm
````

README_en.md

Lines changed: 4 additions & 2 deletions

````diff
@@ -68,7 +68,7 @@ Detailed list 👉 [Supported Model List](https://github.com/PaddlePaddle/Paddle
 ### Pip Installation
 
 ```shell
-pip install --upgrade paddlenlp==3.0.0b0
+pip install --upgrade paddlenlp==3.0.0b1
 ```
 
 or you can install the latest develop branch code with the following command:
@@ -93,13 +93,14 @@ PaddleNLP provides a convenient and easy-to-use Auto API, which can quickly load
 >>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", dtype="float16")
 >>> input_features = tokenizer("你好!请自我介绍一下。", return_tensors="pd")
 >>> outputs = model.generate(**input_features, max_length=128)
->>> print(tokenizer.batch_decode(outputs[0]))
+>>> print(tokenizer.batch_decode(outputs[0], skip_special_tokens=True))
 ['我是一个AI语言模型,我可以回答各种问题,包括但不限于:天气、新闻、历史、文化、科学、教育、娱乐等。请问您有什么需要了解的吗?']
 ```
 
 ### Pre-training for large language model
 
 ```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip if PaddleNLP is already cloned or downloaded
 mkdir -p llm/data && cd llm/data
 wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.bin
 wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx
@@ -110,6 +111,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py
 ### SFT fine-tuning for large language model
 
 ```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip if PaddleNLP is already cloned or downloaded
 mkdir -p llm/data && cd llm/data
 wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz && tar -zxvf AdvertiseGen.tar.gz
 cd ..  # change folder to PaddleNLP/llm
````

csrc/README.md

Lines changed: 26 additions & 0 deletions

````diff
@@ -10,6 +10,32 @@ pip install -r requirements.txt
 
 ## Compiling the CUDA operators
 
+Generate the FP8 cutlass operators:
+```shell
+python generate_code_gemm_fused_kernels.py
+
+python generate_code_dual_gemm_fused_kernels.py
+```
+
+Compile:
+```shell
+python setup_cuda.py install
+```
+
+### Manually installing the Cutlass library
+1. Visit the Cutlass repository: [NVIDIA/cutlass](https://github.com/NVIDIA/cutlass)
+
+2. Clone the code:
+   git clone -b v3.5.0 --single-branch https://github.com/NVIDIA/cutlass.git
+
+3. Place the downloaded `cutlass` directory at `csrc/third_party/cutlass`
+
+4. Recompile the CUDA operators
 ```shell
 python setup_cuda.py install
 ```
+
+### FP8 GEMM auto-tuning
+```shell
+sh tune_fp8_gemm.sh
+```
````
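The `tune_fp8_gemm.sh` script added above exercises this commit's auto-tuning: benchmark candidate GEMM kernel configurations and keep the fastest per problem shape. The general pattern can be sketched in plain Python, with a block size standing in for a CUDA tile configuration; this is an illustration of the tuning loop, not the actual cutlass tuner:

```python
# Sketch of GEMM auto-tuning: time each candidate configuration on the
# target shape and return the fastest one.
import time

def matmul_blocked(a, b, block):
    """C = A @ B with blocking over the row and K dimensions."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, k, block):
            for i in range(ii, min(ii + block, n)):
                for p in range(kk, min(kk + block, k)):
                    aip, row_b, row_c = a[i][p], b[p], c[i]
                    for j in range(m):
                        row_c[j] += aip * row_b[j]
    return c

def autotune(a, b, candidates=(16, 32, 64)):
    """Benchmark each candidate block size and return the fastest."""
    timings = {}
    for blk in candidates:
        start = time.perf_counter()
        matmul_blocked(a, b, blk)
        timings[blk] = time.perf_counter() - start
    return min(timings, key=timings.get)
```

A real tuner additionally caches the winning configuration per (M, N, K) shape so the search cost is paid once, offline.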
