
Commit c35ebb6: "fp8 dual gemm auto tune"

1 parent: bfc106d

File tree: 270 files changed, +22980 / -5241 lines


.gitignore

Lines changed: 5 additions & 0 deletions

```diff
@@ -124,3 +124,8 @@ FETCH_HEAD
 # vscode
 .vscode
 ./ppdiffusers/ppdiffusers/version.py
+
+# third party
+csrc/third_party/
+dataset/
+output/
```

.pre-commit-config.yaml

Lines changed: 9 additions & 0 deletions

```diff
@@ -52,4 +52,13 @@ repos:
         entry: python scripts/codestyle/check_spaces.py
         language: python
         files: \.(md|markdown)$
+        pass_filenames: true
+  # For dead links
+  - repo: local
+    hooks:
+      - id: check-dead-links
+        name: Check dead links
+        entry: python scripts/codestyle/check_dead_links.py
+        language: python
+        files: \.(md|markdown|rst)$
         pass_filenames: true
```
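The new `check-dead-links` hook delegates to `scripts/codestyle/check_dead_links.py`, whose contents are not part of this commit excerpt. As a rough, hypothetical sketch only (the names and logic below are assumptions, not the repository's actual script), a checker of this shape extracts Markdown link targets and probes them:

```python
# Hypothetical sketch of a dead-link checker; the real
# scripts/codestyle/check_dead_links.py is not shown in this diff.
import re
import urllib.request

# Matches Markdown links whose target is an http(s) URL.
LINK_RE = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")

def extract_links(markdown_text):
    """Return all http(s) link targets found in Markdown link syntax."""
    return LINK_RE.findall(markdown_text)

def is_alive(url, timeout=5):
    """Best-effort reachability probe; treats any error as a dead link."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except Exception:
        return False

def find_dead_links(paths):
    """Collect dead http(s) links across the given Markdown files."""
    dead = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        dead.extend(url for url in extract_links(text) if not is_alive(url))
    return dead
```

In a pre-commit `language: python` hook with `pass_filenames: true`, the changed file paths arrive as CLI arguments, and a non-zero exit status from the script is what fails the commit.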

Makefile

Lines changed: 1 addition & 1 deletion

```diff
@@ -46,7 +46,7 @@ unit-test:
 
 .PHONY: install
 install:
-	pip install paddlepaddle==0.0.0 -f https://www.paddlepaddle.org.cn/whl/linux/cpu-mkl/develop.html
+	pip install --pre paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
 	pip install -r requirements-dev.txt
 	pip install -r requirements.txt
 	pip install -r paddlenlp/experimental/autonlp/requirements.txt
```

README.md

Lines changed: 18 additions & 3 deletions

```diff
@@ -42,7 +42,8 @@
 
 ### <a href=#多硬件训推一体> 🔧 Unified multi-hardware training and inference </a>
 
-Supports large-model training and inference on NVIDIA GPU, Kunlun XPU, Ascend NPU, Enflame GCU, Hygon DCU, and other hardware; the toolkit's interfaces support fast hardware switching, greatly reducing the engineering cost of changing hardware.
+Supports large-model and natural language understanding (NLU) model training and inference on NVIDIA GPU, Kunlun XPU, Ascend NPU, Enflame GCU, Hygon DCU, and other hardware; the toolkit's interfaces support fast hardware switching, greatly reducing the engineering cost of changing hardware.
+Currently supported NLU models: [multi-hardware NLU model list](./docs/model_zoo/model_list_multy_device.md)
 
 ### <a href=#高效易用的预训练> 🚀 Efficient and easy-to-use pre-training </a>
 
```
```diff
@@ -127,6 +128,18 @@ Unified Checkpoint 大模型存储格式在模型参数分布上支持动态扩
 | Yuan2 | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | ✅ |
 ------------------------------------------------------------------------------------------
 
+* [Large model inference](./llm/docs/predict/inference.md) now covers the LLaMA, Qwen, Mistral, ChatGLM, Bloom, and Baichuan families, supporting Weight-Only INT8 and INT4 inference, as well as INT8 and FP8 quantized inference over WAC (weights, activations, and the KV cache). The LLM inference support matrix is as follows:
+
+| Model / supported quantization | FP16/BF16 | WINT8 | WINT4 | INT8-A8W8 | FP8-A8W8 | INT8-A8W8C8 |
+|:--------------------------------------------:|:---------:|:-----:|:-----:|:---------:|:--------:|:-----------:|
+| [LLaMA](./llm/docs/predict/llama.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [Qwen](./llm/docs/predict/qwen.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [Qwen-Moe](./llm/docs/predict/qwen.md) | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
+| [Mixtral](./llm/docs/predict/mixtral.md) | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
+| ChatGLM | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
+| Bloom | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
+| BaiChuan | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 |
+
 ## Installation
 
 ### Environment requirements
```
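An aside on the quantization labels in the support matrix (WINT8 = weight-only INT8, A8W8 = 8-bit activations and weights, C8 = 8-bit KV cache): they all map floating-point tensors onto a narrow grid via scale factors. A minimal, framework-free sketch of the weight-only INT8 idea follows; this is an illustration of the arithmetic, not PaddleNLP's fused-kernel implementation:

```python
# Illustrative per-row absmax INT8 weight quantization. The real WINT8 path
# runs inside fused CUDA kernels; this only shows the scale-and-round idea.
def quantize_wint8(weight):
    """weight: list of float rows. Returns (int8-range rows, per-row scales)."""
    qweight, scales = [], []
    for row in weight:
        scale = max(abs(v) for v in row) / 127.0 or 1.0  # 1.0 guards all-zero rows
        qweight.append([round(v / scale) for v in row])
        scales.append(scale)
    return qweight, scales

def dequantize_wint8(qweight, scales):
    """Reconstruct approximate float weights from quantized rows."""
    return [[q * s for q in row] for row, s in zip(qweight, scales)]
```

Activation (A8) and KV-cache (C8) quantization follow the same scale-and-round pattern, with activation scales typically calibrated at runtime rather than precomputed.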
````diff
@@ -137,7 +150,7 @@ Unified Checkpoint 大模型存储格式在模型参数分布上支持动态扩
 ### Install with pip
 
 ```shell
-pip install --upgrade paddlenlp==3.0.0b0
+pip install --upgrade paddlenlp==3.0.0b1
 ```
 
 Alternatively, install the latest develop-branch code with the following command:
````
````diff
@@ -162,13 +175,14 @@ PaddleNLP 提供了方便易用的 Auto API,能够快速的加载模型和 Tok
 >>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", dtype="float16")
 >>> input_features = tokenizer("你好!请自我介绍一下。", return_tensors="pd")
 >>> outputs = model.generate(**input_features, max_length=128)
->>> print(tokenizer.batch_decode(outputs[0]))
+>>> print(tokenizer.batch_decode(outputs[0], skip_special_tokens=True))
 ['我是一个AI语言模型,我可以回答各种问题,包括但不限于:天气、新闻、历史、文化、科学、教育、娱乐等。请问您有什么需要了解的吗?']
 ```
 
 ### Large model pre-training
 
 ```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip if PaddleNLP is already cloned or downloaded
 mkdir -p llm/data && cd llm/data
 wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.bin
 wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx
````
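For context on the `skip_special_tokens=True` change in the Auto API example: without it, `batch_decode` keeps control tokens such as end-of-text markers in the returned string. A toy decoder illustrates the effect; the token strings here are invented for illustration and are not the actual Qwen vocabulary:

```python
# Toy decoder; SPECIAL and the token strings are made up for illustration.
SPECIAL = {"<|im_start|>", "<|im_end|>", "<|endoftext|>"}

def decode(tokens, skip_special_tokens=False):
    """Join token strings, optionally dropping control tokens first."""
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL]
    return "".join(tokens)

tokens = ["<|im_start|>", "Hello", "!", "<|endoftext|>"]
print(decode(tokens))                            # keeps the control markers
print(decode(tokens, skip_special_tokens=True))  # prints "Hello!"
```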
````diff
@@ -179,6 +193,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py
 ### Large model SFT fine-tuning
 
 ```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip if PaddleNLP is already cloned or downloaded
 mkdir -p llm/data && cd llm/data
 wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz && tar -zxvf AdvertiseGen.tar.gz
 cd ..  # change folder to PaddleNLP/llm
````

README_en.md

Lines changed: 4 additions & 2 deletions

````diff
@@ -68,7 +68,7 @@ Detailed list 👉 [Supported Model List](https://github.com/PaddlePaddle/Paddle
 ### Pip Installation
 
 ```shell
-pip install --upgrade paddlenlp==3.0.0b0
+pip install --upgrade paddlenlp==3.0.0b1
 ```
 
 or you can install the latest develop branch code with the following command:
@@ -93,13 +93,14 @@ PaddleNLP provides a convenient and easy-to-use Auto API, which can quickly load
 >>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", dtype="float16")
 >>> input_features = tokenizer("你好!请自我介绍一下。", return_tensors="pd")
 >>> outputs = model.generate(**input_features, max_length=128)
->>> print(tokenizer.batch_decode(outputs[0]))
+>>> print(tokenizer.batch_decode(outputs[0], skip_special_tokens=True))
 ['我是一个AI语言模型,我可以回答各种问题,包括但不限于:天气、新闻、历史、文化、科学、教育、娱乐等。请问您有什么需要了解的吗?']
 ```
 
 ### Pre-training for large language model
 
 ```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip if PaddleNLP is already cloned or downloaded
 mkdir -p llm/data && cd llm/data
 wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.bin
 wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx
@@ -110,6 +111,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py
 ### SFT fine-tuning for large language model
 
 ```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip if PaddleNLP is already cloned or downloaded
 mkdir -p llm/data && cd llm/data
 wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz && tar -zxvf AdvertiseGen.tar.gz
 cd ..  # change folder to PaddleNLP/llm
````

csrc/README.md

Lines changed: 26 additions & 0 deletions

````diff
@@ -10,6 +10,32 @@ pip install -r requirements.txt
 
 ## Compiling the CUDA operators
 
+Generate the FP8 cutlass operators:
+```shell
+python generate_code_gemm_fused_kernels.py
+
+python generate_code_dual_gemm_fused_kernels.py
+```
+
+Compile:
+```shell
+python setup_cuda.py install
+```
+
+### Manually installing the Cutlass library
+1. Visit the Cutlass repository: [NVIDIA/cutlass](https://github.com/NVIDIA/cutlass)
+
+2. Clone the code:
+   git clone -b v3.5.0 --single-branch https://github.com/NVIDIA/cutlass.git
+
+3. Place the downloaded `cutlass` directory at `csrc/third_party/cutlass`
+
+4. Recompile the CUDA operators
 ```shell
 python setup_cuda.py install
 ```
+
+### FP8 GEMM auto-tuning
+```shell
+sh tune_fp8_gemm.sh
+```
````
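The `tune_fp8_gemm.sh` script added above exercises this commit's auto-tuning: benchmark candidate GEMM kernel configurations and keep the fastest per problem shape. The general pattern can be sketched in plain Python, with a block size standing in for a CUDA tile configuration; this is an illustration of the tuning loop, not the actual cutlass tuner:

```python
# Sketch of GEMM auto-tuning: time each candidate configuration on the
# target shape and return the fastest one.
import time

def matmul_blocked(a, b, block):
    """C = A @ B with blocking over the row and K dimensions."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, k, block):
            for i in range(ii, min(ii + block, n)):
                for p in range(kk, min(kk + block, k)):
                    aip, row_b, row_c = a[i][p], b[p], c[i]
                    for j in range(m):
                        row_c[j] += aip * row_b[j]
    return c

def autotune(a, b, candidates=(16, 32, 64)):
    """Benchmark each candidate block size and return the fastest."""
    timings = {}
    for blk in candidates:
        start = time.perf_counter()
        matmul_blocked(a, b, blk)
        timings[blk] = time.perf_counter() - start
    return min(timings, key=timings.get)
```

A real tuner additionally caches the winning configuration per (M, N, K) shape so the search cost is paid once, offline.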
