Commit 1fc23a8

ymyjl and ZHUI authored
[Tutorial] Add torch migration tutorial (#3641)
Co-authored-by: Zhong Hui <[email protected]>
1 parent afeb623 commit 1fc23a8

41 files changed, with 5615 additions and 0 deletions

examples/torch_migration/README.md

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# BERT-SST2-Prod
Reproduction process of BERT on SST2 dataset

# Installation

* Download the repository

```shell
git clone https://github.com/PaddlePaddle/PaddleNLP.git
```

* Enter the example folder and install the requirements

```shell
cd PaddleNLP/examples/torch_migration && pip install -r requirements.txt
```

* Install PaddlePaddle and PyTorch

```shell
# CPU build of PaddlePaddle
pip install paddlepaddle==2.2.0 -i https://mirror.baidu.com/pypi/simple
# To install the GPU build of PaddlePaddle instead, use the command below
# pip install paddlepaddle-gpu==2.2.0.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# Install PyTorch
pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
```

**Note**: this project depends on paddlepaddle 2.2.0, so make sure that exact version is installed.

* Verify that PaddlePaddle is installed correctly

Start Python and enter the following commands.

```python
import paddle
paddle.utils.run_check()
print(paddle.__version__)
```

If the output matches the following, PaddlePaddle is installed correctly.

```
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
2.2.0
```


* Verify that PyTorch is installed correctly

Start Python and enter the following commands. If they print without errors, torch is installed correctly.

```python
import torch
print(torch.__version__)
# For a CPU build, use the command below to confirm that torch works
# Expected output: tensor([1.])
print(torch.Tensor([1.0]))
# For a GPU build, use the command below to confirm that torch works
# Expected output: tensor([1.], device='cuda:0')
print(torch.Tensor([1.0]).cuda())
```
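The interactive checks above can also be scripted. The following is a minimal sketch that only compares installed version strings against the versions pinned in the install commands; `EXPECTED` and `check_pins` are hypothetical names introduced for illustration, and the sketch does not replace `paddle.utils.run_check()`.

```python
from importlib import metadata

# Pinned versions taken from the install commands above. Matching is by
# prefix, so local build tags such as "+cu113" are accepted.
EXPECTED = {"paddlepaddle": "2.2.0", "torch": "1.10.0"}


def check_pins(expected, version_of=metadata.version):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for package, pin in expected.items():
        try:
            installed = version_of(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        ok = installed is not None and installed.startswith(pin)
        report[package] = (installed, ok)
    return report


if __name__ == "__main__":
    for package, (installed, ok) in check_pins(EXPECTED).items():
        print(f"{package}: installed={installed} ok={ok}")
```

If the GPU build was installed, the distribution name is `paddlepaddle-gpu` rather than `paddlepaddle`, so the key in `EXPECTED` would need to change accordingly.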

examples/torch_migration/docs/ThesisReproduction_NLP.md

Lines changed: 928 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
# Usage


This section uses forward alignment as an example to walk through the alignment check flow based on the `reprod_log` tool. The parts that involve `reprod_log` are the ones developers need to add themselves.


```shell
# Run from the directory that contains pipeline/
# Generate the torch BERT model weights
(cd pipeline/weights/ && python torch_bert_weights.py)
# Convert the torch BERT model weights to paddle
(cd pipeline/weights/ && python torch2paddle.py)
# Generate the classifier weights
(cd pipeline/classifier_weights/ && python generate_classifier_weights.py)
# Enter the Step1 folder
cd pipeline/Step1/
# Generate the paddle forward data
python pd_forward_bert.py
# Generate the torch forward data
python pt_forward_bert.py
# Compare the two and generate the log
python check_step1.py
```
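The step order above can also be sketched as a small Python driver that runs each script with its own folder as the working directory, which avoids chaining `cd` commands. The folder and script names are the ones listed above; `STEPS` and `run_steps` are hypothetical names, and the `runner` parameter exists only so the function can be exercised without launching real processes.

```python
import subprocess
import sys
from pathlib import Path

# (folder relative to the project root, script name), in the order listed above.
STEPS = [
    ("pipeline/weights", "torch_bert_weights.py"),
    ("pipeline/weights", "torch2paddle.py"),
    ("pipeline/classifier_weights", "generate_classifier_weights.py"),
    ("pipeline/Step1", "pd_forward_bert.py"),
    ("pipeline/Step1", "pt_forward_bert.py"),
    ("pipeline/Step1", "check_step1.py"),
]


def run_steps(root, steps=STEPS, runner=subprocess.run):
    """Run each script from inside its own folder, stopping at the first failure."""
    for folder, script in steps:
        # check=True makes subprocess.run raise if a script exits with an error.
        runner([sys.executable, script], cwd=Path(root) / folder, check=True)
```

Calling `run_steps(".")` from the directory that contains `pipeline/` would execute the six scripts in order.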

Specifically, taking PaddlePaddle as an example, the code of `pd_forward_bert.py` is shown below.

```python
import os
import sys

import numpy as np
import paddle
# Import the ReprodLogger class from reprod_log
from reprod_log import ReprodLogger

CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0]  # current directory
config_path = CURRENT_DIR.rsplit('/', 1)[0]
sys.path.append(config_path)
from models.pd_bert import *

reprod_logger = ReprodLogger()

# Build the network and load the BertModel weights
paddle_dump_path = '../weights/paddle_weight.pdparams'
config = BertConfig()
model = BertForSequenceClassification(config)
checkpoint = paddle.load(paddle_dump_path)
model.bert.load_dict(checkpoint)

# Load the classifier weights
classifier_weights = paddle.load(
    "../classifier_weights/paddle_classifier_weights.bin")
model.load_dict(classifier_weights)
model.eval()
# Read the fake data and convert it to a tensor; the fake data could also be
# generated on the fly with a fixed seed
fake_data = np.load("../fake_data/fake_data.npy")
fake_data = paddle.to_tensor(fake_data)
# Model forward; [0] selects the logits, as in the committed pd_forward_bert.py
out = model(fake_data)[0]
# Save the forward result; developers need to add this for each task
reprod_logger.add("logits", out.cpu().detach().numpy())
reprod_logger.save("forward_paddle.npy")
```

For the diff check, see [check_step1.py](./check_step1.py); the code is shown below.

```python
# https://github.com/littletomatodonkey/AlexNet-Prod/blob/master/pipeline/Step1/check_step1.py
# Use reprod_log to track down the diff
from reprod_log import ReprodDiffHelper
if __name__ == "__main__":
    diff_helper = ReprodDiffHelper()
    torch_info = diff_helper.load_info("./forward_torch.npy")
    paddle_info = diff_helper.load_info("./forward_paddle.npy")
    diff_helper.compare_info(torch_info, paddle_info)
    diff_helper.report(path="forward_diff.log")
```

The resulting log is shown below; the check result is also saved to `forward_diff.log`.

```
[2021/11/17 20:15:50] root INFO: logits:
[2021/11/17 20:15:50] root INFO: mean diff: check passed: True, value: 1.30385160446167e-07
[2021/11/17 20:15:50] root INFO: diff check passed
```

The mean absolute error is 1.3e-7, so the check passes.
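
The pass/fail decision in the log comes down to comparing a mean absolute difference against a tolerance. Below is a minimal sketch in plain Python, assuming flat sequences of floats and an illustrative threshold of 1e-6 (an assumption, not a value taken from the source). `mean_abs_diff` and `passes` are hypothetical helpers and are not part of `reprod_log`.

```python
import math


def mean_abs_diff(a, b):
    """Mean of |a_i - b_i| over two equal-length, non-empty flat sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if len(a) == 0:
        raise ValueError("inputs must not be empty")
    # math.fsum limits the accumulated floating-point rounding error.
    return math.fsum(abs(x - y) for x, y in zip(a, b)) / len(a)


def passes(a, b, threshold=1e-6):
    """True when the mean absolute difference is below the threshold."""
    return mean_abs_diff(a, b) < threshold
```

For NumPy arrays such as the saved logits, the inputs could be flattened first with `array.ravel().tolist()`.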
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from reprod_log import ReprodDiffHelper

if __name__ == "__main__":
    diff_helper = ReprodDiffHelper()
    torch_info = diff_helper.load_info("./forward_torch.npy")
    paddle_info = diff_helper.load_info("./forward_paddle.npy")

    diff_helper.compare_info(torch_info, paddle_info)
    diff_helper.report(path="forward_diff.log")
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import os

import numpy as np
import paddle
from reprod_log import ReprodLogger

CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0]  # current directory
CONFIG_PATH = CURRENT_DIR.rsplit('/', 1)[0]
sys.path.append(CONFIG_PATH)

from models.pd_bert import BertConfig, BertForSequenceClassification

if __name__ == "__main__":
    paddle.set_device("cpu")

    # def logger
    reprod_logger = ReprodLogger()

    paddle_dump_path = '../weights/paddle_weight.pdparams'
    config = BertConfig()
    model = BertForSequenceClassification(config)
    checkpoint = paddle.load(paddle_dump_path)
    model.bert.load_dict(checkpoint)

    classifier_weights = paddle.load(
        "../classifier_weights/paddle_classifier_weights.bin")
    model.load_dict(classifier_weights)
    model.eval()
    # read or gen fake data

    fake_data = np.load("../fake_data/fake_data.npy")
    fake_data = paddle.to_tensor(fake_data)
    # forward
    out = model(fake_data)[0]
    reprod_logger.add("logits", out.cpu().detach().numpy())
    reprod_logger.save("forward_paddle.npy")
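
The script above locates the parent of its own folder with `CURRENT_DIR.rsplit('/', 1)[0]`, which relies on `/` being the path separator. A sketch of a separator-independent alternative using `pathlib` follows; `parent_of_script_dir` and `add_to_sys_path` are hypothetical helper names, and this is a suggestion rather than what the commit does.

```python
import sys
from pathlib import Path


def parent_of_script_dir(script_path):
    """Return the directory one level above the folder containing script_path."""
    return Path(script_path).resolve().parent.parent


def add_to_sys_path(directory):
    """Append directory to sys.path once and return the string that was used."""
    entry = str(directory)
    if entry not in sys.path:
        sys.path.append(entry)
    return entry
```

Inside a script such as `pd_forward_bert.py`, calling `add_to_sys_path(parent_of_script_dir(__file__))` before the `from models.pd_bert import ...` line would play the same role as the `CONFIG_PATH` lines.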
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import os

import numpy as np
from reprod_log import ReprodLogger
import torch

CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0]  # current directory
CONFIG_PATH = CURRENT_DIR.rsplit('/', 1)[0]
sys.path.append(CONFIG_PATH)

from models.pt_bert import BertConfig, BertForSequenceClassification

if __name__ == "__main__":
    # def logger
    reprod_logger = ReprodLogger()

    pytorch_dump_path = '../weights/torch_weight.bin'
    config = BertConfig()
    model = BertForSequenceClassification(config)
    checkpoint = torch.load(pytorch_dump_path)
    model.bert.load_state_dict(checkpoint)

    classifier_weights = torch.load(
        "../classifier_weights/torch_classifier_weights.bin")
    model.load_state_dict(classifier_weights, strict=False)
    model.eval()

    # read or gen fake data
    fake_data = np.load("../fake_data/fake_data.npy")
    fake_data = torch.from_numpy(fake_data)
    # forward
    out = model(fake_data)[0]
    reprod_logger.add("logits", out.cpu().detach().numpy())
    reprod_logger.save("forward_torch.npy")
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from collections import OrderedDict

import numpy as np
import paddle
import torch
from paddlenlp.transformers import BertForPretraining as PDBertForMaskedLM
from transformers import BertForMaskedLM as PTBertForMaskedLM


def convert_pytorch_checkpoint_to_paddle(
        pytorch_checkpoint_path="pytorch_model.bin",
        paddle_dump_path="model_state.pdparams",
        version="old",
):
    hf_to_paddle = {
        "embeddings.LayerNorm": "embeddings.layer_norm",
        "encoder.layer": "encoder.layers",
        "attention.self.query": "self_attn.q_proj",
        "attention.self.key": "self_attn.k_proj",
        "attention.self.value": "self_attn.v_proj",
        "attention.output.dense": "self_attn.out_proj",
        "intermediate.dense": "linear1",
        "output.dense": "linear2",
        "attention.output.LayerNorm": "norm1",
        "output.LayerNorm": "norm2",
        "predictions.decoder.": "predictions.decoder_",
        "predictions.transform.dense": "predictions.transform",
        "predictions.transform.LayerNorm": "predictions.layer_norm",
    }
    do_not_transpose = []
    if version == "old":
        hf_to_paddle.update({
            "predictions.bias": "predictions.decoder_bias",
            ".gamma": ".weight",
            ".beta": ".bias",
        })
        do_not_transpose = do_not_transpose + ["predictions.decoder.weight"]

    pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu")
    paddle_state_dict = OrderedDict()
    for k, v in pytorch_state_dict.items():
        is_transpose = False
        if k[-7:] == ".weight":
            # embeddings.weight and LayerNorm.weight do not transpose
            if all(d not in k for d in do_not_transpose):
                if ".embeddings." not in k and ".LayerNorm." not in k:
                    if v.ndim == 2:
                        if 'embeddings' not in k:
                            v = v.transpose(0, 1)
                            is_transpose = True
        # hf_to_paddle is not applied in this script, so k equals oldk here
        oldk = k
        print(f"Converting: {oldk} => {k} | is_transpose {is_transpose}")
        paddle_state_dict[k] = v.data.numpy()

    paddle.save(paddle_state_dict, paddle_dump_path)


def compare(out_torch, out_paddle):
    out_torch = out_torch.detach().numpy()
    out_paddle = out_paddle.detach().numpy()
    assert out_torch.shape == out_paddle.shape
    abs_dif = np.abs(out_torch - out_paddle)
    mean_dif = np.mean(abs_dif)
    max_dif = np.max(abs_dif)
    min_dif = np.min(abs_dif)
    print("mean_dif:{}".format(mean_dif))
    print("max_dif:{}".format(max_dif))
    print("min_dif:{}".format(min_dif))


def test_forward():
    paddle.set_device("cpu")
    model_torch = PTBertForMaskedLM.from_pretrained("./bert-base-uncased")
    model_paddle = PDBertForMaskedLM.from_pretrained("./bert-base-uncased")
    model_torch.eval()
    model_paddle.eval()
    np.random.seed(42)
    x = np.random.randint(1,
                          model_paddle.bert.config["vocab_size"],
                          size=(4, 64))
    input_torch = torch.tensor(x, dtype=torch.int64)
    out_torch = model_torch(input_torch)[0]

    input_paddle = paddle.to_tensor(x, dtype=paddle.int64)
    out_paddle = model_paddle(input_paddle)[0]

    print("torch result shape:{}".format(out_torch.shape))
    print("paddle result shape:{}".format(out_paddle.shape))
    compare(out_torch, out_paddle)


if __name__ == "__main__":
    convert_pytorch_checkpoint_to_paddle("test.bin", "test_paddle.pdparams")
    # test_forward()
    # torch result shape:torch.Size([4, 64, 30522])
    # paddle result shape:[4, 64, 30522]
    # mean_dif:1.666686512180604e-05
    # max_dif:0.00015211105346679688
    # min_dif:0.0
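
The script above defines a key-renaming table (`hf_to_paddle`) and swaps the two axes of 2-D linear weights. The transpose is needed because `torch.nn.Linear` stores its weight as `[out_features, in_features]`, while `paddle.nn.Linear` stores it as `[in_features, out_features]`. The sketch below shows, in plain Python, how such a table could be applied by substring replacement and what the transpose does to the shape. `rename_key` and `transpose_2d` are hypothetical helpers for illustration, and nested lists stand in for tensors.

```python
def rename_key(key, mapping):
    """Apply every substring replacement in mapping to key, in insertion order."""
    for old, new in mapping.items():
        key = key.replace(old, new)
    return key


def transpose_2d(matrix):
    """Swap the two axes of a rectangular nested list."""
    return [list(column) for column in zip(*matrix)]


# Two entries copied from the hf_to_paddle table above.
mapping = {
    "encoder.layer": "encoder.layers",
    "attention.self.query": "self_attn.q_proj",
}

torch_key = "bert.encoder.layer.0.attention.self.query.weight"
torch_weight = [[1, 2, 3], [4, 5, 6]]  # shape [out_features=2, in_features=3]

paddle_key = rename_key(torch_key, mapping)
paddle_weight = transpose_2d(torch_weight)  # shape [in_features=3, out_features=2]
print(paddle_key)
print(paddle_weight)
```

With the full table, insertion order matters: `"attention.output.dense"` has to be replaced before the shorter `"output.dense"`, which is the order in which the table lists them.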
