
Commit ee4944e

add paddle nv-embed-v1 (#8785)
* add paddle nv-embed-v1
* rename hf_model and use config in models
1 parent 39632e9 commit ee4944e

File tree

4 files changed: +336 -77 lines changed

legacy/pipelines/examples/contrastive_training/README.md

Lines changed: 61 additions & 39 deletions
@@ -2,18 +2,19 @@

## Installation

Installing the GPU version of [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/conda/linux-conda.html) is recommended. Taking the CUDA 11.7 build of Paddle as an example, the install commands are:

```
conda install nccl -c conda-forge
conda install paddlepaddle-gpu==2.6.1 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
```
Install the other dependencies:
```
+pip install git+https://github.com/PaddlePaddle/PaddleNLP.git@develop
pip install -r requirements.txt
```
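After installing, one quick way to verify that the GPU build of Paddle works is Paddle's built-in check. This is a standard PaddlePaddle utility, not a step from this README:

```
import paddle

paddle.utils.run_check()  # reports whether PaddlePaddle is installed correctly and sees the GPU
print(paddle.device.get_device())  # e.g. "gpu:0"
```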

Download the DuReader-Retrieval Chinese dataset:

```
cd data
@@ -42,34 +43,34 @@ python train.py --do_train \
    --use_matryoshka
```

- `model_name_or_path`: Pretrained model to use; rocketqa-zh-base-query-encoder is available
- `output_dir`: Path where the model is saved
- `train_data`: Path to the training dataset; the DuReader Chinese dataset is used here
- `overwrite_output_dir`: Whether to overwrite the model save path; defaults to False
- `fine_tune_type`: Training mode; strategies such as sft, lora, and bitfit are available
- `sentence_pooling_method`: Sentence pooling method, cls or mean; cls uses the CLS token, mean uses mean pooling
- `num_train_epochs`: Number of training epochs
- `per_device_train_batch_size`: Per-device training batch size
- `learning_rate`: Learning rate
- `train_group_size`: Number of positive and negative samples per training group; defaults to 8. For example, train_group_size=4 means each group contains 1 positive and 3 negatives
- `max_example_num_per_dataset`: Maximum number of samples per training dataset; defaults to 100000000
- `recompute`: Whether to enable recompute; defaults to False
- `query_max_len`: Maximum query length; defaults to 32
- `query_instruction_for_retrieval`: Retrieval instruction for queries; defaults to None
- `passage_instruction_for_retrieval`: Retrieval instruction for passages; defaults to None
- `passage_max_len`: Maximum passage length; defaults to 512
- `use_matryoshka`: Whether to use the Matryoshka (nested embedding) strategy; defaults to False
- `matryoshka_dims`: Embedding dimensions used by the Matryoshka strategy; defaults to [64, 128, 256, 512, 768]
- `matryoshka_loss_weights`: Loss weights for the Matryoshka strategy; defaults to [1, 1, 1, 1, 1]
- `use_inbatch_neg`: Whether to use the in-batch negatives strategy; defaults to False
- `use_flash_attention`: Whether to use flash attention; defaults to False
- `temperature`: Temperature for the in-batch negatives strategy; defaults to 0.02
- `negatives_cross_device`: Whether to use in-batch negatives across devices; defaults to False
- `margin`: Margin for the in-batch negatives strategy; defaults to 0.2 (a sketch of how these loss options combine follows this list)
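To make `use_inbatch_neg`, `temperature`, and the Matryoshka options concrete, here is a minimal sketch of how such a combined loss can be computed. This is an editor's illustration of the general technique, not the repository's train.py; the function name and the toy tensors are made up, and the `margin` term is omitted for brevity:

```
import paddle
import paddle.nn.functional as F


def matryoshka_inbatch_loss(q, p, dims=(64, 128, 256, 512, 768),
                            weights=(1, 1, 1, 1, 1), temperature=0.02):
    """q, p: [batch, hidden] query/passage embeddings, where p[i] is the
    positive for q[i] and every other row in the batch is a negative."""
    labels = paddle.arange(q.shape[0])  # the positive sits on the diagonal
    total = 0.0
    for dim, weight in zip(dims, weights):
        # Matryoshka: score with only the first `dim` components of each embedding.
        q_d = F.normalize(q[:, :dim], axis=-1)
        p_d = F.normalize(p[:, :dim], axis=-1)
        # In-batch negatives: a [batch, batch] similarity matrix, sharpened by temperature.
        logits = paddle.matmul(q_d, p_d, transpose_y=True) / temperature
        total = total + weight * F.cross_entropy(logits, labels)
    return total


q, p = paddle.randn([8, 768]), paddle.randn([8, 768])  # toy embeddings
print(matryoshka_inbatch_loss(q, p))
```

A smaller `temperature` sharpens the softmax over in-batch candidates, which is one reason large batches (more in-batch negatives) help contrastive training.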

### Multi-GPU Training

Single-GPU training is inefficient and its batch_size is small, so multi-GPU training is recommended; for contrastive learning, a large batch_size with multi-GPU training is preferred. An example command:

```
python -m paddle.distributed.launch --gpus "1,2,3,4" train.py --do_train \
@@ -100,21 +101,42 @@ python evaluation/benchmarks.py --model_type bert \
    --query_max_length 64 \
    --passage_max_length 512 \
```
- `model_type`: Model type, e.g. bert, roberta, etc.
- `query_model`: Path to the query embedding model
- `passage_model`: Path to the passage embedding model
- `query_max_length`: Maximum query length
- `passage_max_length`: Maximum passage length
- `evaluate_all`: Whether to evaluate all checkpoints; defaults to False, i.e. only the specified checkpoint is evaluated
- `checkpoint_dir`: Used together with `evaluate_all`


## MTEB Evaluation
[MTEB](https://github.com/embeddings-benchmark/mteb)
is a large-scale text-embedding benchmark that includes a rich set of retrieval evaluation tasks and datasets.
This repository mainly targets its Chinese and English retrieval tasks (Retrieval), using the SciFact dataset as the primary example.

+Evaluate the NV-Embed retrieval embedding model ([NV-Embed-v1](https://huggingface.co/nvidia/NV-Embed-v1)):
+```
+export CUDA_VISIBLE_DEVICES=0
+python eval_mteb.py \
+    --base_model_name_or_path NV-Embed-v1 \
+    --output_folder en_results/nv-embed-v1 \
+    --query_instruction "Given a claim, find documents that refute the claim" \
+    --task_name 'SciFact' \
+    --eval_batch_size 8
+```
+The result file is saved to `en_results/nv-embed-v1/SciFact/last/no_model_name_available/no_revision_available/SciFact.json` and contains evaluation results similar to:
+```
+'ndcg_at_1': 0.67667,
+'ndcg_at_3': 0.73826,
+'ndcg_at_5': 0.76662,
+'ndcg_at_10': 0.783,
+'ndcg_at_20': 0.7936,
+'ndcg_at_100': 0.80206,
+'ndcg_at_1000': 0.80444
+```
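As context for the `--query_instruction` flag above: the NV-Embed branch of eval_mteb.py (see the diff of that file below) wraps the instruction into NV-Embed's prompt template before encoding queries. A minimal illustration, with a made-up example query:

```
# How the NV-Embed branch of eval_mteb.py templates the instruction;
# the example strings here are illustrative only.
query_instruction = "Given a claim, find documents that refute the claim"
query_prefix = "Instruct: " + query_instruction + "\nQuery: "
passage_prefix = ""  # passages are encoded without a prefix (except for QuoraRetrieval)

print(query_prefix + "Vitamin C prevents the common cold.")
```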

Evaluate the RepLLaMA retrieval embedding model ([repllama-v1-7b-lora-passage](https://huggingface.co/castorini/repllama-v1-7b-lora-passage)):
```
export CUDA_VISIBLE_DEVICES=0
python evaluation/mteb/eval_mteb.py \
@@ -143,7 +165,7 @@ python evaluation/mteb/eval_mteb.py \
'ndcg_at_1000': 0.7794
```

Evaluate the BGE retrieval embedding model ([bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5)):
```
export CUDA_VISIBLE_DEVICES=0
python evaluation/mteb/eval_mteb.py \
@@ -174,15 +196,15 @@ python evaluation/mteb/eval_mteb.py \
Configurable parameters:
- `base_model_name_or_path`: Model name or path
- `output_folder`: Path where result files are stored
- `task_name`: Task (dataset) name, e.g. SciFact
- `task_split`: Query split to evaluate, e.g. test or dev
- `query_instruction`: Prompt text prepended to queries, e.g. 'query: ' or None
- `document_instruction`: Prompt text prepended to documents, e.g. 'passage: ' or None
- `pooling_method`: How the representation is obtained; last takes the last token, mean takes the average, cls takes the `[CLS]` token (a sketch follows this list)
- `max_seq_length`: Maximum sequence length
- `eval_batch_size`: Batch size for model inference (per single GPU)
- `pad_token`: Which token to use for padding; one of unk_token, eos_token, or pad_token
- `padding_side`: Which side to pad on; left or right
- `add_bos_token`: Whether to add the beginning-of-sequence token; 0 means no, 1 means yes
- `add_eos_token`: Whether to add the end-of-sequence token; 0 means no, 1 means yes
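As a companion to `pooling_method`, here is a minimal sketch of the three pooling variants over a `[batch, seq_len, hidden]` tensor of hidden states with a 0/1 attention mask. It is an editor's illustration under those assumptions, not the repository's mteb_models.py:

```
import paddle


def pool(hidden, mask, method="mean"):
    """hidden: [batch, seq, dim] hidden states; mask: [batch, seq], 1 for real tokens.
    Assumes right padding when method == "last"."""
    if method == "cls":
        return hidden[:, 0]  # representation of the first ([CLS]) token
    if method == "last":
        # Index of the last non-padding token in each row.
        idx = (mask.sum(axis=1) - 1).astype("int64").unsqueeze(-1).unsqueeze(-1)
        idx = idx.expand([-1, 1, hidden.shape[-1]])
        return paddle.take_along_axis(hidden, idx, axis=1).squeeze(1)
    if method == "mean":
        m = mask.unsqueeze(-1).astype(hidden.dtype)
        return (hidden * m).sum(axis=1) / m.sum(axis=1)
    raise ValueError(f"unknown pooling method: {method}")


h = paddle.randn([2, 5, 8])
m = paddle.to_tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
print(pool(h, m, "last").shape)  # [2, 8]
```

This also explains why the script rejects `padding_side == "left"` together with `pooling_method == "cls"`: with left padding, position 0 holds a pad token rather than `[CLS]`.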

legacy/pipelines/examples/contrastive_training/evaluation/mteb/eval_mteb.py

Lines changed: 65 additions & 33 deletions
@@ -15,15 +15,28 @@
import argparse
import logging

+import mteb
+import paddle
+from evaluation.mteb.mteb_models_nv import NVEncodeModel
from mteb import MTEB
from mteb_models import EncodeModel

-from paddlenlp.transformers import AutoModel, AutoTokenizer
+from paddlenlp.peft import LoRAConfig, LoRAModel
+from paddlenlp.transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer


def get_model(peft_model_name, base_model_name):
    if peft_model_name is not None:
-        raise NotImplementedError("PEFT model is not supported yet")
+        base_model = AutoModelForCausalLM.from_pretrained(base_model_name, dtype="bfloat16")
+        lora_config = LoRAConfig.from_pretrained(peft_model_name)
+        lora_config.merge_weights = True
+        lora_weights = paddle.load(peft_model_name + "/lora_model_state.pdparams")
+        k = list(lora_weights.keys())[0]
+        assert k.startswith(
+            "llama."
+        ), "You must manually replace 'model' with 'llama'. Please refer to do_replace_model_llama.py"
+        model = LoRAModel.from_pretrained(base_model, peft_model_name, lora_config=lora_config, dtype="bfloat16")
+        return model
    else:
        base_model = AutoModel.from_pretrained(base_model_name)
        return base_model
@@ -67,39 +80,58 @@ def get_args():
logging.basicConfig(level=logging.INFO)
logger.info("Args: {}".format(args))

-model = get_model(args.peft_model_name_or_path, args.base_model_name_or_path)
-
-tokenizer = AutoTokenizer.from_pretrained(args.base_model_name_or_path)
-assert hasattr(tokenizer, args.pad_token), f"Tokenizer does not have {args.pad_token} token"
-token_dict = {"unk_token": tokenizer.unk_token, "eos_token": tokenizer.eos_token, "pad_token": tokenizer.pad_token}
-tokenizer.pad_token = token_dict[args.pad_token]
-
-assert args.padding_side in [
-    "right",
-    "left",
-], f"padding_side should be either 'right' or 'left', but got {args.padding_side}"
-assert not (
-    args.padding_side == "left" and args.pooling_method == "cls"
-), "Padding 'left' is not supported for pooling method 'cls'"
-tokenizer.padding_side = args.padding_side
-
-assert args.add_bos_token in [0, 1], f"add_bos_token should be either 0 or 1, but got {args.add_bos_token}"
-assert args.add_eos_token in [0, 1], f"add_eos_token should be either 0 or 1, but got {args.add_eos_token}"
-tokenizer.add_bos_token = bool(args.add_bos_token)
-tokenizer.add_eos_token = bool(args.add_eos_token)
-
-encode_model = EncodeModel(
-    model=model,
-    tokenizer=tokenizer,
-    pooling_method=args.pooling_method,
-    query_instruction=args.query_instruction,
-    document_instruction=args.document_instruction,
-    eval_batch_size=args.eval_batch_size,
-    max_seq_length=args.max_seq_length,
-)
+if "NV-Embed" in args.base_model_name_or_path:
+    logger.info("Using NV-Embed")
+
+    query_prefix = "Instruct: " + args.query_instruction + "\nQuery: "
+    passage_prefix = ""
+
+    if args.task_name == "QuoraRetrieval":
+        assert args.document_instruction != "document: ", "QuoraRetrieval requires a document instruction"
+        passage_prefix = "Instruct: " + args.document_instruction + "\nQuery: "  # because this is an STS-style task
+
+    encode_model = NVEncodeModel.from_pretrained(
+        args.base_model_name_or_path,
+        tokenizer_path=args.base_model_name_or_path,
+        eval_batch_size=args.eval_batch_size,
+        query_instruction=query_prefix,
+        document_instruction=passage_prefix,
+        dtype="float16",
+    )
+    encode_model.eval()
+
+else:
+    model = get_model(args.peft_model_name_or_path, args.base_model_name_or_path)
+
+    assert args.add_bos_token in [0, 1], f"add_bos_token should be either 0 or 1, but got {args.add_bos_token}"
+    assert args.add_eos_token in [0, 1], f"add_eos_token should be either 0 or 1, but got {args.add_eos_token}"
+    tokenizer = AutoTokenizer.from_pretrained(args.base_model_name_or_path)
+    assert hasattr(tokenizer, args.pad_token), f"Tokenizer does not have {args.pad_token} token"
+    token_dict = {"unk_token": tokenizer.unk_token, "eos_token": tokenizer.eos_token, "pad_token": tokenizer.pad_token}
+    tokenizer.pad_token = token_dict[args.pad_token]
+    assert args.padding_side in [
+        "right",
+        "left",
+    ], f"padding_side should be either 'right' or 'left', but got {args.padding_side}"
+    assert not (
+        args.padding_side == "left" and args.pooling_method == "cls"
+    ), "Padding 'left' is not supported for pooling method 'cls'"
+    tokenizer.padding_side = args.padding_side
+    tokenizer.add_bos_token = bool(args.add_bos_token)
+    tokenizer.add_eos_token = bool(args.add_eos_token)
+
+    encode_model = EncodeModel(
+        model=model,
+        tokenizer=tokenizer,
+        pooling_method=args.pooling_method,
+        query_instruction=args.query_instruction,
+        document_instruction=args.document_instruction,
+        eval_batch_size=args.eval_batch_size,
+        max_seq_length=args.max_seq_length,
+    )

logger.info("Ready to eval")
-evaluation = MTEB(tasks=[args.task_name])
+evaluation = MTEB(tasks=mteb.get_tasks(tasks=[args.task_name]))
evaluation.run(
    encode_model,
    output_folder=f"{args.output_folder}/{args.task_name}/{args.pooling_method}",
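The change from `MTEB(tasks=[args.task_name])` to `mteb.get_tasks(tasks=[args.task_name])` matches newer mteb usage, where task names are resolved to task objects up front rather than passed as bare strings. A minimal sketch of that calling convention, assuming a recent mteb release (the encode model is stubbed out here):

```
import mteb

tasks = mteb.get_tasks(tasks=["SciFact"])  # resolve the name to a task object
evaluation = mteb.MTEB(tasks=tasks)
# evaluation.run(encode_model, output_folder="en_results/nv-embed-v1/SciFact/last")
```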
