
Commit 3c7b524

support sharegpt dataset format (modelscope#1052)
1 parent 7a4eb5d commit 3c7b524

File tree

9 files changed: +82 -48 lines changed

- docs/source/LLM/命令行参数.md
- docs/source/LLM/自定义与拓展.md
- docs/source_en/LLM/Command-line-parameters.md
- docs/source_en/LLM/Customization.md
- swift/llm/utils/dataset.py
- swift/llm/utils/preprocess.py
- swift/ui/llm_infer/llm_infer.py
- tests/llm/data/sharegpt.jsonl
- tests/llm/test_run.py

docs/source/LLM/命令行参数.md

Lines changed: 2 additions & 2 deletions

@@ -35,7 +35,7 @@
 - `--dataset`: The dataset(s) used for training, default `[]`. The available datasets are listed in [Supported Models and Datasets](支持的模型和数据集.md#数据集). To train on multiple datasets, separate them with ',' or ' ', e.g. `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. ModelScope Hub, HuggingFace Hub and local paths are supported, as are subset selection and dataset sampling. Each dataset is specified as `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`; in the simplest case only dataset_name, dataset_id or dataset_path is needed. Custom datasets are described in the [customization and extension document](自定义与拓展.md#自定义数据集).
   - Support for the MS and HF hubs and for dataset_sample, e.g. 'MS::alpaca-zh#2000', 'HF::jd-sentiment-zh#2000' (the default hub is controlled by the `USE_HF` environment variable and defaults to MS).
   - Finer-grained control over subsets: by default the subsets specified at registration are used ('default' if none were specified), e.g. 'sharegpt-gpt4'. If subsets are given, only those subsets are used, e.g. 'sharegpt-gpt4:default/V3_format#2000'; separate them with '/'.
-  - Support for dataset_id, e.g. 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id is already registered, the preprocessing function, subsets, split, etc. from registration are used. Otherwise `SmartPreprocessor` is used, which supports 4 dataset formats, with the 'default' subset and split set to 'train'. The supported dataset formats are described in the [customization and extension document](自定义与拓展.md#自定义数据集).
+  - Support for dataset_id, e.g. 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id is already registered, the preprocessing function, subsets, split, etc. from registration are used. Otherwise `SmartPreprocessor` is used, which supports 5 dataset formats, with the 'default' subset and split set to 'train'. The supported dataset formats are described in the [customization and extension document](自定义与拓展.md#自定义数据集).
   - Support for dataset_path, e.g. '1.jsonl#5000' (a relative path is resolved against the working directory).
 - `--val_dataset`: Specifies a separate validation set, in the same format as the `dataset` argument, default `[]`. If this argument is used, `dataset_test_ratio` no longer takes effect.
 - `--dataset_seed`: Seed for dataset processing, default `42`. It exists as a random_state and does not affect the global seed.

@@ -294,7 +294,7 @@ The export parameters inherit the infer parameters; in addition, the following parameters are added:
 - `--dataset`: Already defined in InferArguments; for export it denotes the quantization dataset. Default `[]`. More details, including how to customize the quantization dataset, are given in the [LLM quantization documentation](LLM量化文档.md).
 - `--quant_n_samples`: Quantization parameter, default `256`. With `--quant_method awq`, if OOM occurs during quantization you can moderately lower `--quant_n_samples` and `--quant_seqlen`. `--quant_method gptq` generally does not run out of memory during quantization.
 - `--quant_seqlen`: Quantization parameter, default `2048`.
-- `--quant_device_map`: Default `'cpu'`, to save GPU memory. You can specify 'cuda:0', 'auto', 'cpu', etc., i.e. the device onto which the model is loaded for quantization.
+- `--quant_device_map`: Default `'cpu'`, to save GPU memory. You can specify 'cuda:0', 'auto', 'cpu', etc., i.e. the device onto which the model is loaded for quantization. This argument is independent of the device that actually performs the quantization; for example, awq and gptq quantize on cuda:0.
 - `--quant_output_dir`: Default `None`; the default quant_output_dir is printed on the command line.
 - `--push_to_hub`: Default `False`. Whether to push the final `ckpt_dir` to the ModelScope Hub. If `merge_lora` is specified, the full parameters are pushed; if `quant_bits` is also specified, the quantized model is pushed.
 - `--hub_model_id`: Default `None`. The model_id on the ModelScope Hub to push to. Required if `push_to_hub` is set to True.

docs/source/LLM/自定义与拓展.md

Lines changed: 8 additions & 0 deletions

@@ -87,6 +87,14 @@ system,query,response
 
 **Format 4:**
 
+```jsonl
+{"system": "00000", "conversation": [{"human": "11111", "assistant": "22222"}]}
+{"conversation": [{"human": "aaaaa", "assistant": "bbbbb"}]}
+{"conversation": [{"human": "AAAAA", "assistant": "BBBBB"}, {"human": "CCCCC", "assistant": "DDDDD"}, {"human": "EEEEE", "assistant": "FFFFF"}]}
+```
+
+**Format 5:**
+
 ```csv
 system,instruction,input,output
 00000,11111,22222,33333

docs/source_en/LLM/Command-line-parameters.md

Lines changed: 2 additions & 2 deletions

@@ -33,7 +33,7 @@
 - `--dataset`: Used to select the training dataset, default is `[]`. You can see the list of available datasets [here](Supported-models-datasets.md#Datasets). If you need to train with multiple datasets, you can use ',' or ' ' to separate them, for example: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. It supports ModelScope Hub/HuggingFace Hub/local paths, subset selection, and dataset sampling. The specified format for each dataset is as follows: `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. The simplest case requires specifying only dataset_name, dataset_id, or dataset_path. Customizing datasets is covered in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
   - Supports the MS and HF hubs, as well as dataset_sample. For example, 'MS::alpaca-zh#2000', 'HF::jd-sentiment-zh#2000' (the default hub used is controlled by the `USE_HF` environment variable, default is MS).
   - More fine-grained control over subsets: it uses the subsets specified during registration by default (if not specified during registration, it uses 'default'). For example, 'sharegpt-gpt4'. If subsets are specified, it uses the corresponding subsets of the dataset. For example, 'sharegpt-gpt4:default/V3_format#2000'. Separate them with '/'.
-  - Support for dataset_id. For example, 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id has been registered, it will use the preprocessing function, subsets, split, etc. specified during registration. Otherwise, it will use `SmartPreprocessor`, which supports 4 dataset formats, with 'default' subsets and split set to 'train'. The supported dataset formats can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
+  - Support for dataset_id. For example, 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id has been registered, it will use the preprocessing function, subsets, split, etc. specified during registration. Otherwise, it will use `SmartPreprocessor`, which supports 5 dataset formats, with 'default' subsets and split set to 'train'. The supported dataset formats can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
   - Support for dataset_path. For example, '1.jsonl#5000' (if it is a relative path, it is relative to the running directory).
 - `--val_dataset`: Specify separate validation datasets in the same format as the `dataset` argument, default is `[]`. If `val_dataset` is used, `dataset_test_ratio` will be ignored.
 - `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as random_state, does not affect the global seed.

@@ -292,7 +292,7 @@ export parameters inherit from infer parameters, with the following added parame
 - `--dataset`: This parameter is already defined in InferArguments; for export it means the quantization dataset. Default is `[]`. More details, including how to customize the quantization dataset, can be found in the [LLM Quantization Documentation](LLM-quantization.md).
 - `--quant_n_samples`: Quantization parameter, default is `256`. When `--quant_method awq` is set, if OOM occurs during quantization, you can moderately reduce `--quant_n_samples` and `--quant_seqlen`. `--quant_method gptq` generally does not encounter quantization OOM.
 - `--quant_seqlen`: Quantization parameter, default is `2048`.
-- `--quant_device_map`: Default is `'cpu'`, to save memory. You can specify 'cuda:0', 'auto', 'cpu', etc., representing the device onto which the model is loaded during quantization.
+- `--quant_device_map`: Default is `'cpu'`, to save memory. You can specify 'cuda:0', 'auto', 'cpu', etc., representing the device onto which the model is loaded during quantization. This parameter is independent of the device on which the quantization itself runs; for example, AWQ and GPTQ carry out quantization on cuda:0.
 - `--quant_output_dir`: Default is `None`; the default quant_output_dir will be printed in the command line.
 - `--push_to_hub`: Default is `False`. Whether to push the final `ckpt_dir` to ModelScope Hub. If you specify `merge_lora`, full parameters will be pushed; if you also specify `quant_bits`, the quantized model will be pushed.
 - `--hub_model_id`: Default is `None`. The model_id to push to on ModelScope Hub. If `push_to_hub` is set to True, this parameter must be set.
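
The dataset specification syntax documented above can also be driven from Python. Below is a minimal sketch, assuming ms-swift is installed and that `SftArguments`/`sft_main` are importable from `swift.llm`; the model type, sample counts, and the local file name are placeholders, not part of this commit.

```python
# Minimal sketch of the --dataset specification syntax via the Python API.
# Assumptions: ms-swift is installed; SftArguments/sft_main are exported by
# swift.llm; the model type and file names below are illustrative only.
from swift.llm import SftArguments, sft_main

args = SftArguments(
    model_type='qwen-7b-chat',  # illustrative model type
    dataset=[
        'MS::alpaca-zh#2000',                    # ModelScope hub dataset, sampled to 2000 rows
        'sharegpt-gpt4:default/V3_format#2000',  # registered dataset with explicit subsets
        'HF::shibing624/alpaca-zh#2000',         # HuggingFace dataset_id
        'my_sharegpt.jsonl#500',                 # local file in the new sharegpt format
    ],
)
# sft_main(args)  # uncomment to launch fine-tuning
```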

docs/source_en/LLM/Customization.md

Lines changed: 8 additions & 0 deletions

@@ -93,6 +93,14 @@ Multi-Round Dialogue
 
 **Format 4:**
 
+```jsonl
+{"system": "00000", "conversation": [{"human": "11111", "assistant": "22222"}]}
+{"conversation": [{"human": "aaaaa", "assistant": "bbbbb"}]}
+{"conversation": [{"human": "AAAAA", "assistant": "BBBBB"}, {"human": "CCCCC", "assistant": "DDDDD"}, {"human": "EEEEE", "assistant": "FFFFF"}]}
+```
+
+**Format 5:**
+
 ```csv
 system,instruction,input,output
 00000,11111,22222,33333
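
A quick way to try the new Format 4 is to write a few rows to a local `.jsonl` file and pass its path through `--dataset`; as the `swift/llm/utils/preprocess.py` change below shows, `SmartPreprocessor` recognizes this format by the `conversation` key. A minimal, standard-library-only sketch; the file name is arbitrary:

```python
# Write the Format 4 sample rows to a local jsonl file; it can then be passed
# to training as e.g. `--dataset my_sharegpt.jsonl#3` (file name is arbitrary).
import json

rows = [
    {'system': '00000', 'conversation': [{'human': '11111', 'assistant': '22222'}]},
    {'conversation': [{'human': 'aaaaa', 'assistant': 'bbbbb'}]},
    {'conversation': [{'human': 'AAAAA', 'assistant': 'BBBBB'},
                      {'human': 'CCCCC', 'assistant': 'DDDDD'},
                      {'human': 'EEEEE', 'assistant': 'FFFFF'}]},
]
with open('my_sharegpt.jsonl', 'w', encoding='utf-8') as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + '\n')
```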

swift/llm/utils/dataset.py

Lines changed: 14 additions & 28 deletions

@@ -20,7 +20,8 @@
 
 from swift.utils import get_logger, get_seed, is_dist, is_local_master, read_from_jsonl, transform_jsonl_to_df
 from .preprocess import (AlpacaPreprocessor, ClsPreprocessor, ComposePreprocessor, ConversationsPreprocessor,
-                         PreprocessFunc, RenameColumnsPreprocessor, SmartPreprocessor, TextGenerationPreprocessor)
+                         PreprocessFunc, RenameColumnsPreprocessor, SmartPreprocessor, TextGenerationPreprocessor,
+                         preprocess_sharegpt)
 from .template import History
 from .utils import download_dataset
 

@@ -213,21 +214,26 @@ def register_local_dataset(
 
 
 def register_dataset_info(dataset_name: str, d_info: Dict[str, Any], **kwargs) -> None:
-    if 'dataset_path' in d_info:
-        base_dir = kwargs.pop('base_dir', None)
-        register_local_dataset(dataset_name, d_info.pop('dataset_path', None), base_dir, **d_info)
-        return
-
-    assert 'dataset_id' in d_info or 'hf_dataset_id' in d_info
     preprocess_func = None
     if 'columns' in d_info:
         preprocess_func = RenameColumnsPreprocessor(d_info['columns'])
         d_info.pop('columns')
+        d_info['preprocess_func'] = preprocess_func
     elif 'conversations' in d_info:
         preprocess_func = ConversationsPreprocessor(**d_info['conversations'])
         d_info.pop('conversations')
+        d_info['preprocess_func'] = preprocess_func
+
+    if 'dataset_path' in d_info:
+        base_dir = kwargs.pop('base_dir', None)
+        register_local_dataset(dataset_name, d_info.pop('dataset_path', None), base_dir, **d_info)
+        return
+
+    assert 'dataset_id' in d_info or 'hf_dataset_id' in d_info
+
     dataset_id = d_info.pop('dataset_id', None)
     subsets = d_info.pop('subsets', None)
+    preprocess_func = d_info.pop('preprocess_func', None)
     register_dataset(dataset_name, dataset_id, subsets, preprocess_func, get_dataset_from_repo, **d_info, exist_ok=True)
 
 

@@ -809,30 +815,10 @@ def reorganize_row(row):
     get_dataset_from_repo,
     tags=['rlhf', 'dpo', 'pairwise'])
 
-
-def _preprocess_sharegpt(dataset: HfDataset) -> HfDataset:
-    query = []
-    response = []
-    history: List[History] = []
-    for d in tqdm(dataset):
-        if isinstance(d['conversation'], str):
-            try:
-                conversation = ast.literal_eval(d['conversation'])
-            except SyntaxError:
-                continue
-        query.append(conversation[-1]['human'])
-        response.append(conversation[-1]['assistant'])
-        h = []
-        for c in conversation[:-1]:
-            h.append([c['human'], c['assistant']])
-        history.append(h)
-    return HfDataset.from_dict({'query': query, 'response': response, 'history': history})
-
-
 register_dataset(
     DatasetName.sharegpt,
     'huangjintao/sharegpt', ['common-zh', 'computer-zh', 'unknow-zh', 'common-en', 'computer-en'],
-    _preprocess_sharegpt,
+    preprocess_sharegpt,
     get_dataset_from_repo,
     tags=['chat', 'general', 'multi-round'])
 
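The reordering in `register_dataset_info` stores the preprocessor built from a `columns` or `conversations` entry into `d_info` before the local-dataset branch, so a `dataset_path` entry can now carry a column mapping as well. A hedged sketch of such an entry; the dataset name, file name, and column names are made up for illustration:

```python
# Hypothetical dataset_info entry; the key names follow register_dataset_info
# above, while the dataset name, csv file, and column names are illustrative.
from swift.llm.utils.dataset import register_dataset_info

d_info = {
    'dataset_path': 'my_local_data.csv',                   # handled by register_local_dataset
    'columns': {'prompt': 'query', 'answer': 'response'},  # wrapped into a RenameColumnsPreprocessor
}
# register_dataset_info('my-local-data', d_info, base_dir='.')
```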
swift/llm/utils/preprocess.py

Lines changed: 41 additions & 2 deletions

@@ -144,7 +144,7 @@ def __call__(self, dataset: HfDataset) -> HfDataset:
             'query': query,
             'response': response,
         })
-        dataset = HfDataset.from_dict({**kwargs})
+        dataset = HfDataset.from_dict(kwargs)
         return dataset
 
 

@@ -170,6 +170,41 @@ def __call__(self, dataset: HfDataset) -> HfDataset:
         return dataset
 
 
+def preprocess_sharegpt(dataset: HfDataset) -> HfDataset:
+    query = []
+    response = []
+    system: List[Optional[str]] = []
+    has_system = False
+    history: List[History] = []
+    has_history = False
+    for d in tqdm(dataset):
+        if isinstance(d['conversation'], str):
+            try:
+                conversation = ast.literal_eval(d['conversation'])
+            except SyntaxError:
+                continue
+        else:
+            conversation = d['conversation']
+        query.append(conversation[-1]['human'])
+        response.append(conversation[-1]['assistant'])
+        h = []
+        for c in conversation[:-1]:
+            h.append([c['human'], c['assistant']])
+        if len(h) > 0:
+            has_history = True
+        history.append(h)
+        sys = d.get('system')
+        if sys is not None:
+            has_system = True
+        system.append(sys)
+    kwargs = {'query': query, 'response': response}
+    if has_history:
+        kwargs['history'] = history
+    if has_system:
+        kwargs['system'] = system
+    return HfDataset.from_dict(kwargs)
+
+
 class SmartPreprocessor:
 
     def __init__(self) -> None:

@@ -182,14 +217,18 @@ def __init__(self) -> None:
                 'required': ['instruction', 'output'],
                 'preprocessor': AlpacaPreprocessor()
             },
-            'conversations': {
+            'conversations': {  # qwen
                 'required': ['conversations'],
                 'preprocessor': ConversationsPreprocessor()
             },
             'chatml': {
                 'required': ['messages'],
                 'preprocessor':
                 ConversationsPreprocessor(conversations_key='messages', from_key='role', value_key='content')
+            },
+            'sharegpt': {
+                'required': ['conversation'],
+                'preprocessor': preprocess_sharegpt
             }
         }
 
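For reference, here is a small sketch of what the new `preprocess_sharegpt` produces, assuming ms-swift and the `datasets` package are installed; the rows mirror the `tests/llm/data/sharegpt.jsonl` fixture added by this commit, and the expected output is shown in the comments:

```python
# Sketch: run preprocess_sharegpt on an in-memory dataset whose rows mirror
# tests/llm/data/sharegpt.jsonl; multi-turn rows become query/response/history.
from datasets import Dataset as HfDataset

from swift.llm.utils.preprocess import preprocess_sharegpt

raw = HfDataset.from_list([
    {'system': '00000', 'conversation': [{'human': '11111', 'assistant': '22222'}]},
    {'system': None, 'conversation': [{'human': 'AAAAA', 'assistant': 'BBBBB'},
                                      {'human': 'CCCCC', 'assistant': 'DDDDD'}]},
])
processed = preprocess_sharegpt(raw)
print(processed.column_names)                          # ['query', 'response', 'history', 'system']
print(processed[1]['query'], processed[1]['history'])  # CCCCC [['AAAAA', 'BBBBB']]
```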
swift/ui/llm_infer/llm_infer.py

Lines changed: 2 additions & 13 deletions

@@ -362,26 +362,15 @@ def send_message(cls, running_task, model_and_template, template_type, prompt: s
         stream_resp_with_history = ''
         if not template_type.endswith('generation'):
             stream_resp = inference_client(
-                model_type,
-                prompt,
-                old_history,
-                system=system,
-                port=args['port'],
-                adapter_name='default-lora' if sft_type in ('lora', 'longlora') else None,
-                request_config=request_config)
+                model_type, prompt, old_history, system=system, port=args['port'], request_config=request_config)
             for chunk in stream_resp:
                 stream_resp_with_history += chunk.choices[0].delta.content
                 qr_pair = [prompt, stream_resp_with_history]
                 total_history = old_history + [qr_pair]
                 yield '', total_history
         else:
             request_config.max_tokens = max_new_tokens
-            stream_resp = inference_client(
-                model_type,
-                prompt,
-                port=args['port'],
-                adapter_name='default-lora' if sft_type in ('lora', 'longlora') else None,
-                request_config=request_config)
+            stream_resp = inference_client(model_type, prompt, port=args['port'], request_config=request_config)
             for chunk in stream_resp:
                 stream_resp_with_history += chunk.choices[0].text
                 qr_pair = [prompt, stream_resp_with_history]

tests/llm/data/sharegpt.jsonl

Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+{"system": "00000", "conversation": [{"human": "11111", "assistant": "22222"}]}
+{"conversation": [{"human": "aaaaa", "assistant": "bbbbb"}]}
+{"conversation": [{"human": "AAAAA", "assistant": "BBBBB"}, {"human": "CCCCC", "assistant": "DDDDD"}, {"human": "EEEEE", "assistant": "FFFFF"}]}

tests/llm/test_run.py

Lines changed: 2 additions & 1 deletion

@@ -206,7 +206,8 @@ def test_custom_dataset(self):
             # ignore citest error in github
             return
         train_dataset_fnames = [
-            'alpaca.csv', 'chatml.jsonl', 'swift_pre.jsonl', 'swift_single.csv', 'swift_multi.jsonl', 'swift_multi.json'
+            'alpaca.csv', 'chatml.jsonl', 'swift_pre.jsonl', 'swift_single.csv', 'swift_multi.jsonl',
+            'swift_multi.json', 'sharegpt.jsonl'
         ]
         val_dataset_fnames = [
             'alpaca.jsonl', 'alpaca2.csv', 'conversations.jsonl', 'swift_pre.csv', 'swift_single.jsonl'
