
Commit 3c7b524

support sharegpt dataset format (modelscope#1052)
1 parent 7a4eb5d commit 3c7b524

File tree

9 files changed: +82 -48 lines changed

- docs/source/LLM/命令行参数.md
- docs/source/LLM/自定义与拓展.md
- docs/source_en/LLM/Command-line-parameters.md
- docs/source_en/LLM/Customization.md
- swift/llm/utils/dataset.py
- swift/llm/utils/preprocess.py
- swift/ui/llm_infer/llm_infer.py
- tests/llm/data/sharegpt.jsonl
- tests/llm/test_run.py

docs/source/LLM/命令行参数.md

Lines changed: 2 additions & 2 deletions

@@ -35,7 +35,7 @@
 - `--dataset`: The dataset(s) used for training, default `[]`. The available datasets are listed in [Supported Models and Datasets](支持的模型和数据集.md#数据集). To train on multiple datasets, separate them with ',' or ' ', e.g. `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. ModelScope Hub, HuggingFace Hub and local paths are supported, as are subset selection and dataset sampling. Each dataset is specified as `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`; in the simplest case only dataset_name, dataset_id or dataset_path is needed. Custom datasets are described in the [customization and extension document](自定义与拓展.md#自定义数据集).
   - Support for the MS and HF hubs and for dataset_sample, e.g. 'MS::alpaca-zh#2000', 'HF::jd-sentiment-zh#2000' (the default hub is controlled by the `USE_HF` environment variable and defaults to MS).
   - Finer-grained control over subsets: by default the subsets specified at registration are used ('default' if none were specified), e.g. 'sharegpt-gpt4'. If subsets are given, only those subsets are used, e.g. 'sharegpt-gpt4:default/V3_format#2000'; separate them with '/'.
-  - Support for dataset_id, e.g. 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id is already registered, the preprocessing function, subsets, split, etc. from registration are used. Otherwise `SmartPreprocessor` is used, which supports 4 dataset formats, with the 'default' subset and split set to 'train'. The supported dataset formats are described in the [customization and extension document](自定义与拓展.md#自定义数据集).
+  - Support for dataset_id, e.g. 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id is already registered, the preprocessing function, subsets, split, etc. from registration are used. Otherwise `SmartPreprocessor` is used, which supports 5 dataset formats, with the 'default' subset and split set to 'train'. The supported dataset formats are described in the [customization and extension document](自定义与拓展.md#自定义数据集).
   - Support for dataset_path, e.g. '1.jsonl#5000' (a relative path is resolved against the working directory).
 - `--val_dataset`: Specifies a separate validation set, in the same format as the `dataset` argument, default `[]`. If this argument is used, `dataset_test_ratio` no longer takes effect.
 - `--dataset_seed`: Seed for dataset processing, default `42`. It exists as a random_state and does not affect the global seed.

@@ -294,7 +294,7 @@ The export parameters inherit the infer parameters; in addition, the following parameters are added:
 - `--dataset`: Already defined in InferArguments; for export it denotes the quantization dataset. Default `[]`. More details, including how to customize the quantization dataset, are given in the [LLM quantization documentation](LLM量化文档.md).
 - `--quant_n_samples`: Quantization parameter, default `256`. With `--quant_method awq`, if OOM occurs during quantization you can moderately lower `--quant_n_samples` and `--quant_seqlen`. `--quant_method gptq` generally does not run out of memory during quantization.
 - `--quant_seqlen`: Quantization parameter, default `2048`.
-- `--quant_device_map`: Default `'cpu'`, to save GPU memory. You can specify 'cuda:0', 'auto', 'cpu', etc., i.e. the device onto which the model is loaded for quantization.
+- `--quant_device_map`: Default `'cpu'`, to save GPU memory. You can specify 'cuda:0', 'auto', 'cpu', etc., i.e. the device onto which the model is loaded for quantization. This argument is independent of the device that actually performs the quantization; for example, awq and gptq quantize on cuda:0.
 - `--quant_output_dir`: Default `None`; the default quant_output_dir is printed on the command line.
 - `--push_to_hub`: Default `False`. Whether to push the final `ckpt_dir` to the ModelScope Hub. If `merge_lora` is specified, the full parameters are pushed; if `quant_bits` is also specified, the quantized model is pushed.
 - `--hub_model_id`: Default `None`. The model_id on the ModelScope Hub to push to. Required if `push_to_hub` is set to True.

docs/source/LLM/自定义与拓展.md

Lines changed: 8 additions & 0 deletions

@@ -87,6 +87,14 @@ system,query,response
 
 **Format 4:**
 
+```jsonl
+{"system": "00000", "conversation": [{"human": "11111", "assistant": "22222"}]}
+{"conversation": [{"human": "aaaaa", "assistant": "bbbbb"}]}
+{"conversation": [{"human": "AAAAA", "assistant": "BBBBB"}, {"human": "CCCCC", "assistant": "DDDDD"}, {"human": "EEEEE", "assistant": "FFFFF"}]}
+```
+
+**Format 5:**
+
 ```csv
 system,instruction,input,output
 00000,11111,22222,33333

docs/source_en/LLM/Command-line-parameters.md

Lines changed: 2 additions & 2 deletions

@@ -33,7 +33,7 @@
 - `--dataset`: Used to select the training dataset, default is `[]`. You can see the list of available datasets [here](Supported-models-datasets.md#Datasets). If you need to train with multiple datasets, you can use ',' or ' ' to separate them, for example: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. It supports ModelScope Hub/HuggingFace Hub/local paths, subset selection, and dataset sampling. The specified format for each dataset is as follows: `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. The simplest case requires specifying only dataset_name, dataset_id, or dataset_path. Customizing datasets is covered in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
   - Supports the MS and HF hubs, as well as dataset_sample. For example, 'MS::alpaca-zh#2000', 'HF::jd-sentiment-zh#2000' (the default hub used is controlled by the `USE_HF` environment variable, default is MS).
   - More fine-grained control over subsets: it uses the subsets specified during registration by default (if not specified during registration, it uses 'default'). For example, 'sharegpt-gpt4'. If subsets are specified, it uses the corresponding subsets of the dataset. For example, 'sharegpt-gpt4:default/V3_format#2000'. Separate them with '/'.
-  - Support for dataset_id. For example, 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id has been registered, it will use the preprocessing function, subsets, split, etc. specified during registration. Otherwise, it will use `SmartPreprocessor`, which supports 4 dataset formats, with 'default' subsets and split set to 'train'. The supported dataset formats can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
+  - Support for dataset_id. For example, 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id has been registered, it will use the preprocessing function, subsets, split, etc. specified during registration. Otherwise, it will use `SmartPreprocessor`, which supports 5 dataset formats, with 'default' subsets and split set to 'train'. The supported dataset formats can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
   - Support for dataset_path. For example, '1.jsonl#5000' (if it is a relative path, it is relative to the running directory).
 - `--val_dataset`: Specify separate validation datasets in the same format as the `dataset` argument, default is `[]`. If `val_dataset` is used, `dataset_test_ratio` will be ignored.
 - `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as random_state, does not affect the global seed.

@@ -292,7 +292,7 @@ export parameters inherit from infer parameters, with the following added parame
 - `--dataset`: This parameter is already defined in InferArguments; for export it means the quantization dataset. Default is `[]`. More details, including how to customize the quantization dataset, can be found in the [LLM Quantization Documentation](LLM-quantization.md).
 - `--quant_n_samples`: Quantization parameter, default is `256`. When `--quant_method awq` is set, if OOM occurs during quantization, you can moderately reduce `--quant_n_samples` and `--quant_seqlen`. `--quant_method gptq` generally does not encounter quantization OOM.
 - `--quant_seqlen`: Quantization parameter, default is `2048`.
-- `--quant_device_map`: Default is `'cpu'`, to save memory. You can specify 'cuda:0', 'auto', 'cpu', etc., representing the device onto which the model is loaded during quantization.
+- `--quant_device_map`: Default is `'cpu'`, to save memory. You can specify 'cuda:0', 'auto', 'cpu', etc., representing the device onto which the model is loaded during quantization. This parameter is independent of the device on which the quantization itself runs; for example, AWQ and GPTQ carry out quantization on cuda:0.
 - `--quant_output_dir`: Default is `None`; the default quant_output_dir will be printed in the command line.
 - `--push_to_hub`: Default is `False`. Whether to push the final `ckpt_dir` to ModelScope Hub. If you specify `merge_lora`, full parameters will be pushed; if you also specify `quant_bits`, the quantized model will be pushed.
 - `--hub_model_id`: Default is `None`. The model_id to push to on ModelScope Hub. If `push_to_hub` is set to True, this parameter must be set.
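
The dataset specification syntax documented above can also be driven from Python. Below is a minimal sketch, assuming ms-swift is installed and that `SftArguments`/`sft_main` are importable from `swift.llm`; the model type, sample counts, and the local file name are placeholders, not part of this commit.

```python
# Minimal sketch of the --dataset specification syntax via the Python API.
# Assumptions: ms-swift is installed; SftArguments/sft_main are exported by
# swift.llm; the model type and file names below are illustrative only.
from swift.llm import SftArguments, sft_main

args = SftArguments(
    model_type='qwen-7b-chat',  # illustrative model type
    dataset=[
        'MS::alpaca-zh#2000',                    # ModelScope hub dataset, sampled to 2000 rows
        'sharegpt-gpt4:default/V3_format#2000',  # registered dataset with explicit subsets
        'HF::shibing624/alpaca-zh#2000',         # HuggingFace dataset_id
        'my_sharegpt.jsonl#500',                 # local file in the new sharegpt format
    ],
)
# sft_main(args)  # uncomment to launch fine-tuning
```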

docs/source_en/LLM/Customization.md

Lines changed: 8 additions & 0 deletions

@@ -93,6 +93,14 @@ Multi-Round Dialogue
 
 **Format 4:**
 
+```jsonl
+{"system": "00000", "conversation": [{"human": "11111", "assistant": "22222"}]}
+{"conversation": [{"human": "aaaaa", "assistant": "bbbbb"}]}
+{"conversation": [{"human": "AAAAA", "assistant": "BBBBB"}, {"human": "CCCCC", "assistant": "DDDDD"}, {"human": "EEEEE", "assistant": "FFFFF"}]}
+```
+
+**Format 5:**
+
 ```csv
 system,instruction,input,output
 00000,11111,22222,33333
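
A quick way to try the new Format 4 is to write a few rows to a local `.jsonl` file and pass its path through `--dataset`; as the `swift/llm/utils/preprocess.py` change below shows, `SmartPreprocessor` recognizes this format by the `conversation` key. A minimal, standard-library-only sketch; the file name is arbitrary:

```python
# Write the Format 4 sample rows to a local jsonl file; it can then be passed
# to training as e.g. `--dataset my_sharegpt.jsonl#3` (file name is arbitrary).
import json

rows = [
    {'system': '00000', 'conversation': [{'human': '11111', 'assistant': '22222'}]},
    {'conversation': [{'human': 'aaaaa', 'assistant': 'bbbbb'}]},
    {'conversation': [{'human': 'AAAAA', 'assistant': 'BBBBB'},
                      {'human': 'CCCCC', 'assistant': 'DDDDD'},
                      {'human': 'EEEEE', 'assistant': 'FFFFF'}]},
]
with open('my_sharegpt.jsonl', 'w', encoding='utf-8') as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + '\n')
```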

swift/llm/utils/dataset.py

Lines changed: 14 additions & 28 deletions

@@ -20,7 +20,8 @@
 
 from swift.utils import get_logger, get_seed, is_dist, is_local_master, read_from_jsonl, transform_jsonl_to_df
 from .preprocess import (AlpacaPreprocessor, ClsPreprocessor, ComposePreprocessor, ConversationsPreprocessor,
-                         PreprocessFunc, RenameColumnsPreprocessor, SmartPreprocessor, TextGenerationPreprocessor)
+                         PreprocessFunc, RenameColumnsPreprocessor, SmartPreprocessor, TextGenerationPreprocessor,
+                         preprocess_sharegpt)
 from .template import History
 from .utils import download_dataset
 

@@ -213,21 +214,26 @@ def register_local_dataset(
 
 
 def register_dataset_info(dataset_name: str, d_info: Dict[str, Any], **kwargs) -> None:
-    if 'dataset_path' in d_info:
-        base_dir = kwargs.pop('base_dir', None)
-        register_local_dataset(dataset_name, d_info.pop('dataset_path', None), base_dir, **d_info)
-        return
-
-    assert 'dataset_id' in d_info or 'hf_dataset_id' in d_info
     preprocess_func = None
     if 'columns' in d_info:
         preprocess_func = RenameColumnsPreprocessor(d_info['columns'])
         d_info.pop('columns')
+        d_info['preprocess_func'] = preprocess_func
     elif 'conversations' in d_info:
         preprocess_func = ConversationsPreprocessor(**d_info['conversations'])
         d_info.pop('conversations')
+        d_info['preprocess_func'] = preprocess_func
+
+    if 'dataset_path' in d_info:
+        base_dir = kwargs.pop('base_dir', None)
+        register_local_dataset(dataset_name, d_info.pop('dataset_path', None), base_dir, **d_info)
+        return
+
+    assert 'dataset_id' in d_info or 'hf_dataset_id' in d_info
+
     dataset_id = d_info.pop('dataset_id', None)
     subsets = d_info.pop('subsets', None)
+    preprocess_func = d_info.pop('preprocess_func', None)
     register_dataset(dataset_name, dataset_id, subsets, preprocess_func, get_dataset_from_repo, **d_info, exist_ok=True)
 
 

@@ -809,30 +815,10 @@ def reorganize_row(row):
     get_dataset_from_repo,
     tags=['rlhf', 'dpo', 'pairwise'])
 
-
-def _preprocess_sharegpt(dataset: HfDataset) -> HfDataset:
-    query = []
-    response = []
-    history: List[History] = []
-    for d in tqdm(dataset):
-        if isinstance(d['conversation'], str):
-            try:
-                conversation = ast.literal_eval(d['conversation'])
-            except SyntaxError:
-                continue
-        query.append(conversation[-1]['human'])
-        response.append(conversation[-1]['assistant'])
-        h = []
-        for c in conversation[:-1]:
-            h.append([c['human'], c['assistant']])
-        history.append(h)
-    return HfDataset.from_dict({'query': query, 'response': response, 'history': history})
-
-
 register_dataset(
     DatasetName.sharegpt,
     'huangjintao/sharegpt', ['common-zh', 'computer-zh', 'unknow-zh', 'common-en', 'computer-en'],
-    _preprocess_sharegpt,
+    preprocess_sharegpt,
     get_dataset_from_repo,
     tags=['chat', 'general', 'multi-round'])
 
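The reordering in `register_dataset_info` stores the preprocessor built from a `columns` or `conversations` entry into `d_info` before the local-dataset branch, so a `dataset_path` entry can now carry a column mapping as well. A hedged sketch of such an entry; the dataset name, file name, and column names are made up for illustration:

```python
# Hypothetical dataset_info entry; the key names follow register_dataset_info
# above, while the dataset name, csv file, and column names are illustrative.
from swift.llm.utils.dataset import register_dataset_info

d_info = {
    'dataset_path': 'my_local_data.csv',                   # handled by register_local_dataset
    'columns': {'prompt': 'query', 'answer': 'response'},  # wrapped into a RenameColumnsPreprocessor
}
# register_dataset_info('my-local-data', d_info, base_dir='.')
```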
swift/llm/utils/preprocess.py

Lines changed: 41 additions & 2 deletions

@@ -144,7 +144,7 @@ def __call__(self, dataset: HfDataset) -> HfDataset:
             'query': query,
             'response': response,
         })
-        dataset = HfDataset.from_dict({**kwargs})
+        dataset = HfDataset.from_dict(kwargs)
         return dataset
 
 

@@ -170,6 +170,41 @@ def __call__(self, dataset: HfDataset) -> HfDataset:
         return dataset
 
 
+def preprocess_sharegpt(dataset: HfDataset) -> HfDataset:
+    query = []
+    response = []
+    system: List[Optional[str]] = []
+    has_system = False
+    history: List[History] = []
+    has_history = False
+    for d in tqdm(dataset):
+        if isinstance(d['conversation'], str):
+            try:
+                conversation = ast.literal_eval(d['conversation'])
+            except SyntaxError:
+                continue
+        else:
+            conversation = d['conversation']
+        query.append(conversation[-1]['human'])
+        response.append(conversation[-1]['assistant'])
+        h = []
+        for c in conversation[:-1]:
+            h.append([c['human'], c['assistant']])
+        if len(h) > 0:
+            has_history = True
+        history.append(h)
+        sys = d.get('system')
+        if sys is not None:
+            has_system = True
+        system.append(sys)
+    kwargs = {'query': query, 'response': response}
+    if has_history:
+        kwargs['history'] = history
+    if has_system:
+        kwargs['system'] = system
+    return HfDataset.from_dict(kwargs)
+
+
 class SmartPreprocessor:
 
     def __init__(self) -> None:

@@ -182,14 +217,18 @@ def __init__(self) -> None:
                 'required': ['instruction', 'output'],
                 'preprocessor': AlpacaPreprocessor()
             },
-            'conversations': {
+            'conversations': {  # qwen
                 'required': ['conversations'],
                 'preprocessor': ConversationsPreprocessor()
             },
             'chatml': {
                 'required': ['messages'],
                 'preprocessor':
                 ConversationsPreprocessor(conversations_key='messages', from_key='role', value_key='content')
+            },
+            'sharegpt': {
+                'required': ['conversation'],
+                'preprocessor': preprocess_sharegpt
             }
         }
 
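For reference, here is a small sketch of what the new `preprocess_sharegpt` produces, assuming ms-swift and the `datasets` package are installed; the rows mirror the `tests/llm/data/sharegpt.jsonl` fixture added by this commit, and the expected output is shown in the comments:

```python
# Sketch: run preprocess_sharegpt on an in-memory dataset whose rows mirror
# tests/llm/data/sharegpt.jsonl; multi-turn rows become query/response/history.
from datasets import Dataset as HfDataset

from swift.llm.utils.preprocess import preprocess_sharegpt

raw = HfDataset.from_list([
    {'system': '00000', 'conversation': [{'human': '11111', 'assistant': '22222'}]},
    {'system': None, 'conversation': [{'human': 'AAAAA', 'assistant': 'BBBBB'},
                                      {'human': 'CCCCC', 'assistant': 'DDDDD'}]},
])
processed = preprocess_sharegpt(raw)
print(processed.column_names)                          # ['query', 'response', 'history', 'system']
print(processed[1]['query'], processed[1]['history'])  # CCCCC [['AAAAA', 'BBBBB']]
```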
swift/ui/llm_infer/llm_infer.py

Lines changed: 2 additions & 13 deletions

@@ -362,26 +362,15 @@ def send_message(cls, running_task, model_and_template, template_type, prompt: s
         stream_resp_with_history = ''
         if not template_type.endswith('generation'):
             stream_resp = inference_client(
-                model_type,
-                prompt,
-                old_history,
-                system=system,
-                port=args['port'],
-                adapter_name='default-lora' if sft_type in ('lora', 'longlora') else None,
-                request_config=request_config)
+                model_type, prompt, old_history, system=system, port=args['port'], request_config=request_config)
             for chunk in stream_resp:
                 stream_resp_with_history += chunk.choices[0].delta.content
                 qr_pair = [prompt, stream_resp_with_history]
                 total_history = old_history + [qr_pair]
                 yield '', total_history
         else:
             request_config.max_tokens = max_new_tokens
-            stream_resp = inference_client(
-                model_type,
-                prompt,
-                port=args['port'],
-                adapter_name='default-lora' if sft_type in ('lora', 'longlora') else None,
-                request_config=request_config)
+            stream_resp = inference_client(model_type, prompt, port=args['port'], request_config=request_config)
             for chunk in stream_resp:
                 stream_resp_with_history += chunk.choices[0].text
                 qr_pair = [prompt, stream_resp_with_history]

tests/llm/data/sharegpt.jsonl

Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+{"system": "00000", "conversation": [{"human": "11111", "assistant": "22222"}]}
+{"conversation": [{"human": "aaaaa", "assistant": "bbbbb"}]}
+{"conversation": [{"human": "AAAAA", "assistant": "BBBBB"}, {"human": "CCCCC", "assistant": "DDDDD"}, {"human": "EEEEE", "assistant": "FFFFF"}]}

tests/llm/test_run.py

Lines changed: 2 additions & 1 deletion

@@ -206,7 +206,8 @@ def test_custom_dataset(self):
             # ignore citest error in github
             return
         train_dataset_fnames = [
-            'alpaca.csv', 'chatml.jsonl', 'swift_pre.jsonl', 'swift_single.csv', 'swift_multi.jsonl', 'swift_multi.json'
+            'alpaca.csv', 'chatml.jsonl', 'swift_pre.jsonl', 'swift_single.csv', 'swift_multi.jsonl',
+            'swift_multi.json', 'sharegpt.jsonl'
         ]
         val_dataset_fnames = [
             'alpaca.jsonl', 'alpaca2.csv', 'conversations.jsonl', 'swift_pre.csv', 'swift_single.jsonl'
