
Commit 8ee07ac

[Feature] Add A New Approach of Dataset Integration based on ChatML Template with Eval Examples (#2277)

* test
* add feature
* fix
* add chatml dataset
* fix
* fix
* fix
* add datasets, example and doc
* fix
* fix

1 parent 2d3bf74 commit 8ee07ac

File tree

22 files changed: +1074 −26 lines

docs/en/advanced_guides/custom_dataset.md

Lines changed: 127 additions & 9 deletions
@@ -1,14 +1,132 @@
-# Custom Dataset Tutorial
+# Dataset Quick Evaluation Tutorial
 
-This tutorial is intended for temporary and informal use of datasets. If the dataset requires long-term use or has specific needs for custom reading/inference/evaluation, it is strongly recommended to implement it according to the methods described in [new_dataset.md](./new_dataset.md).
+OpenCompass provides two paths for quickly evaluating user-provided data: a data format protocol based on ChatMLDataset and a data format protocol based on CustomDataset.
+Compared with the complete dataset-integration process described in [new_dataset.md](./new_dataset.md), these two evaluation paths are more convenient and efficient, entering the evaluation process directly without adding new configuration files.
+If you have specific needs for custom reading/inference/evaluation, however, it is still recommended to follow the complete integration process to add a new dataset.
 
-In this tutorial, we will introduce how to test a new dataset without implementing a config or modifying the OpenCompass source code. We support two types of tasks: multiple choice (`mcq`) and question & answer (`qa`). For `mcq`, both ppl and gen inferences are supported; for `qa`, gen inference is supported.
+## Data Format Protocol and Fast Evaluation Based on ChatMLDataset
 
-## Dataset Format
+OpenCompass has recently launched a dataset evaluation mode based on the ChatML dialogue template. It allows users to provide a dataset `.jsonl` file that conforms to the ChatML dialogue template and, after configuring the dataset information much as they would a model config, start evaluation directly.
+
+### Format Requirements for Data Files
+
+This evaluation method only supports data files in `.jsonl` format, and each sample must comply with the following format.
+
+The format of a text-only dataset with a simple structure:
+
+```jsonl
+{
+    "question": [
+        {
+            "role": "system",   # omittable
+            "content": Str
+        },
+        {
+            "role": "user",
+            "content": Str
+        }
+    ],
+    "answer": [
+        Str
+    ]
+}
+{
+    ...
+}
+...
+```
+
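An illustrative record matching this schema (the values here are invented for demonstration) might be:

```jsonl
{"question": [{"role": "user", "content": "What is 504 + 811?"}], "answer": ["1315"]}
```
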
+The format of multi-turn and multimodal datasets:
+
+```jsonl
+{
+    "question": [
+        {
+            "role": "system",
+            "content": Str,
+        },
+        {
+            "role": "user",
+            "content": Str or List
+                [
+                    {
+                        "type": Str,  # "image"
+                        "image_url": Str,
+                    },
+                    ...
+                    {
+                        "type": Str,  # "text"
+                        "text": Str,
+                    },
+                ]
+        },
+        {
+            "role": "assistant",
+            "content": Str
+        },
+        {
+            "role": "user",
+            "content": Str or List
+        },
+        ...
+    ],
+    "answer": [
+        Str,
+        Str,
+        ...
+    ]
+}
+{
+    ...
+}
+...
+```
+
+(As OpenCompass does not currently support multimodal evaluation, the template above is for reference only.)
+
+When `ChatMLDataset` reads `.jsonl` files, it uses `pydantic` to perform a simple format validation.
+You can use `tools/chatml_format_test.py` to check the data file you provide.
+
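The checker's command-line arguments are not shown in this commit; assuming it simply takes the path of the data file, an invocation might look like:

```bash
python tools/chatml_format_test.py YOUR_DATASET_PATH
```
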
+After the format check, add a configuration list named `chatml_datasets` to your running config file to convert the data file into an OpenCompass dataset at runtime.
+An example is as follows:
+
+```python
+chatml_datasets = [
+    dict(
+        abbr='YOUR_DATASET_NAME',
+        path='YOUR_DATASET_PATH',
+        evaluator=dict(
+            type='cascade_evaluator',
+            rule_evaluator=dict(
+                type='math_evaluator',
+            ),
+            llm_evaluator=dict(
+                type='llm_evaluator',
+                prompt="YOUR_JUDGE_PROMPT",
+                judge_cfg=dict(),  # your judge model config
+            )
+        ),
+        n=1,  # repeat number
+    ),
+]
+```
+
+The ChatML evaluation module currently provides four preset evaluators: `mcq_rule_evaluator` (for MCQ evaluation), `math_evaluator` (for LaTeX mathematical-formula evaluation), `llm_evaluator` (for answers that are open-ended or difficult to extract), and `cascade_evaluator` (an evaluation mode in which a rule evaluator and an LLM evaluator are cascaded together).
+
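As a sketch of the non-cascaded case (assuming the same fields apply when the LLM judge is used on its own), an entry might look like:

```python
chatml_datasets = [
    dict(
        abbr='YOUR_DATASET_NAME',
        path='YOUR_DATASET_PATH',
        evaluator=dict(
            type='llm_evaluator',
            prompt='YOUR_JUDGE_PROMPT',
            judge_cfg=dict(),  # your judge model config
        ),
        n=1,
    ),
]
```
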
+In addition, if you need to use ChatML-template datasets over the long term, you can contribute your dataset config to `opencompass/configs/chatml_datasets`.
+An evaluation example that calls these dataset configs is provided in `examples/eval_chatml_datasets.py`.
+
+## Data Format Protocol and Fast Evaluation Based on CustomDataset
+
+(This module is no longer updated, but it can still be used when quick command-line evaluation is needed.)
+
+This module supports two types of tasks: multiple choice (`mcq`) and question & answer (`qa`). For `mcq`, both `ppl` and `gen` inference are supported; for `qa`, `gen` inference is supported.
+
+### Dataset Format
 
 We support datasets in both `.jsonl` and `.csv` formats.
 
-### Multiple Choice (`mcq`)
+#### Multiple Choice (`mcq`)
 
 For `mcq` datasets, the default fields are as follows:
 
@@ -37,7 +155,7 @@ question,A,B,C,answer
 504+811+870+445=,2615,2630,2750,B
 ```
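
As an illustration, the same row expressed as a `.jsonl` record with the default `mcq` fields might look like:

```jsonl
{"question": "504+811+870+445=", "A": "2615", "B": "2630", "C": "2750", "answer": "B"}
```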
 
-### Question & Answer (`qa`)
+#### Question & Answer (`qa`)
 
 For `qa` datasets, the default fields are as follows:

@@ -65,7 +183,7 @@ question,answer
 649+215+412+495+220+738+989+452=,4170
 ```
 
-## Command Line List
+### Command Line List
 
 Custom datasets can be directly called for evaluation through the command line.
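
The concrete command is elided from this diff; as a sketch (the flag names are assumptions, not confirmed by this commit), a call might take the form:

```bash
python run.py \
    --models hf_internlm2_5_7b_chat \
    --custom-dataset-path xxx/test_qa.jsonl \
    --custom-dataset-data-type qa \
    --custom-dataset-infer-method gen
```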

@@ -92,7 +210,7 @@ set them based on the following logic:
 - If options like `A`, `B`, `C`, etc., can be parsed from the dataset file, it is considered an `mcq` dataset; otherwise, it is considered a `qa` dataset.
 - The default `infer_method` is `gen`.
 
-## Configuration File
+### Configuration File
 
 In the original configuration file, simply add a new item to the `datasets` variable. Custom datasets can be mixed with regular datasets.
 
@@ -103,7 +221,7 @@ datasets = [
 ]
 ```
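
The dataset item itself is elided from this diff; as a sketch (the field names are assumed from the options described above), it might look like:

```python
datasets = [
    # ... regular datasets ...
    {'path': 'xxx/test_mcq.csv', 'data_type': 'mcq', 'infer_method': 'ppl'},
]
```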

-## Supplemental Information for Dataset `.meta.json`
+### Supplemental Information for Dataset `.meta.json`
 
 OpenCompass will try to parse the input dataset file by default, so in most cases, the `.meta.json` file is **not necessary**. However, if the dataset field names are not the default ones, or custom prompt words are required, it should be specified in the `.meta.json` file.
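
As a sketch of such a file (these keys are assumptions, as the full example is elided from this diff):

```json
{
    "abbr": "YOUR_DATASET_NAME",
    "data_type": "mcq",
    "infer_method": "gen"
}
```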

docs/zh_cn/advanced_guides/custom_dataset.md

Lines changed: 123 additions & 9 deletions
@@ -1,14 +1,128 @@
-# Custom Dataset Tutorial
+# Dataset Quick Evaluation Tutorial
 
-This tutorial is intended only for temporary, informal use of datasets. If a dataset needs to be used long-term, or has custom reading/inference/evaluation requirements, it is strongly recommended to implement it following the approach described in [new_dataset.md](./new_dataset.md).
+OpenCompass provides two paths for quickly evaluating user-provided data: a data format protocol based on ChatMLDataset and a data format protocol based on CustomDataset.
+Compared with the complete dataset-integration process in [new_dataset.md](./new_dataset.md), these two fast evaluation paths are more convenient, entering the evaluation stage directly without adding new configuration files. If you have custom reading/inference/evaluation requirements, however, it is still recommended to add the new dataset through the complete integration process.
 
-In this tutorial, we introduce how to test a new dataset without implementing a config or modifying the OpenCompass source code. The supported task types are multiple choice (`mcq`) and question & answer (`qa`); `mcq` supports `ppl` and `gen` inference, and `qa` supports `gen` inference.
+## Data Format Protocol and Fast Evaluation Based on ChatMLDataset
 
-## Dataset Format
+OpenCompass's newly launched dataset evaluation mode based on the ChatML dialogue template allows users to provide a dataset `.jsonl` file conforming to the ChatML dialogue template and, after configuring the dataset information much as they would a model, start the evaluation task directly.
+
+### Format Requirements for Data Files
+
+This evaluation method only supports dataset files in `.jsonl` format, and every sample must follow the format below.
+
+A text-only dataset with a simple structure:
+
+```jsonl
+{
+    "question": [
+        {
+            "role": "system",   # omittable
+            "content": Str
+        },
+        {
+            "role": "user",
+            "content": Str
+        }
+    ],
+    "answer": [
+        Str
+    ]
+}
+{
+    ...
+}
+...
+```
+
+Datasets with more complex cases such as multi-turn and multimodal data (as OpenCompass does not yet support multimodal evaluation, this template is for reference only):
+
+```jsonl
+{
+    "question": [
+        {
+            "role": "system",
+            "content": Str,
+        },
+        {
+            "role": "user",
+            "content": Str or List
+                [
+                    {
+                        "type": Str,  # "image"
+                        "image_url": Str,
+                    },
+                    ...
+                    {
+                        "type": Str,  # "text"
+                        "text": Str,
+                    },
+                ]
+        },
+        {
+            "role": "assistant",
+            "content": Str
+        },
+        {
+            "role": "user",
+            "content": Str or List
+        },
+        ...
+    ],
+    "answer": [
+        Str,
+        Str,
+        ...
+    ]
+}
+{
+    ...
+}
+...
+```
+
+When reading `.jsonl` files, `ChatMLDataset` uses the `pydantic` library to perform a simple format validation.
+You can use `tools/chatml_format_test.py` to check the data file you provide.
+
+After the data check, add a configuration list named `chatml_datasets` to the running config file to convert the data file into an OpenCompass dataset at runtime. An example is as follows:
+
+```python
+chatml_datasets = [
+    dict(
+        abbr='YOUR_DATASET_NAME',
+        path='YOUR_DATASET_PATH',
+        evaluator=dict(
+            type='cascade_evaluator',
+            rule_evaluator=dict(
+                type='math_evaluator',
+            ),
+            llm_evaluator=dict(
+                type='llm_evaluator',
+                prompt="YOUR_JUDGE_PROMPT",
+                judge_cfg=dict(),  # your judge model config
+            )
+        ),
+        n=1,  # repeat number
+    ),
+]
+```
+
+The ChatML module currently provides four preset evaluators: `mcq_rule_evaluator` (for multiple-choice evaluation), `math_evaluator` (for LaTeX mathematical-formula evaluation), `llm_evaluator` (for open-ended questions or questions whose answers are hard to extract), and `cascade_evaluator` (an evaluation mode that cascades a rule evaluator with an LLM evaluator).
+
+In addition, if you need to use ChatML-template datasets over the long term, you can add your config to `opencompass/configs/chatml_datasets`.
+`examples/eval_chatml_datasets.py` also gives an evaluation example that calls this kind of dataset config.
+
+## Data Format Protocol and Fast Evaluation Based on CustomDataset
+
+(This module is no longer updated, but it can still be used when quick command-line evaluation is needed.)
+
+The CustomDataset-based data format protocol supports two task types, multiple choice (`mcq`) and question & answer (`qa`); `mcq` supports `ppl` and `gen` inference, and `qa` supports `gen` inference.
+
+### Dataset Format
 
 We support datasets in both `.jsonl` and `.csv` formats.
 
-### Multiple Choice (`mcq`)
+#### Multiple Choice (`mcq`)
 
 For multiple-choice (`mcq`) data, the default fields are as follows:
 
@@ -37,7 +151,7 @@ question,A,B,C,answer
 504+811+870+445=,2615,2630,2750,B
 ```
 
-### Question & Answer (`qa`)
+#### Question & Answer (`qa`)
 
 For question & answer (`qa`) data, the default fields are as follows:
 
@@ -65,7 +179,7 @@ question,answer
 649+215+412+495+220+738+989+452=,4170
 ```
 
-## Command Line List
+### Command Line List
 
 Custom datasets can be called directly from the command line to start evaluation.
 
@@ -90,7 +204,7 @@ python run.py \
 - If options such as `A`, `B`, `C` can be parsed from the dataset file, the dataset is treated as `mcq`; otherwise it is treated as `qa`.
 - The default `infer_method` is `gen`.
 
-## Configuration File
+### Configuration File
 
 In the original configuration file, simply add a new item to the `datasets` variable; custom datasets can also be mixed with regular datasets.
 
@@ -101,7 +215,7 @@ datasets = [
 ]
 ```
 
-## Supplemental Information for Dataset `.meta.json`
+### Supplemental Information for Dataset `.meta.json`
 
 By default, OpenCompass tries to parse the input dataset file, so in the vast majority of cases the `.meta.json` file is **not needed**. However, if the dataset's field names are not the default ones, or custom prompts are required, they must be specified in the `.meta.json` file.
 
examples/eval_chatml_datasets.py

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+# flake8: noqa
+
+from mmengine.config import read_base
+
+from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
+from opencompass.runners import LocalRunner
+from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
+
+#######################################################################
+#                      PART 0  Essential Configs                      #
+#######################################################################
+with read_base():
+
+    # Models (add your models here)
+    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
+        models as hf_internlm2_5_7b_chat_model
+
+    # Datasets
+    from opencompass.configs.chatml_datasets.MaScQA.MaScQA_gen import datasets as MaScQA_chatml
+    from opencompass.configs.chatml_datasets.CPsyExam.CPsyExam_gen import datasets as CPsyExam_chatml
+
+
+models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
+
+chatml_datasets = sum(
+    (v for k, v in locals().items() if k.endswith('_chatml')),
+    [],
+)
+
+# Your judge model configs here
+judge_cfg = dict()
+
+for dataset in chatml_datasets:
+    if dataset['evaluator']['type'] == 'llm_evaluator':
+        dataset['evaluator']['judge_cfg'] = judge_cfg
+    if dataset['evaluator']['type'] == 'cascade_evaluator':
+        dataset['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
+
+infer = dict(
+    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
+    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
+)
+
+eval = dict(
+    partitioner=dict(type=NaivePartitioner, n=8),
+    runner=dict(
+        type=LocalRunner, task=dict(type=OpenICLEvalTask), max_num_workers=32
+    ),
+)
+
+work_dir = 'outputs/ChatML_Datasets'
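
Assuming the standard OpenCompass entry point, this example config could be launched with:

```bash
python run.py examples/eval_chatml_datasets.py
```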
