8 changes: 6 additions & 2 deletions applications/zero_shot_text_classification/README.md
@@ -1,3 +1,5 @@
Simplified Chinese | [English](README_en.md)

# Zero-shot Text Classification

**Table of Contents**
@@ -27,7 +29,7 @@
**Highlights of the zero-shot text classification application:**

- **Comprehensive coverage 🎓:** Covers all mainstream text classification tasks and supports multi-task training, meeting developers' diverse text classification deployment needs.
- **Leading performance 🏃:** Built on the UTC model, with its outstanding classification performance, as the training base, providing strong zero-shot and few-shot learning ability.
- **Leading performance 🏃:** Built on the UTC model, with its outstanding classification performance, as the training base, providing strong zero-shot and few-shot learning ability. The model ranks first on both [ZeroCLUE](https://www.cluebenchmarks.com/zeroclue.html) and [FewCLUE](https://www.cluebenchmarks.com/fewclue.html) (as of January 11, 2023).
- **Simple and easy to use:** With Taskflow, three lines of code enable quick inference without any labeled data, and a single command starts text classification training and deployment, lowering the barrier to multi-task text classification in production.
- **Efficient tuning ✊:** Developers can easily get started with data labeling and model training without any machine learning background.

@@ -156,7 +158,9 @@ python -u -m paddle.distributed.launch --gpus "0,1" run_train.py \
* `seed`: Global random seed; defaults to 42.
* `model_name_or_path`: The pre-trained model used for few-shot training; defaults to "utc-large".
* `output_dir`: Required. The directory where the model is saved after training or compression; defaults to `None`.
* `dev_path`: Path to the development set; defaults to `None`.
* `dataset_path`: The directory containing the dataset files; defaults to `./data/`.
* `train_file`: File name of the training set; defaults to `train.txt`.
* `dev_file`: File name of the development set; defaults to `dev.txt`.
* `max_seq_len`: Maximum sequence length. When the input, including the labels, exceeds the maximum length, the input text is truncated automatically; the label part is never truncated. Defaults to 512.
* `per_device_train_batch_size`: Batch size per GPU core/CPU for training; defaults to 8.
* `per_device_eval_batch_size`: Batch size per GPU core/CPU for evaluation; defaults to 8.
253 changes: 253 additions & 0 deletions applications/zero_shot_text_classification/README_en.md
@@ -0,0 +1,253 @@
[简体中文](README.md) | English

# Zero-shot Text Classification

**Table of contents**
- [1. Zero-shot Text Classification Application](#1)
- [2. Quick Start](#2)
- [2.1 Code Structure](#21)
- [2.2 Data Annotation](#22)
- [2.3 Finetuning](#23)
- [2.4 Evaluation](#24)
- [2.5 Inference](#25)
- [2.6 Deployment](#26)
- [2.7 Experiments](#27)

<a name="1"></a>

## 1. Zero-shot Text Classification

This project provides an end-to-end application solution for universal text classification based on Universal Task Classification (UTC) finetuning, covering the full lifecycle of **data labeling, model training and model deployment**. We hope this guide helps you apply text classification techniques with zero-shot capability in your own products or models.

<div align="center">
<img width="700" alt="UTC model architecture" src="https://user-images.githubusercontent.com/25607475/212268807-66181bcb-d3f9-4086-9d4a-de4d1d0933c2.png">
</div>

Text classification refers to assigning a set of categories to a given input text. Despite the advantages of fine-tuning, applying text classification in practice remains challenging due to domain adaptation, the lack of labeled data, and so on. This PaddleNLP Zero-shot Text Classification Guide builds on our UTC model from the Unified Semantic Matching (USM) model series and provides an industrial-level solution that supports universal text classification tasks, including but not limited to **sentiment analysis, semantic matching, intent recognition and event detection**, allowing you to accomplish multiple tasks with a single model. In addition, our method achieves good generalization performance through multi-task pretraining.

**Highlights:**

- **Comprehensive Coverage**🎓: Covers various mainstream text classification tasks, including but not limited to sentiment analysis, semantic matching, intent recognition and event detection.

- **State-of-the-Art Performance**🏃: Strong performance from the UTC model, which ranks first on [ZeroCLUE](https://www.cluebenchmarks.com/zeroclue.html) and [FewCLUE](https://www.cluebenchmarks.com/fewclue.html) (as of January 11, 2023).

- **Easy to use**⚡: Three lines of code give you out-of-the-box zero-shot text classification with Taskflow; a single command starts model training or deployment.

- **Efficient Tuning**✊: Developers can easily get started with the data labeling and model training process without a background in Machine Learning.

<a name="2"></a>

## 2. Quick start

For a quick start, you can use ```paddlenlp.Taskflow``` directly out of the box, leveraging its zero-shot performance. For production use cases, we recommend labeling a small amount of data to fine-tune the model and further improve performance.
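
For instance, a minimal zero-shot call with Taskflow might look like the sketch below (the candidate labels here are illustrative; substitute your own):

```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> # The schema defines the candidate labels for zero-shot classification
>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议"]
>>> cls = Taskflow("zero_shot_text_classification", schema=schema)
>>> pprint(cls("中性粒细胞比率偏低"))
```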

<a name="21"></a>

### 2.1 Code structure

```shell
.
├── deploy/simple_serving/ # model deployment script
├── utils.py # data processing tools
├── run_train.py # model fine-tuning script
├── run_eval.py # model evaluation script
├── label_studio.py # data format conversion script
├── label_studio_text.md # data annotation instruction
└── README.md
```
<a name="22"></a>

### 2.2 Data labeling

We recommend using [Label Studio](https://labelstud.io/) for data labeling. You can export the labeled data from Label Studio and convert it into the required input format. Please refer to the [Label Studio Data Labeling Guide](./label_studio_text_en.md) for more details.

Here we provide a pre-labeled example dataset `Medical Question Intent Classification Dataset`, which you can download with the following command. We will show how to use the data conversion script to generate training/validation/test set files for fine-tuning.

Download the medical question intent classification dataset:

```shell
wget https://bj.bcebos.com/paddlenlp/datasets/utc-medical.tar.gz
tar -xvf utc-medical.tar.gz
mv utc-medical data
rm utc-medical.tar.gz
```

Generate training/validation set files:

```shell
python label_studio.py \
    --label_studio_file ./data/label_studio.json \
    --save_dir ./data \
    --splits 0.8 0.1 0.1 \
    --options ./data/label.txt
```

For multi-task training, you can convert each dataset with the script separately and move the outputs to the same directory; the expected file format is sketched below.
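
After conversion, each line of `train.txt`/`dev.txt`/`test.txt` holds one JSON example, where `labels` lists the indices of the correct choices. A minimal sketch of the expected format is shown below; the field names are assumed from the conversion script, so verify them against your generated files:

```json
{"text_a": "中性粒细胞比率偏低", "text_b": "", "question": "", "choices": ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"], "labels": [3]}
```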

<a name="23"></a>

### 2.3 Finetuning

Use the following command to fine-tune the model using `utc-large` as the pre-trained model, and save the fine-tuned model to `./checkpoint/model_best/`:

Single GPU:

```shell
python run_train.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 1000 \
    --model_name_or_path utc-large \
    --output_dir ./checkpoint/model_best \
    --dataset_path ./data/ \
    --max_seq_len 512 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 20 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model macro_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1
```

Multiple GPUs:

```shell
python -u -m paddle.distributed.launch --gpus "0,1" run_train.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 1000 \
    --model_name_or_path utc-large \
    --output_dir ./checkpoint/model_best \
    --dataset_path ./data/ \
    --max_seq_len 512 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 20 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model macro_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1
```

Parameters:

* `device`: The device used for training, either 'cpu' or 'gpu'; the default is 'gpu'.
* `logging_steps`: The interval steps of log printing during training, the default is 10.
* `save_steps`: The number of interval steps to save the model checkpoint during training, the default is 100.
* `eval_steps`: The number of interval steps between evaluations during training; the default is 100.
* `seed`: global random seed, default is 42.
* `model_name_or_path`: The pre-trained model used for few shot training. Defaults to "utc-large".
* `output_dir`: Required. The directory where the model is saved after training or compression; the default is `None`.
* `dataset_path`: The directory to dataset; defaults to `./data`.
* `train_file`: Training file name; defaults to `train.txt`.
* `dev_file`: Development file name; defaults to `dev.txt`.
* `max_seq_len`: The maximum sequence length of the text and label candidates. When the input exceeds the maximum length, the input text is truncated automatically while the labels are kept intact. The default is 512.
* `per_device_train_batch_size`: The batch size of each GPU core/CPU used for training, the default is 8.
* `per_device_eval_batch_size`: Batch size per GPU core/CPU for evaluation, default is 8.
* `num_train_epochs`: Number of training epochs; with early stopping, a larger value such as 100 can be used. The default is 10.
* `learning_rate`: The maximum learning rate for training. UTC recommends setting it to 1e-5; the default is 3e-5.
* `do_train`: Whether to run fine-tuning; pass this flag to enable training. Not set by default.
* `do_eval`: Whether to run evaluation; pass this flag to enable evaluation. Not set by default.
* `do_export`: Whether to export a static graph model; pass this flag to enable export. Not set by default.
* `export_model_dir`: The directory to export the static graph model to; the default is `./checkpoint/model_best`.
* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, use it to continue training.
* `disable_tqdm`: Whether to disable the tqdm progress bar.
* `metric_for_best_model`: The metric used to select the best model. UTC recommends `macro_f1`; the default is None.
* `load_best_model_at_end`: Whether to load the best model at the end of training, usually used together with `metric_for_best_model`; the default is False.
* `save_total_limit`: If set, limits the total number of checkpoints kept, deleting older checkpoints from `output_dir`; the default is None.

<a name="24"></a>

### 2.4 Evaluation

Model evaluation:

```shell
python run_eval.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/test.txt \
    --per_device_eval_batch_size 2 \
    --max_seq_len 512 \
    --output_dir ./checkpoint_test
```

Parameters:

- `model_path`: The path of the model folder for evaluation, which must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`.
- `test_path`: The test set file for evaluation.
- `per_device_eval_batch_size`: Batch size per GPU core/CPU for evaluation; adjust it according to your hardware. The default is 8.
- `max_seq_len`: The maximum segmentation length of the text and label candidates. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512.

<a name="25"></a>

### 2.5 Inference

You can use `paddlenlp.Taskflow` to load your custom model by specifying the path of the model weight file through `task_path`.

```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"]
>>> my_cls = Taskflow("zero_shot_text_classification", schema=schema, task_path='./checkpoint/model_best', precision="fp16")
>>> pprint(my_cls("中性粒细胞比率偏低"))
```
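
Taskflow also accepts a list of texts for batch inference; a usage sketch (the inputs here are illustrative):

```python
>>> # Batch inference: pass a list of texts in one call
>>> pprint(my_cls(["中性粒细胞比率偏低", "头孢和酒能一起吃吗"]))
```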

<a name="26"></a>

### 2.6 Deployment

We provide a deployment solution based on PaddleNLP SimpleServing, with which you can build your own deployment service in a few lines of code.

```python
# Save at server.py
from paddlenlp import SimpleServer, Taskflow

schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议"]
utc = Taskflow("zero_shot_text_classification",
               schema=schema,
               task_path="../../checkpoint/model_best/",
               precision="fp32")
app = SimpleServer()
app.register_taskflow("taskflow/utc", utc)
```

```shell
# Start the server
paddlenlp server server:app --host 0.0.0.0 --port 8990
```
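
Once the server is running, you can query it over HTTP. Below is a minimal client sketch, assuming the default SimpleServing payload format (`{"data": {"text": [...]}}`); the exact schema may vary across PaddleNLP versions:

```python
# client.py -- a minimal sketch, assuming the default SimpleServing
# request format; adjust the payload if your PaddleNLP version differs.
import json

import requests

url = "http://0.0.0.0:8990/taskflow/utc"  # endpoint registered in server.py
headers = {"Content-Type": "application/json"}
texts = ["中性粒细胞比率偏低"]

payload = {"data": {"text": texts}}
response = requests.post(url, headers=headers, data=json.dumps(payload))
print(json.loads(response.text))
```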

It supports FP16 (half-precision) inference and multi-process deployment for acceleration.

<a name="27"></a>

### 2.7 Experiments

The results reported here are based on the development set of KUAKE-QIC.

| | Accuracy | Micro F1 | Macro F1 |
| :------: | :--------: | :--------: | :--------: |
| 0-shot | 28.69 | 87.03 | 60.90 |
| 5-shot | 64.75 | 93.34 | 80.33 |
| 10-shot | 65.88 | 93.76 | 81.34 |
| full-set | 81.81 | 96.65 | 89.87 |

Here, k-shot means that k annotated samples per label were used for training.
applications/zero_shot_text_classification/label_studio_text.md
@@ -1,3 +1,5 @@
Simplified Chinese | [English](label_studio_text_en.md)

# Label Studio Guide for Text Classification Tasks

**Table of Contents**
@@ -105,7 +107,7 @@ label-studio start

Rename the exported file to ``label_studio.json`` and place it in the ``./data`` directory. It can be converted to the UTC data format with the [label_studio.py](./label_studio.py) script.

During data conversion, the label candidates used for model training are constructed automatically. For example, in medical intent classification the label candidates are ``["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"]``, which can be configured via the ``options`` parameter.
During data conversion, you also need to provide the label candidates in the `./data/label.txt` file, one label per line, as in the sketch below. For example, in medical intent classification the label candidates are ``["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"]``; they can also be configured directly via the ``options`` parameter.
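
For example, a ``./data/label.txt`` for the medical intent task would look like this (one label per line):

```text
病情诊断
治疗方案
病因分析
指标解读
就医建议
疾病表述
后果表述
注意事项
功效作用
医疗费用
其他
```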

```shell
python label_studio.py \
@@ -122,7 +124,7 @@ python label_studio.py \
- ``label_studio_file``: The data annotation file exported from Label Studio.
- ``save_dir``: The directory where the training data is saved; defaults to the ``data`` directory.
- ``splits``: The proportions of the training and dev sets when splitting the dataset. Defaults to [0.8, 0.1, 0.1], i.e. the data is split into training, dev, and test sets at a ratio of ``8:1:1``.
- ``options``: The category labels for the classification task. If the input is a file, the file contains one label per line. Defaults to None, in which case the label candidates are constructed automatically from the input data, which is slow for large datasets.
- ``options``: The category labels for the classification task. If the input is a file, the file contains one label per line.
- ``is_shuffle``: Whether to shuffle the dataset; defaults to True.
- ``seed``: Random seed; defaults to 1000.
