
Commit a535904

Merge remote-tracking branch 'upstream/develop' into fix_model_download
2 parents f4a0340 + 82a303f commit a535904

File tree

30 files changed: +866 -81 lines changed
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+name: FastTokenizer
+
+on:
+  push:
+    paths:
+      - 'fast_tokenizer/*'
+  pull_request:
+    paths:
+      - 'fast_tokenizer/*'
+
+jobs:
+  fast_tokenizer_cpp:
+    name: fast_tokenizer_cpp
+    runs-on: ubuntu-22.04
+    permissions:
+      pull-requests: write
+      contents: read
+      id-token: write
+    steps:
+      - uses: actions/checkout@v3
+      - name: compile
+        working-directory: ./fast_tokenizer
+        run: make fast_tokenizer_cpp_compile
+      - name: test
+        working-directory: ./fast_tokenizer
+        run: make fast_tokenizer_cpp_test
+  fast_tokenizer_python38:
+    name: fast_tokenizer_python38
+    runs-on: ubuntu-22.04
+    permissions:
+      pull-requests: write
+      contents: read
+      id-token: write
+    steps:
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v1
+        with:
+          python-version: 3.8
+      - name: install
+        working-directory: ./fast_tokenizer
+        run: make fast_tokenizer_python_install
+      - name: compile
+        working-directory: ./fast_tokenizer
+        run: make fast_tokenizer_python_compile
+      - name: test
+        working-directory: ./fast_tokenizer
+        run: make fast_tokenizer_python_test
+

.readthedocs.yaml

Lines changed: 2 additions & 2 deletions
@@ -21,8 +21,8 @@ sphinx:
 python:
   version: 3.8
   install:
-    - requirements: docs/requirements.txt
     - requirements: requirements.txt
+    - requirements: docs/requirements.txt
     - method: setuptools
       path: .
-  system_packages: true
+  system_packages: true

applications/information_extraction/label_studio_text.md

Lines changed: 2 additions & 2 deletions
@@ -56,10 +56,10 @@ label-studio start
 <img src=https://user-images.githubusercontent.com/40840292/199661638-48a870eb-a1df-4db5-82b9-bc8e985f5190.png height=350 width=1200 />
 </div>

-- For **text classification and sentence-level sentiment classification** tasks, select ``Relation Extraction`.
+- For **text classification and sentence-level sentiment classification** tasks, select ``Text Classification``

 <div align="center">
-<img src=https://user-images.githubusercontent.com/40840292/199661638-48a870eb-a1df-4db5-82b9-bc8e985f5190.png height=350 width=1200 />
+<img src=https://user-images.githubusercontent.com/40840292/212617773-34534e68-4544-4b24-8f39-ae7f9573d397.png height=420 width=1200 />
 </div>

 - Add labels (this step can also be skipped and configured later under Setting/Labeling Interface)

applications/information_extraction/label_studio_text_en.md

Lines changed: 2 additions & 2 deletions
@@ -57,10 +57,10 @@ Click Create to start creating a new project, fill in the project name, descript
 <img src=https://user-images.githubusercontent.com/40840292/199661638-48a870eb-a1df-4db5-82b9-bc8e985f5190.png height=350 width=1200 />
 </div>

-- For **Text classification, Sentence-level sentiment classification** tasks please select ``Relation Extraction`.
+- For **Text classification, Sentence-level sentiment classification** tasks please select ``Text Classification``.

 <div align="center">
-<img src=https://user-images.githubusercontent.com/40840292/199661638-48a870eb-a1df-4db5-82b9-bc8e985f5190.png height=350 width=1200 />
+<img src=https://user-images.githubusercontent.com/40840292/212617773-34534e68-4544-4b24-8f39-ae7f9573d397.png height=420 width=1200 />
 </div>

 - Define labels

applications/zero_shot_text_classification/README.md

Lines changed: 12 additions & 6 deletions
@@ -1,3 +1,5 @@
+简体中文 | [English](README_en.md)
+
 # Zero-shot Text Classification

 **Table of Contents**
@@ -27,7 +29,7 @@
 **Highlights of the zero-shot text classification application:**

 - **Comprehensive scenario coverage 🎓:** Covers all mainstream text classification tasks and supports multi-task training, meeting developers' diverse needs for deploying text classification.
-- **Leading performance 🏃:** Built on the UTC model, which delivers outstanding classification results, as the training backbone, providing strong zero-shot and few-shot learning capability.
+- **Leading performance 🏃:** Built on the UTC model, which delivers outstanding classification results, as the training backbone, providing strong zero-shot and few-shot learning capability. The model ranks first on both [ZeroCLUE](https://www.cluebenchmarks.com/zeroclue.html) and [FewCLUE](https://www.cluebenchmarks.com/fewclue.html) (as of January 11, 2023).
 - **Simple and easy to use:** With Taskflow, three lines of code are enough for quick inference without any labeled data, and a single command starts text classification, making deployment easy and lowering the barrier to multi-task text classification.
 - **Efficient tuning ✊:** Developers can easily get started with data annotation and model training without any machine-learning background.
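The "simple and easy to use" bullet above mentions three-line usage through Taskflow. Below is a minimal sketch of that flow, reusing the schema and example sentence that appear later in this diff; it assumes paddlenlp is installed and that the zero-shot default model is used (no fine-tuned checkpoint required).

```python
# Minimal zero-shot classification via Taskflow, no labeled data needed.
# Schema and input text are taken from the examples later in this README diff.
from paddlenlp import Taskflow

schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议"]
cls = Taskflow("zero_shot_text_classification", schema=schema)
print(cls("中性粒细胞比率偏低"))
```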

@@ -114,7 +116,8 @@ python run_train.py \
     --disable_tqdm True \
     --metric_for_best_model macro_f1 \
     --load_best_model_at_end True \
-    --save_total_limit 1
+    --save_total_limit 1 \
+    --save_plm
 ```

 When running on GPU, you can specify the gpus argument for multi-GPU training:
@@ -143,7 +146,8 @@ python -u -m paddle.distributed.launch --gpus "0,1" run_train.py \
     --disable_tqdm True \
     --metric_for_best_model macro_f1 \
     --load_best_model_at_end True \
-    --save_total_limit 1
+    --save_total_limit 1 \
+    --save_plm
 ```

 Because the `--do_eval` flag is set in this example, evaluation runs automatically after training finishes.
@@ -156,7 +160,9 @@ python -u -m paddle.distributed.launch --gpus "0,1" run_train.py \
 * `seed`: global random seed, default 42.
 * `model_name_or_path`: pretrained model used for few-shot training. Default "utc-large".
 * `output_dir`: required; directory where the model is saved after training or compression; default `None`.
-* `dev_path`: path to the development set; default `None`.
+* `dataset_path`: directory containing the dataset files; default `./data/`.
+* `train_file`: training set file name; default `train.txt`.
+* `dev_file`: development set file name; default `dev.txt`.
 * `max_seq_len`: maximum sequence length; when the input, including the labels, exceeds this length the text is split automatically, while the label part is never split; default 512.
 * `per_device_train_batch_size`: batch size per GPU core/CPU for training, default 8.
 * `per_device_eval_batch_size`: batch size per GPU core/CPU for evaluation, default 8.
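As a quick illustration of how the new dataset arguments fit together, the sketch below assumes run_train.py simply joins `dataset_path` with the file-name arguments; the resolution logic shown here is an assumption, not something stated in this diff.

```python
# Illustration only: how dataset_path, train_file and dev_file are assumed
# to resolve to concrete paths under the new arguments.
import os

dataset_path = "./data/"
train_file = "train.txt"
dev_file = "dev.txt"

print(os.path.join(dataset_path, train_file))  # ./data/train.txt
print(os.path.join(dataset_path, dev_file))    # ./data/dev.txt
```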
@@ -204,7 +210,7 @@ python run_eval.py \
 >>> from pprint import pprint
 >>> from paddlenlp import Taskflow
 >>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"]
->>> my_cls = Taskflow("zero_shot_text_classification", schema=schema, task_path='./checkpoint/model_best', precision="fp16")
+>>> my_cls = Taskflow("zero_shot_text_classification", schema=schema, task_path='./checkpoint/model_best/plm', precision="fp16")
 >>> pprint(my_cls("中性粒细胞比率偏低"))
 ```

@@ -221,7 +227,7 @@ from paddlenlp import SimpleServer, Taskflow
 schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议"]
 utc = Taskflow("zero_shot_text_classification",
                schema=schema,
-               task_path="../../checkpoint/model_best/",
+               task_path="../../checkpoint/model_best/plm",
                precision="fp32")
 app = SimpleServer()
 app.register_taskflow("taskflow/utc", utc)