This repo provides the source code & data of our paper: ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models (Arxiv 2023).
@InProceedings{Chen-ChatCot-2023,
title = {ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models},
author = {Zhipeng Chen and Kun Zhou and Beichen Zhang and Zheng Gong and Wayne Xin Zhao and Ji-Rong Wen},
year = {2023},
eprint = {2305.14323},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
math/: the code about ChatCoT on MATH datasetdemo/: the demos used in in-context learning on few-shot settingmath/result/: the result files of different methodsmath/scripts/: the running scripts of different methodsmath/ablation: the code of ablation studymath/self_consistency: the code to explore combining ChatCoT with CoT improvement strategies
The hotpotqa/ folder is similar with math/ folder
You can use following scripts to install related python package through pip:
git clone https://github.com/RUCAIBox/ChatCoT.git
cd ChatCoT
pip install -r requirements.txt
You can run ChatCot on the sub-task of MATH dataset by running run_turbo_chatcot.sh:
cd math
bash scripts/run_turbo_chatcot.sh
You have to replace YOUR_API_KEY with you openai api key in the code. Specially, we run ChatCoT through multi-processing, and you should prepare a list of api key in order to run the code correctly.
You can evaluate the results by running eval.sh:
cd math
bash scripts/eval.sh
| Methods | Algebra | CP | PC | PA | Geometry | IA | NT |
|---|---|---|---|---|---|---|---|
| CoT | 48.10 | 31.43 | 21.06 | 56.60 | 22.34 | 18.27 | 29.07 |
| CoT w/ Tool | 35.89 | 22.57 | 9.34 | 40.53 | 13.57 | 9.41 | 19.44 |
| CoT w/ Retri | 52.74 | 32.70 | 18.86 | 58.44 | 29.23 | 19.93 | 31.67 |
| ChatCoT | 56.11 | 34.18 | 23.81 | 59.24 | 29.85 | 19.49 | 32.59 |
| Methods | HotpotQA |
|---|---|
| CoT | 37.99 |
| CoT w/ Tool | 31.42 |
| ChatCoT w/o Feedback | 53.79 |
| ChatCoT | 59.16 |
| Methods | PC | Geo | NT |
|---|---|---|---|
| ChatCoT | 23.81 | 29.85 | 32.59 |
| ChatCoT w/o TK | 23.26 | 29.23 | 30.56 |
| ChatCoT w/o RATK | 19.96 | 27.35 | 30.93 |
| ChatCoT w/o MRF | 21.61 | 24.22 | 32.22 |
The results of ablation study. TK, RATK, and MRF denote if using tool knowledge, retrieval-augmented task knowledge, and multi-turn reasoning format at early turns of the conversation, respectively.
| Methods | CP | NT |
|---|---|---|
| CoT | 31.43 | 29.07 |
| CoT + SC | 35.23 | 34.44 |
| ChatCoT | 34.18 | 32.59 |
| ChatCoT + SC | 40.08 | 38.33 |
