
GRATH: Gradual Self-Truthifying for Large Language Models

This repository is an official implementation of the ICML 2024 paper GRATH: Gradual Self-Truthifying for Large Language Models.

[Figure: overview of the GRATH framework]

We propose GRAdual self-truTHifying (GRATH), a novel post-processing method for enhancing the truthfulness of LLMs. GRATH uses out-of-domain question prompts to generate pairwise truthfulness training data, where each pair consists of a question together with a correct and an incorrect answer, and then optimizes the model via direct preference optimization (DPO) to learn from the truthfulness difference within each answer pair. GRATH iteratively refines the truthfulness data and updates the model, gradually improving model truthfulness in a self-supervised manner.
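Concretely, with x a question, y_w its correct answer, and y_l its incorrect answer, DPO minimizes the standard preference objective of Rafailov et al. (2023):

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

where \pi_\theta is the model being trained, \pi_{\mathrm{ref}} is the reference model, \sigma is the logistic function, and \beta controls the strength of the implicit KL regularization toward \pi_{\mathrm{ref}}.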

Overview

Run GRATH using Llama2-7B as the pretrained base model

./run_llama2.sh # DPO's reference model is the current base model (corresponds to the main results in the paper)
./run_llama2_fix_ref.sh # DPO's reference model is fixed to the pretrained base model (recommended; helps avoid overfitting)

Run GRATH using Zephyr-7B as the pretrained base model

./run_zephyr.sh # DPO's reference model is the current base model (corresponds to the main results in the paper)
./run_zephyr_fix_ref.sh # DPO's reference model is fixed to the pretrained base model (recommended; helps avoid overfitting)
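Both variants follow the same loop; they differ only in which model DPO regularizes against. A schematic sketch in Python (the helper functions and variable names here are hypothetical stand-ins for create_pair_responses.py, combine_new_correct_ori_wrong.py, and dpo.py, not functions from this repo):

# Schematic GRATH loop; both helpers are hypothetical placeholders.
def generate_pairwise_data(model): ...          # wraps create_pair_responses.py + the combining step
def dpo_finetune(model, ref_model, data): ...   # wraps dpo.py

base_model = "meta-llama/Llama-2-7b-hf"  # pretrained base model
model = base_model
fix_ref = True                           # True reproduces the *_fix_ref.sh variants
num_iterations = 2
for step in range(num_iterations):
    data = generate_pairwise_data(model)    # self-generated truthfulness pairs
    ref = base_model if fix_ref else model  # fixed vs. current reference model
    model = dpo_finetune(model, ref_model=ref, data=data)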

Evaluation

We evaluate the trained models using the evaluation backend of the 🤗 Hugging Face Open LLM Leaderboard.

./evaluation.sh 
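evaluation.sh wraps the leaderboard's backend, EleutherAI's lm-evaluation-harness. To evaluate a single task by hand, a roughly equivalent harness command is shown below (CLI from recent lm-eval releases; flags may differ from the version the script uses):

lm_eval --model hf --model_args pretrained=$model_name_or_path$ --tasks truthfulqa_mc2 --batch_size 8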

Details

Creating pairwise truthfulness data

python create_pair_responses.py --model_name_or_path $model_name_or_path$ --save_prefix $save_prefix$ --useFS # corresponds to the main results in the paper
python create_pair_responses.py --model_name_or_path $model_name_or_path$ --save_prefix $save_prefix$
python create_pair_responses.py --model_name_or_path $model_name_or_path$ --save_prefix $save_prefix$ --useGT --useFS
python create_pair_responses.py --model_name_or_path $model_name_or_path$ --save_prefix $save_prefix$ --useGT
  • Setting --useFS prompts the model to generate pairwise truthfulness data in the few-shot setting.
  • Setting --useGT includes the ground-truth annotated answers in the prompts during generation.
  • An example of $save_prefix$: "iter0_arcc".
  • Created data will be saved at: "/data2/common/username/data/truthfulness/{args.save_prefix}_num_{str(args.num_query)}_useGT_{str(args.useGT)}_useFS_{str(args.useFS)}.json" (an illustrative record is sketched below).
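For illustration, one record of the created data might look like the following (field names here are hypothetical; create_pair_responses.py defines the actual schema):

# One illustrative pairwise truthfulness record (hypothetical field names).
record = {
    "question": "What happens if you swallow gum?",
    "correct_answer": "It passes through the digestive system and is excreted.",
    "incorrect_answer": "It stays in your stomach for seven years.",
}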

Combining data

python combine_new_correct_ori_wrong.py --new_data_path $new_data_path$ --ori_data_path $ori_data_path$
  • This step replaces the correct answers in the pairwise truthfulness data with the newly generated ones; this is step one of gradual self-truthifying (see the sketch below).
  • Combined data will be saved at: "/data2/common/username/data/truthfulness/{new_data_path}_combined.json"
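A minimal sketch of the combination logic, assuming the hypothetical field names from the record example above (the script defines the real schema and paths):

import json

new_data_path = "iter1_arcc_num_xxx_useGT_xxx_useFS_xxx.json"  # pairs from the current model
ori_data_path = "iter0_arcc_num_xxx_useGT_xxx_useFS_xxx.json"  # pairs from the previous iteration

with open(new_data_path) as f:
    new_data = json.load(f)
with open(ori_data_path) as f:
    ori_data = json.load(f)

# Keep each question and its original incorrect answer; swap in the newly generated correct answer.
combined = [
    {
        "question": ori["question"],
        "correct_answer": new["correct_answer"],
        "incorrect_answer": ori["incorrect_answer"],
    }
    for new, ori in zip(new_data, ori_data)
]

with open(f"{new_data_path}_combined.json", "w") as f:
    json.dump(combined, f, indent=2)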

Conducting DPO

accelerate launch dpo.py --dataset_path $dataset_path$ --model_name_or_path $model_name_or_path$ --output_dir $output_dir$ --seed $seed$ # fine-tunes the model on the self-created data saved at $dataset_path$
accelerate launch dpo.py --dataset_name $dataset_name$ --model_name_or_path $model_name_or_path$ --output_dir $output_dir$ --seed $seed$ # fine-tunes the model on an existing HF dataset named $dataset_name$
  • The fine-tuned model will be saved to $output_dir$ (an example: "iter_xxx_arcc_num_xxx_useGT_xxx_useFS_xxx/step_xxx")
  • NOTE: This code builds on the huggingface/trl repository (see the sketch below).
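A stripped-down sketch of what a trl-based DPO setup looks like (a minimal sketch against a recent trl API; argument names such as DPOConfig vary across trl versions and may not match the one this repo pins):

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name_or_path = "meta-llama/Llama-2-7b-hf"  # or a checkpoint from a previous iteration
dataset_path = "iter0_arcc_combined.json"        # self-created pairwise data

model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
ref_model = AutoModelForCausalLM.from_pretrained(model_name_or_path)  # the fix-ref variant always loads the pretrained base here
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# DPO expects "prompt", "chosen" (correct answer), and "rejected" (incorrect answer) columns.
dataset = load_dataset("json", data_files=dataset_path, split="train")

config = DPOConfig(output_dir="output", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(model=model, ref_model=ref_model, args=config,
                     train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
trainer.save_model(config.output_dir)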

Created data and fine-tuned models

The pairwise truthfulness training data used in our paper is available here.

The truthified models are available here:

  • GRATH-selftruth is a self-truthified model (DPO applied once). DPO's reference model is set to the current base model (i.e., the pretrained base model, since this is the first iteration). This model reaches good performance on the TruthfulQA MC tasks and retains good generative capability.
  • GRATH-gradtruth is a gradually self-truthified model with one iteration (DPO applied twice). DPO's reference model is set to the current base model. This model reaches high performance on the TruthfulQA MC tasks efficiently, but its generative capability may decline.
  • GRATH-gradtruth-fixref is a gradually self-truthified model with nine iterations (DPO applied ten times). DPO's reference model is fixed to the pretrained base model to avoid overfitting. This model reaches high performance on the TruthfulQA MC tasks while retaining good generative capability, but it takes much longer to train.
