🧪 FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees
Fan Nie | Xiaotian Hou | Shuhang Lin | James Zou | Huaxiu Yao | Linjun Zhang
- 🎉 May 27, 2025: Source code released!
- 🎉 May 28, 2025: All four datasets uploaded!
This repository provides tools for testing the factuality of Large Language Models with finite-sample, distribution-free statistical guarantees. Follow the steps below to get started, using ParaRel as an example.
# Clone the repository
git clone https://github.com/fannie1208/FactTest.git
cd FactTest
pip install -r requirements.txt
Navigate to the calibration directory and collect the model's generations for the calibration set:
cd calibration/pararel
python collect_dataset.py --model openlm-research/open_llama_3b
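Conceptually, this step gathers several sampled answers per ParaRel question so that certainty scores can be computed later. The sketch below shows what such sampling might look like with Hugging Face `transformers`; the prompt and generation settings are illustrative assumptions, not the actual interface of `collect_dataset.py`:

```python
# Illustrative only: sample multiple answers per question.
# The real collect_dataset.py may use different prompts and generation settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

question = "The capital of France is"  # a ParaRel-style cloze prompt (example)
inputs = tokenizer(question, return_tensors="pt").to(model.device)

# Draw several independent samples for this question (matches --num_try 15).
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=16,
    num_return_sequences=15,
)
answers = [
    tokenizer.decode(o[inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
    for o in outputs
]
print(answers)
```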
For the Vanilla Entropy score function:
Initial calibration run (computes and saves scores):
python calculate_vanilla_threshold.py \
--model openlm-research/open_llama_3b \
--alpha 0.05 \
--num_try 15
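The vanilla entropy score measures how much the model's sampled answers disagree: if the 15 tries keep producing the same answer, entropy is low and certainty is high. Below is a minimal sketch of such a score, assuming answers are compared after simple normalization (the repository's implementation may differ in details):

```python
from collections import Counter
import math

def vanilla_entropy(answers):
    """Entropy of the empirical distribution over sampled answers.

    Lower entropy means the model keeps giving the same answer,
    i.e. higher certainty.
    """
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Example: 13 of 15 samples agree -> low entropy (high certainty).
print(vanilla_entropy(["Paris"] * 13 + ["Lyon", "Nice"]))
```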
Reusing Saved Scores:
After the initial run, you can quickly calculate thresholds for different alpha values using the stored scores:
python calculate_vanilla_threshold.py \
--model openlm-research/open_llama_3b \
--alpha 0.1 \
--stored \
--num_try 15
The `--stored` flag lets you experiment with different significance levels without re-running the expensive model evaluation.
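Conceptually, the threshold `tau` is a finite-sample, distribution-free order statistic of the stored calibration scores, chosen so that the Type I error (certifying an answer the model does not actually know) stays below `alpha`. The sketch below illustrates one conformal-style way to pick such a threshold from cached scores for several significance levels; the variable names, the stand-in scores, and the exact rule are assumptions, not the script's actual logic:

```python
import numpy as np

def threshold_from_scores(cal_scores, alpha):
    """Pick tau so that, for a fresh 'unknown' question, P(score <= tau) <= alpha.

    Assumes cal_scores are entropy scores computed on calibration questions the
    model answered incorrectly; tau is then a finite-sample, distribution-free
    order statistic. The rule inside calculate_vanilla_threshold.py may differ.
    """
    scores = np.sort(np.asarray(cal_scores))
    n = len(scores)
    k = int(np.floor(alpha * (n + 1)))  # calibration scores allowed to fall below tau
    if k < 1:
        return -np.inf                  # too few calibration points to certify a threshold
    return scores[k - 1]

# Stand-in for the cached calibration scores that --stored reuses.
saved_scores = np.random.default_rng(0).exponential(size=500)
for alpha in (0.05, 0.1):
    print(alpha, threshold_from_scores(saved_scores, alpha))
```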
Next, switch to the evaluation directory and run the evaluation for the vanilla entropy method:
cd evaluation/pararel
python evaluate_vanilla.py \
--model openlm-research/open_llama_3b \
--num_try 15
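At test time the calibrated threshold turns the score into an answer/abstain decision: FactTest only answers when the entropy score is low enough. Here is a self-contained sketch of that decision rule, with illustrative function and variable names:

```python
from collections import Counter
import math

def answer_or_abstain(answers, tau):
    """Return the majority-vote answer if the entropy score certifies enough
    certainty, otherwise abstain (None). Illustrative decision rule only."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    score = -sum((c / total) * math.log(c / total) for c in counts.values())
    if score <= tau:
        return counts.most_common(1)[0][0]  # most frequent sampled answer
    return None                             # abstain: certainty not certified

print(answer_or_abstain(["Paris"] * 13 + ["Lyon", "Nice"], tau=0.5))
```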
After evaluation, compute the metrics using:
python eval.py \
--model openlm-research/open_llama_3b \
--num_try 15 \
--method vanilla \
--tau <your_threshold>
💡 Note: Replace `<your_threshold>` with the threshold value obtained from the calibration step.
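Typical metrics for a selective-answering setup are the fraction of questions the model chooses to answer and its accuracy on those answered questions. The sketch below computes both; the metrics actually reported by `eval.py` may be named or defined differently:

```python
def summarize(predictions, labels):
    """Report answer rate and accuracy among answered questions.

    predictions: model answers, with None meaning the model abstained.
    labels: gold answers. Illustrative metric definitions only.
    """
    answered = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    answer_rate = len(answered) / len(predictions)
    accuracy = sum(p == y for p, y in answered) / max(len(answered), 1)
    return answer_rate, accuracy

print(summarize(["Paris", None, "Berlin"], ["Paris", "Rome", "Madrid"]))
```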
If you find this work useful, please cite our paper:
@misc{nie2024facttest,
title={FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees},
author={Fan Nie and Xiaotian Hou and Shuhang Lin and James Zou and Huaxiu Yao and Linjun Zhang},
year={2024},
eprint={2411.02603},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.02603},
}