
🧪 FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees

👥 Authors

Fan Nie, Xiaotian Hou, Shuhang Lin, James Zou, Huaxiu Yao, Linjun Zhang

📰 News

  • 🎉 May 27, 2025: Source code released!
  • 🎉 May 28, 2025: Uploaded all four datasets!

🚀 Quick Start

This repository provides tools for testing factuality in Large Language Models with statistical guarantees. Follow the steps below to get started, using the ParaRel dataset as a running example.

📋 Prerequisites

# Clone the repository
git clone https://github.com/fannie1208/FactTest.git
cd FactTest

# Install dependencies
pip install -r requirements.txt

🔧 Usage

🎯 Step 1: Calibration

Navigate to the calibration directory:

cd calibration/pararel

📊 Calibration Dataset Construction

python collect_dataset.py --model openlm-research/open_llama_3b
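
For intuition, the calibration set can be thought of as pairing each question with the model's answer and a correctness label against the gold answer. The sketch below is purely illustrative and not the repo's code; query_model is a hypothetical stand-in for the generation loop inside collect_dataset.py.

import json

def query_model(question, num_try=15):
    # Hypothetical stand-in for the real sampling loop in
    # collect_dataset.py (e.g., num_try generations from open_llama_3b).
    return ["Paris"] * num_try

def build_calibration_set(examples):
    records = []
    for question, gold in examples:
        samples = query_model(question)
        answer = max(set(samples), key=samples.count)  # majority-vote answer
        records.append({
            "question": question,
            "samples": samples,
            "answer": answer,
            "correct": answer.strip().lower() == gold.strip().lower(),
        })
    return records

print(json.dumps(build_calibration_set([("The capital of France is", "Paris")]), indent=2))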

🎚️ Calibration and Threshold Selection

For the Vanilla Entropy score function:

Initial calibration run (computes and saves scores):

python calculate_vanilla_threshold.py \
    --model openlm-research/open_llama_3b \
    --alpha 0.05 \
    --num_try 15
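
For intuition, here is a minimal sketch of a vanilla entropy score (an illustration only, not the exact function in calculate_vanilla_threshold.py): sample num_try answers per question and take the Shannon entropy of the empirical answer distribution, so consistent answers receive a low score.

import math
from collections import Counter

# Hypothetical illustration of a vanilla entropy score: the Shannon
# entropy of the empirical distribution over num_try sampled answers.
def vanilla_entropy(samples):
    counts = Counter(s.strip().lower() for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

print(vanilla_entropy(["Paris"] * 15))                 # 0.0 -> consistent
print(vanilla_entropy(["Paris", "Lyon", "Nice"] * 5))  # ~1.10 -> uncertain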

Reusing Saved Scores:

After the initial run, you can quickly calculate thresholds for different alpha values using the stored scores:

python calculate_vanilla_threshold.py \
    --model openlm-research/open_llama_3b \
    --alpha 0.1 \
    --stored \
    --num_try 15

The --stored flag allows you to experiment with different significance levels without re-running the expensive model evaluation.
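
For a sense of how stored scores turn into a threshold, here is one standard finite-sample, distribution-free recipe (a conformal-style order statistic, shown as an assumption for illustration, not necessarily the exact estimator in calculate_vanilla_threshold.py): set tau from the scores of calibration questions the model answered incorrectly, so that the probability of answering when wrong is at most alpha.

import math

# Hypothetical sketch: choose tau so that P(score <= tau | answer is wrong)
# is at most alpha, using the order statistics of wrong-answer scores.
def select_threshold(wrong_scores, alpha):
    s = sorted(wrong_scores)
    k = math.floor((len(s) + 1) * alpha)
    if k < 1:
        return float("-inf")  # too few calibration points: always abstain
    return s[k - 1]

wrong = [0.05 * i for i in range(1, 20)]   # 19 stored wrong-answer scores
print(select_threshold(wrong, alpha=0.1))  # 2nd smallest score -> 0.10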

📈 Step 2: Evaluation

cd evaluation/pararel
python evaluate_vanilla.py \
    --model openlm-research/open_llama_3b \
    --num_try 15

📊 Calculate Evaluation Metrics

After evaluation, compute the metrics using:

python eval.py \
    --model openlm-research/open_llama_3b \
    --num_try 15 \
    --method vanilla \
    --tau <your_threshold>

💡 Note: Replace <your_threshold> with the threshold value obtained from the calibration step.
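
To make the role of tau concrete, below is a minimal sketch of the test-time decision rule implied by the calibrated threshold, reusing the hypothetical entropy score from the calibration section: answer with the majority sample when the score is at most tau, otherwise abstain.

import math
from collections import Counter

def vanilla_entropy(samples):
    counts = Counter(s.strip().lower() for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical decision rule: answer with the majority sample if the
# uncertainty score clears the calibrated threshold, otherwise abstain.
def facttest_predict(samples, tau):
    if vanilla_entropy(samples) <= tau:
        return max(set(samples), key=samples.count)
    return "I don't know"

print(facttest_predict(["Paris"] * 15, tau=0.3))                 # "Paris"
print(facttest_predict(["Paris", "Lyon", "Nice"] * 5, tau=0.3))  # abstains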


📖 Citation

If you find this work useful, please cite our paper:

@misc{nie2024facttest,
      title={FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees}, 
      author={Fan Nie and Xiaotian Hou and Shuhang Lin and James Zou and Huaxiu Yao and Linjun Zhang},
      year={2024},
      eprint={2411.02603},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.02603}, 
}
