🧪 FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees
Fan Nie | Xiaotian Hou | Shuhang Lin | James Zou | Huaxiu Yao | Linjun Zhang
- 🎉 May 27, 2025: Source code released!
- 🎉 May 28, 2025: All four datasets uploaded!
This repository provides tools for testing the factuality of Large Language Models with finite-sample, distribution-free statistical guarantees. Follow the steps below to get started, using ParaRel as an example.
# Clone the repository
git clone https://github.com/fannie1208/FactTest.git
cd FactTest
pip install -r requirements.txt
Navigate to the calibration directory and collect the model's generations for the calibration set:
cd calibration/pararel
python collect_dataset.py --model openlm-research/open_llama_3b
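Conceptually, this step gathers several sampled answers per ParaRel question so that certainty scores can be computed later. The sketch below shows what such sampling might look like with Hugging Face `transformers`; the prompt and generation settings are illustrative assumptions, not the actual interface of `collect_dataset.py`:

```python
# Illustrative only: sample multiple answers per question.
# The real collect_dataset.py may use different prompts and generation settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

question = "The capital of France is"  # a ParaRel-style cloze prompt (example)
inputs = tokenizer(question, return_tensors="pt").to(model.device)

# Draw several independent samples for this question (matches --num_try 15).
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=16,
    num_return_sequences=15,
)
answers = [
    tokenizer.decode(o[inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
    for o in outputs
]
print(answers)
```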
For the Vanilla Entropy score function:
Initial calibration run (computes and saves scores):
python calculate_vanilla_threshold.py \
--model openlm-research/open_llama_3b \
--alpha 0.05 \
--num_try 15
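The vanilla entropy score measures how much the model's sampled answers disagree: if the 15 tries keep producing the same answer, entropy is low and certainty is high. Below is a minimal sketch of such a score, assuming answers are compared after simple normalization (the repository's implementation may differ in details):

```python
from collections import Counter
import math

def vanilla_entropy(answers):
    """Entropy of the empirical distribution over sampled answers.

    Lower entropy means the model keeps giving the same answer,
    i.e. higher certainty.
    """
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Example: 13 of 15 samples agree -> low entropy (high certainty).
print(vanilla_entropy(["Paris"] * 13 + ["Lyon", "Nice"]))
```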
Reusing Saved Scores:
After the initial run, you can quickly calculate thresholds for different alpha values using the stored scores:
python calculate_vanilla_threshold.py \
--model openlm-research/open_llama_3b \
--alpha 0.1 \
--stored \
--num_try 15
The `--stored` flag lets you experiment with different significance levels without re-running the expensive model evaluation.
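Conceptually, the threshold `tau` is a finite-sample, distribution-free order statistic of the stored calibration scores, chosen so that the Type I error (certifying an answer the model does not actually know) stays below `alpha`. The sketch below illustrates one conformal-style way to pick such a threshold from cached scores for several significance levels; the variable names, the stand-in scores, and the exact rule are assumptions, not the script's actual logic:

```python
import numpy as np

def threshold_from_scores(cal_scores, alpha):
    """Pick tau so that, for a fresh 'unknown' question, P(score <= tau) <= alpha.

    Assumes cal_scores are entropy scores computed on calibration questions the
    model answered incorrectly; tau is then a finite-sample, distribution-free
    order statistic. The rule inside calculate_vanilla_threshold.py may differ.
    """
    scores = np.sort(np.asarray(cal_scores))
    n = len(scores)
    k = int(np.floor(alpha * (n + 1)))  # calibration scores allowed to fall below tau
    if k < 1:
        return -np.inf                  # too few calibration points to certify a threshold
    return scores[k - 1]

# Stand-in for the cached calibration scores that --stored reuses.
saved_scores = np.random.default_rng(0).exponential(size=500)
for alpha in (0.05, 0.1):
    print(alpha, threshold_from_scores(saved_scores, alpha))
```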
Next, switch to the evaluation directory and run the evaluation for the vanilla entropy method:
cd evaluation/pararel
python evaluate_vanilla.py \
--model openlm-research/open_llama_3b \
--num_try 15
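At test time the calibrated threshold turns the score into an answer/abstain decision: FactTest only answers when the entropy score is low enough. Here is a self-contained sketch of that decision rule, with illustrative function and variable names:

```python
from collections import Counter
import math

def answer_or_abstain(answers, tau):
    """Return the majority-vote answer if the entropy score certifies enough
    certainty, otherwise abstain (None). Illustrative decision rule only."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    score = -sum((c / total) * math.log(c / total) for c in counts.values())
    if score <= tau:
        return counts.most_common(1)[0][0]  # most frequent sampled answer
    return None                             # abstain: certainty not certified

print(answer_or_abstain(["Paris"] * 13 + ["Lyon", "Nice"], tau=0.5))
```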
After evaluation, compute the metrics using:
python eval.py \
--model openlm-research/open_llama_3b \
--num_try 15 \
--method vanilla \
--tau <your_threshold>
💡 Note: Replace `<your_threshold>` with the threshold value obtained from the calibration step.
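Typical metrics for a selective-answering setup are the fraction of questions the model chooses to answer and its accuracy on those answered questions. The sketch below computes both; the metrics actually reported by `eval.py` may be named or defined differently:

```python
def summarize(predictions, labels):
    """Report answer rate and accuracy among answered questions.

    predictions: model answers, with None meaning the model abstained.
    labels: gold answers. Illustrative metric definitions only.
    """
    answered = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    answer_rate = len(answered) / len(predictions)
    accuracy = sum(p == y for p, y in answered) / max(len(answered), 1)
    return answer_rate, accuracy

print(summarize(["Paris", None, "Berlin"], ["Paris", "Rome", "Madrid"]))
```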
If you find this work useful, please cite our paper:
@misc{nie2024facttest,
title={FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees},
author={Fan Nie and Xiaotian Hou and Shuhang Lin and James Zou and Huaxiu Yao and Linjun Zhang},
year={2024},
eprint={2411.02603},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.02603},
}