ProverGen is a novel framework that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers to create scalable, diverse, and high-quality First-Order Logic (FOL) reasoning datasets.
Using this framework, we have created ProverQA, a challenging FOL reasoning benchmark consisting of 1,500 evaluation instances and 5,000 training instances across three difficulty levels.
- Scalable Generation: Automated framework enabling expansion with minimal manual intervention
- Natural & Diverse Language: Wide range of natural language expressions reflecting real-world linguistic variability
- Symbolic Validation: Formal symbolic structures validated through the Prover9 symbolic prover
- Faithful Reasoning Chains: Transparent intermediate reasoning steps in both symbolic and natural language formats
- Complete FOL Coverage: All seven First-Order Logic relationships (∧, ∨, ¬, →, ↔, ∀, ∃)
conda create -n provergen python=3.8
conda activate provergen
git clone https://github.com/opendatalab/ProverGen
cd ./ProverGen
pip install -r requirements.txt

The ProverGen framework leverages Prover9 for logic skeleton generation. Download it from the Prover9 website (select LADR-2009-11A.tar.gz). For installation instructions, refer to the guide.
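A minimal build sketch, assuming the LADR-2009-11A.tar.gz archive has already been downloaded into the working directory (the make all target and the bin/ output directory follow the standard LADR-2009-11A layout):

# Unpack and build Prover9/Mace4 from the LADR-2009-11A release
tar -xzf LADR-2009-11A.tar.gz
cd LADR-2009-11A
make all
# The compiled binaries (prover9, mace4, ...) should appear under bin/
ls bin/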
Linux systems: Since Prover9 is outdated, you might encounter issues with make all. Navigate to LADR-2009-11A/provers.src/Makefile and move all -lm flags to the end of each line:
# Before
$(CC) $(CFLAGS) -lm -o newsax newsax.o $(OBJECTS) ../ladr/libladr.a
# After
$(CC) $(CFLAGS) -o newsax newsax.o $(OBJECTS) ../ladr/libladr.a -lm

macOS systems: You may encounter implicit declaration errors. Fix by:
- In LADR-2009-11A/mace4.src/select.c, line 236:
// Before
int select_concentric_band(min_id, max_id, max_constrained)
// After
int select_concentric_band(int min_id, int max_id, int max_constrained)

- In LADR-2009-11A/mace4.src/msearch.c, line 850:
// Before
int next_domain_size(n)
// After
int next_domain_size(int n)

After installation, update the binary locations in logic_skeleton_generator.py at line 20.
The ProverQA dataset created using this framework is available on:
- Hugging Face: opendatalab/ProverQA
- OpenDataLab: ProverQA
Dataset Structure:
- Development Set: 1,500 instances (500 easy, 500 medium, 500 hard)
- Training Set: 5,000 instances with data augmentation
- Difficulty Levels: Easy (1-2 steps), Medium (3-5 steps), Hard (6-9 steps)
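One hedged way to fetch the data locally is the huggingface_hub CLI (an assumption; any standard Hugging Face download path works, and the repository id matches the listing above):

pip install -U "huggingface_hub[cli]"
huggingface-cli download opendatalab/ProverQA --repo-type dataset --local-dir ProverQA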
The generation process consists of three main steps:
Generate the logical structure using symbolic provers with a novel top-down approach.
python3 logic_skeleton_generator.py --mode easy --num 500 --output_dir outputs/logic_data

Parameters:
- mode: Difficulty level (easy, medium, hard)
- num: Number of logic skeletons to generate
- output_dir: Output directory
The script also allows customization of the distribution of answers ([True, False, Uncertain]) and the proportion of composite conclusions.
Here are the relevant parameters:
- goal_value_probs: Distribution of [True, False, Uncertain] answers (e.g., [0.4, 0.3, 0.3]).
- rule_candidate_path: Path to the rule pool file.
- rule_as_goal_proportion: Proportion of fact vs. rule conclusions (e.g., [0.7, 0.3]).
- fact_num_threshold: If the fact pool size surpasses this threshold, there is a chance the fact will be provided directly.
- fact_num_prob: Probability of directly providing a fact.
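A hedged usage sketch, assuming these parameters are exposed as command-line flags of logic_skeleton_generator.py and that list-valued options are passed as space-separated numbers (the exact flag syntax may differ in the script; the values below are illustrative, not defaults prescribed by the repository):

python3 logic_skeleton_generator.py \
    --mode medium --num 500 \
    --output_dir outputs/logic_data \
    --goal_value_probs 0.4 0.3 0.3 \
    --rule_as_goal_proportion 0.7 0.3 \
    --fact_num_threshold 5 \
    --fact_num_prob 0.5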
Convert logic expressions into natural language using LLMs.
python3 logic_skeleton_translator.py \
--model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
--data_dir outputs/logic_data \
--num 100 --start 0 --end 100 \
--output_dir outputs/translated_data \
--mode hard \
--base_url localhost:6417 --api_key EMPTY

Parameters:
- model_name: LLM for translation
- data_dir: Path to the logic skeleton files produced in Step 1.
- num: Total number of logic skeletons
- start/end: Index range for processing
- output_dir: Directory to store the output file.
- mode: Difficulty level
- base_url/api_key: For local models
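The base_url/api_key pattern points at an OpenAI-compatible local endpoint. One hedged way to provide such an endpoint is vLLM's OpenAI-compatible server (vLLM is an assumption, not something the repository prescribes; the port simply matches the example command above):

python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --port 6417 \
    --tensor-parallel-size 4
# --tensor-parallel-size should match the number of available GPUs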
Generate complete FOL problems with optional data augmentation.
python3 fol_problem_generator.py \
--model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
--filepath outputs/translated_data/hard-100-0_100.json \
--start 0 --end 100 \
--output_dir outputs/final_data \
--mode normal_generation

Parameters:
- model_name: LLM for generation.
- filepath: Path to the translated files produced in Step 2.
- start/end: Index range for processing
- output_dir: Directory to store the output file.
- mode: Generation mode (normal_generation, step_augment, uncertain_augment)
- base_url/api_key: For local models
- noise1/noise2: Control different types of distractions
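As a sketch, the augmented training variants can presumably be produced by switching the mode flag while keeping the other arguments, e.g. step_augment over the same translated file (this assumes the augmentation modes accept the same argument set as normal_generation):

python3 fol_problem_generator.py \
    --model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
    --filepath outputs/translated_data/hard-100-0_100.json \
    --start 0 --end 100 \
    --output_dir outputs/final_data \
    --mode step_augment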
Evaluate LLM performance on the generated datasets:
python3 evaluation.py \
--model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
--dataset_name ProverQA --split easy \
--output_dir result/ \
--mode Direct \
--start 0 --end 500 \
--base_url http://localhost:6417/v1 --api_key EMPTY

Parameters:
- model_name: LLM for evaluation.
- dataset_name: Dataset to evaluate (ProverQA, FOLIO, ProntoQA, ProofWriter)
- split: Subset (easy, medium, hard for ProverQA; dev for others)
- output_dir: Directory to store the output file.
- mode: Evaluation strategy (Direct or CoT)
- start/end: Index range for evaluation.
- base_url/api_key: For local models
- trained_model: Path to fine-tuned model (optional)
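For example, a CoT run on the hard split against a fine-tuned checkpoint might look like the sketch below (the checkpoint path is a placeholder; only the flags listed above are assumed to exist):

python3 evaluation.py \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dataset_name ProverQA --split hard \
    --output_dir result/ \
    --mode CoT \
    --start 0 --end 500 \
    --base_url http://localhost:6417/v1 --api_key EMPTY \
    --trained_model /path/to/finetuned-checkpoint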
Compute metrics using metric.py on the generated result files.
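A hedged invocation sketch; the --filepath flag name is an assumption about metric.py's interface, and RESULT_FILE stands in for whichever result file evaluation.py wrote to result/:

python3 metric.py --filepath result/RESULT_FILE.json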
State-of-the-art models show significant room for improvement on ProverQA:
| Model | Easy | Medium | Hard |
|---|---|---|---|
| Proprietary Models | | | |
| GPT-4o (CoT) | 94.2% | 79.4% | 50.0% |
| Claude-3.5-Sonnet (CoT) | 95.2% | 83.6% | 56.4% |
| o1-preview-2024-09-12 | 89.8% | 78.8% | 66.2% |
| Open Models | | | |
| Llama3.1-70B (CoT) | 90.4% | 73.2% | 46.8% |
| Llama3.1-8B (CoT) | 75.6% | 46.6% | 33.6% |
| Mistral-Large (CoT) | 92.6% | 75.8% | 52.2% |
| DeepSeek-R1 | 91.8% | 78.4% | 66.6% |
Key observations:
- Challenging Hard Subset: Even top models struggle with 6-9 step reasoning
- Reasoning Models Excel: o1 and DeepSeek-R1 show strong performance on complex problems
- CoT Helps: Chain-of-thought prompting provides significant improvements
- Fine-grained Difficulty: Clear performance degradation across difficulty levels
If you use ProverGen or ProverQA in your research, please cite:
@inproceedings{
qi2025large,
title={Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation},
author={Chengwen Qi and Ren Ma and Bowen Li and He Du and Binyuan Hui and Jinwang Wu and Yuanjun Laili and Conghui He},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=C25SgeXWjE}
}

- Open Source: Framework and dataset are publicly available
- Compliance: Dataset sources follow public repository licenses (MIT for names, WordNet for keywords)
- No Human Data: No human participants involved in data collection
- Safety: Dataset contains no harmful or biased content
- Academic Purpose: Designed for research in advancing logical reasoning capabilities
For questions, issues, or collaborations:
- Chengwen Qi: [email protected]
- Ren Ma: [email protected]
- Bowen Li: [email protected]
GitHub Issues: https://github.com/opendatalab/ProverGen/issues
