PennShenLab/FREEFORM

FREEFORM: Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling

This repository holds the official code for the paper Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models, in which we present the FREEFORM framework.

🎯 Abstract

Predicting phenotypes with complex genetic bases based on a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are utilized for this task, yet the high-dimensional nature of genotype data makes the analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs in feature selection and engineering for tabular genotype data, with a novel knowledge-driven framework. We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub.

📝 Requirements

The algorithm is implemented in Python. To install the required packages, run:

conda env create -f environment.yml
conda activate freeform

🔨 Usage

To use our framework, see the demonstration.ipynb notebook, which walks through an example pipeline built from the functions defined in utils.py, utils_selection.py, and utils_engineering.py. To replicate our results, refer to the notebooks with "evaluation" in the filename.
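
The abstract describes FREEFORM as built on chain-of-thought and ensembling principles. As a rough illustration of the ensembling idea only, the sketch below aggregates several candidate feature lists by vote and keeps the most frequently named features. It is not the code in utils_selection.py: the function name ensemble_select, the vote-counting rule, and the variant identifiers are all assumptions made up for this example, so consult demonstration.ipynb for the actual pipeline.

```python
from collections import Counter
from typing import Iterable, List


def ensemble_select(candidate_lists: Iterable[List[str]], k: int) -> List[str]:
    """Return the k features named most often across candidate lists.

    Hypothetical helper for illustration; not part of this repository.
    """
    votes = Counter()
    for features in candidate_lists:
        # Count each feature at most once per list so one run cannot dominate.
        votes.update(set(features))
    # Sort by descending vote count, breaking ties alphabetically for determinism.
    ranked = sorted(votes, key=lambda f: (-votes[f], f))
    return ranked[:k]


# Made-up variant identifiers standing in for three separate selection runs.
runs = [
    ["rs1", "rs2", "rs3"],
    ["rs2", "rs3", "rs4"],
    ["rs3", "rs2", "rs5"],
]
selected = ensemble_select(runs, k=2)
```

Counting each feature once per run and breaking ties alphabetically keeps the result deterministic, which makes a selection step like this easier to compare across repeated experiments.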

🤝 Acknowledgements

This work was supported in part by the NIH grants U01 AG066833, U01 AG068057, R01 AG071470, U19 AG074879, and S10 OD023495.

📭 Maintainers

📚 Citation

@article{FreeForm,
      title={Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models}, 
      author={Joseph Lee and Shu Yang and Jae Young Baik and Xiaoxi Liu and Zhen Tan and Dawei Li and Zixuan Wen and Bojian Hou and Duy Duong-Tran and Tianlong Chen and Li Shen},
      year={2025},
      journal={AMIA Informatics Summit},
}
