Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the challenge of interpretable phenotype prediction from high-dimensional genotype data under few-shot settings, this paper proposes FREEFORM—a knowledge-driven framework. FREEFORM leverages intrinsic biomedical knowledge encoded in pre-trained large language models (e.g., LLaMA, BioMedLM) via prompt engineering and chain-of-thought reasoning to guide variant selection and multi-path feature construction. It further introduces an ensemble evaluation mechanism for knowledge-guided, structured feature engineering on genotype data. Unlike conventional purely data-driven approaches, FREEFORM explicitly integrates domain knowledge into the feature learning pipeline. Evaluated on two real-world datasets—genetic ancestry and hereditary hearing loss—FREEFORM achieves substantial improvements over state-of-the-art methods, with AUC gains of up to 12.3% in low-sample regimes. The implementation is publicly available.

📝 Abstract
Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data, using a novel knowledge-driven framework. We develop FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), designed with chain-of-thought and ensembling principles, to select and engineer features using the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.
Problem

Research questions and friction points this paper is trying to address.

Predicting phenotypes from complex genetic bases using few interpretable features
Overcoming high dimensionality challenges in genotype data analysis
Leveraging LLMs for knowledge-driven feature selection and engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based knowledge-driven feature selection
Chain-of-thought and ensembling principles
Open-source FREEFORM framework for genotypes
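
The ensembling idea named above can be illustrated with a small, hedged sketch. This is not FREEFORM's implementation: it assumes that several independent LLM selection runs have already been parsed into lists of variant IDs, and it simply keeps the variants named most often. The function name `ensemble_select` and the variant IDs (`rs1`, `rs2`, ...) are hypothetical placeholders.

```python
from collections import Counter
from typing import Iterable, List


def ensemble_select(runs: Iterable[List[str]], k: int) -> List[str]:
    """Return the k features named most often across independent selection runs."""
    votes: Counter = Counter()
    for selected in runs:
        # dict.fromkeys removes duplicates while keeping order,
        # so each feature gets at most one vote per run.
        for feature in dict.fromkeys(selected):
            votes[feature] += 1
    # most_common sorts by vote count, highest first.
    return [feature for feature, _ in votes.most_common(k)]


# Hypothetical variant IDs standing in for parsed LLM responses (assumption).
runs = [
    ["rs1", "rs2", "rs3"],
    ["rs3", "rs4"],
    ["rs3", "rs2", "rs5"],
]
selected = ensemble_select(runs, k=2)
print(selected)
```

Majority voting is only one possible way to combine runs; it is used here because it makes the selection less sensitive to any single response.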
Joseph Lee
University of Pennsylvania, Philadelphia, USA
Shu Yang
University of Pennsylvania, Philadelphia, USA
Jae Young Baik
Undergraduate Researcher, University of Pennsylvania
Xiaoxi Liu
RIKEN, Yokohama, Japan
Zhen Tan
Ph.D. at Arizona State University
Data Mining, Machine Learning, AI for Science, User-centric Explanation, Responsible AI
Dawei Li
Arizona State University, Tempe, USA
Zixuan Wen
University of Pennsylvania, Philadelphia, USA
Bojian Hou
Meta
Machine Learning, Artificial Intelligence, Trustworthy (Gen)AI, Large Language Model, HealthTech
D. Duong-Tran
United States Naval Academy, Annapolis, USA
Tianlong Chen
Assistant Professor, CS@UNC Chapel Hill; Chief AI Scientist, hireEZ
Machine Learning, AI4Science, Computer Vision, Sparsity
Li Shen
University of Pennsylvania, Philadelphia, USA