Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

📅 2024-10-02

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

To address the challenge of interpretable phenotype prediction from high-dimensional genotype data in few-shot settings, this paper proposes FREEFORM, a knowledge-driven framework. FREEFORM uses the biomedical knowledge encoded in pre-trained large language models (e.g., LLaMA, BioMedLM), elicited through prompt engineering and chain-of-thought reasoning, to guide variant selection and multi-path feature construction. It also introduces an ensemble evaluation mechanism for knowledge-guided, structured feature engineering on genotype data. Unlike conventional, purely data-driven approaches, FREEFORM explicitly integrates domain knowledge into the feature learning pipeline. Evaluated on two real-world datasets (genetic ancestry and hereditary hearing loss), FREEFORM achieves substantial improvements over state-of-the-art methods, with AUC gains of up to 12.3% in low-sample regimes. The implementation is publicly available.

📝 Abstract

Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data, using a novel knowledge-driven framework. We develop FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), designed with chain-of-thought and ensembling principles, to select and engineer features using the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.

Problem

Research questions and friction points this paper is trying to address.

Predicting phenotypes with complex genetic bases from a small set of interpretable variant features

Overcoming high dimensionality challenges in genotype data analysis

Leveraging LLMs for knowledge-driven feature selection and engineering

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based knowledge-driven feature selection

Chain-of-thought and ensembling principles

Open-source FREEFORM framework for genotypes
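
The ensembling principle listed above can be illustrated with a small sketch. This is not the FREEFORM implementation: it only assumes that several independent LLM selection runs each return a list of candidate variants, and it aggregates them by simple vote counting. The function name `ensemble_select` and the variant IDs are hypothetical placeholders, and no LLM is actually called.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def ensemble_select(candidate_lists: Iterable[Sequence[str]], k: int) -> List[str]:
    """Keep the k variants named most often across independent selection runs."""
    votes = Counter()
    for run in candidate_lists:
        # Count each variant at most once per run so a single run cannot dominate.
        votes.update(set(run))
    # Sort by descending vote count, then alphabetically for a deterministic tie-break.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [variant for variant, _ in ranked[:k]]


# Hypothetical outputs of three separate selection runs (placeholder variant IDs).
runs = [
    ["snp_a", "snp_b", "snp_c"],
    ["snp_b", "snp_c", "snp_d"],
    ["snp_c", "snp_b", "snp_e"],
]
selected = ensemble_select(runs, k=2)
print(selected)
```

Voting across runs is one common way to reduce the variance of individual LLM responses; the actual aggregation used by FREEFORM may differ and should be checked against the repository.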

🔎 Similar Papers

GP-GPT: Large Language Model for Gene-Phenotype Mapping