🤖 AI Summary
To address the challenge of interpretable phenotype prediction from high-dimensional genotype data under few-shot settings, this paper proposes FREEFORM—a knowledge-driven framework. FREEFORM leverages intrinsic biomedical knowledge encoded in pre-trained large language models (e.g., LLaMA, BioMedLM) via prompt engineering and chain-of-thought reasoning to guide variant selection and multi-path feature construction. It further introduces an ensemble evaluation mechanism for knowledge-guided, structured feature engineering on genotype data. Unlike conventional purely data-driven approaches, FREEFORM explicitly integrates domain knowledge into the feature learning pipeline. Evaluated on two real-world datasets—genetic ancestry and hereditary hearing loss—FREEFORM achieves substantial improvements over state-of-the-art methods, with AUC gains of up to 12.3% in low-sample regimes. The implementation is publicly available.
📝 Abstract
Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data, using a novel knowledge-driven framework. We develop FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), designed with chain-of-thought and ensembling principles, to select and engineer features using the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.
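To make the ensembling principle concrete, the sketch below shows one simple way that variant lists proposed by several independent LLM reasoning runs could be merged by majority vote. It is an illustration under stated assumptions, not the FREEFORM implementation: the function name `ensemble_select`, the vote-counting rule, and the placeholder variant names are all hypothetical, and the LLM calls themselves are replaced by hard-coded lists.

```python
from collections import Counter


def ensemble_select(candidate_sets, k):
    """Return the k variants proposed most often across candidate sets.

    Hypothetical helper: each candidate set stands in for the variants
    chosen by one independent chain-of-thought run of an LLM.
    """
    votes = Counter()
    for variants in candidate_sets:
        # Each run contributes at most one vote per variant.
        votes.update(set(variants))
    # Sort by descending vote count, then by name so ties break deterministically.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Placeholder outputs of three runs (made-up variant names, not real data).
runs = [
    ["snp_a", "snp_b", "snp_c"],
    ["snp_b", "snp_c", "snp_d"],
    ["snp_c", "snp_b", "snp_e"],
]
selected = ensemble_select(runs, k=2)
print(selected)
```

Voting across runs is one common way to reduce the variance of any single LLM response; the selected variants would then feed a downstream classifier trained on the few available samples.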