GP-GPT: Large Language Model for Gene-Phenotype Mapping

📅 2024-09-15
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
This study addresses the challenges of modeling genotype–phenotype relationships and integrating heterogeneous, multi-source genomic data in biomedicine. It introduces GP-GPT, which the authors describe as the first large language model (LLM) specialized for gene–phenotype knowledge representation and genomics relation analysis. Built on the Llama architecture, GP-GPT is trained with a two-stage supervised fine-tuning procedure, and the authors examine how the representations of bio-factor entities change within the model. Its training corpus comprises over 3,000,000 terms from genomics, proteomics, and medical genetics, drawn from multiple large-scale validated datasets and scientific publications. On genomic information retrieval and relationship determination tasks, GP-GPT outperforms Llama2, Llama3, and GPT-4, and it shows potential to support research on genetic disease relations. The work suggests that LLMs adapted to gene–phenotype knowledge could enable more accurate and efficient analysis in genomics and medical genetics.

📝 Abstract
Pre-trained large language models (LLMs) have attracted increasing attention in biomedical domains due to their success in natural language processing. However, the complex traits and heterogeneity of multi-source genomics data pose significant challenges when adapting these models to the bioinformatics and biomedical fields. To address these challenges, we present GP-GPT, the first specialized large language model for genetic-phenotype knowledge representation and genomics relation analysis. Our model is fine-tuned in two stages on a comprehensive corpus composed of over 3,000,000 terms in genomics, proteomics, and medical genetics, derived from multiple large-scale validated datasets and scientific publications. GP-GPT demonstrates proficiency in accurately retrieving medical genetics information and performing common genomics analysis tasks, such as genomics information retrieval and relationship determination. Comparative experiments across domain-specific tasks reveal that GP-GPT outperforms state-of-the-art LLMs, including Llama2, Llama3, and GPT-4. These results highlight GP-GPT's potential to enhance genetic disease relation research and facilitate accurate and efficient analysis in the fields of genomics and medical genetics. Our investigation also reveals subtle changes in the representations of bio-factor entities within GP-GPT, suggesting opportunities for applying LLMs to advance gene-phenotype research.
Problem

Research questions and friction points this paper is trying to address.

Adapting LLMs to complex, heterogeneous multi-source genomics data
Developing specialized model for gene-phenotype knowledge representation
Enhancing accuracy in medical genetics information retrieval tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Specialized LLM for gene-phenotype mapping
Two-stage fine-tuning on genomics corpus
Outperforms Llama2, Llama3, and GPT-4 on domain-specific genomics tasks
Authors
Yanjun Lyu
PhD Student of Computer Science, University of Texas at Arlington
Zihao Wu
University of Georgia
Brain-inspired AI, Artificial General Intelligence, NLP, Medical Image Analysis
Lu Zhang
Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX 76015, USA; Department of Computer Science, Indiana University Indianapolis, IN 46202, USA
Jing Zhang
Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX 76015, USA
Yiwei Li
School of Computing, University of Georgia, Athens, GA 30602, USA
Wei Ruan
University of Georgia
Zhengliang Liu
University of Georgia
Natural Language Processing, Medical NLP, Medical Image Analysis, Data Visualization
Xiaowei Yu
Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX 76015, USA
Chao Cao
Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX 76015, USA
Tong Chen
Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX 76015, USA
Minheng Chen
University of Texas at Arlington
Medical Image Analysis, Computational Neuroscience, Image Registration
Zhuang Yan
Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX 76015, USA
Xiang Li
Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02115, USA
Rongjie Liu
Department of Statistics, University of Georgia, Athens, GA 30602, USA
Chao Huang
Department of Epidemiology & Biostatistics, University of Georgia, GA 30602, USA
Wentao Li
Department of Environmental Health Science, University of Georgia, GA 30602, USA
Tianming Liu
Distinguished Research Professor of Computer Science, University of Georgia
Brain, Brain-Inspired AI, LLM, Artificial General Intelligence, Quantum AI
Dajiang Zhu
University of Texas at Arlington
Computer Science, Computational Neuroscience, Medical Imaging