GP-GPT: Large Language Model for Gene-Phenotype Mapping

📅 2024-09-15

🏛️ arXiv.org

📈 Citations: 7

✨ Influential: 0

🤖 AI Summary

This study addresses the challenges of modeling genotype–phenotype relationships and integrating heterogeneous, multi-source genomic data in biomedicine. It introduces GP-GPT, which the authors describe as the first large language model (LLM) specialized for gene–phenotype knowledge representation and genomic relation analysis. Built upon the Llama architecture, GP-GPT is fine-tuned in two supervised stages on a corpus of over 3,000,000 terms from genomics, proteomics, and medical genetics, drawn from multiple large-scale validated datasets and scientific publications. On domain-specific tasks such as genomic information retrieval and relationship determination, GP-GPT outperforms Llama2, Llama3, and GPT-4. The authors also examine how the representations of bio-factor entities change within the model, and suggest that these shifts point to opportunities for applying LLMs to gene–phenotype research and genetic disease relation analysis.

📝 Abstract

Pre-trained large language models (LLMs) have attracted increasing attention in biomedical domains due to their success in natural language processing. However, the complex traits and heterogeneity of multi-source genomics data pose significant challenges when adapting these models to the bioinformatics and biomedical fields. To address these challenges, we present GP-GPT, the first specialized large language model for genetic-phenotype knowledge representation and genomics relation analysis. Our model is fine-tuned in two stages on a comprehensive corpus composed of over 3,000,000 terms in genomics, proteomics, and medical genetics, derived from multiple large-scale validated datasets and scientific publications. GP-GPT demonstrates proficiency in accurately retrieving medical genetics information and performing common genomics analysis tasks, such as genomics information retrieval and relationship determination. Comparative experiments across domain-specific tasks reveal that GP-GPT outperforms state-of-the-art LLMs, including Llama2, Llama3, and GPT-4. These results highlight GP-GPT's potential to enhance genetic disease relation research and facilitate accurate and efficient analysis in the fields of genomics and medical genetics. Our investigation also reveals subtle changes in the representations of bio-factor entities within GP-GPT, which suggest opportunities for applying LLMs to advance gene-phenotype research.

Problem

Research questions and friction points this paper is trying to address.

Addressing challenges in adapting LLMs to genomics data complexity

Developing specialized model for gene-phenotype knowledge representation

Enhancing accuracy in medical genetics information retrieval tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Specialized LLM for gene-phenotype mapping

Two-stage fine-tuning on genomics corpus

Outperforms Llama2, Llama3, and GPT-4 on domain-specific genomics tasks

🔎 Similar Papers

High-Throughput Phenotyping of Clinical Text Using Large Language Models