🤖 AI Summary
This study addresses the challenges of modeling genotype–phenotype relationships and integrating heterogeneous, multi-source genomic data in biomedicine. We introduce GP-GPT, the first large language model (LLM) specifically designed for gene–phenotype knowledge representation and genomic relation analysis. Built upon the Llama architecture, GP-GPT is trained with two-stage supervised fine-tuning, and the study also examines how the representations of biological-factor entities change within the model. Its training corpus comprises over 3 million terms from genomics, proteomics, and medical genetics, drawn from large-scale validated datasets and scientific publications. On genomic information retrieval and relationship determination tasks, GP-GPT outperforms Llama2, Llama3, and GPT-4, and it shows potential for supporting research on genetic disease relations. This work adapts LLMs to gene–phenotype knowledge representation and points to opportunities for applying them in genomics and medical genetics.
📝 Abstract
Pre-trained large language models (LLMs) have attracted increasing attention in biomedical domains due to their success in natural language processing. However, the complex traits and heterogeneity of multi-source genomics data pose significant challenges when adapting these models to the bioinformatics and biomedical fields. To address these challenges, we present GP-GPT, the first specialized large language model for genetic-phenotype knowledge representation and genomics relation analysis. Our model is fine-tuned in two stages on a comprehensive corpus composed of over 3,000,000 terms in genomics, proteomics, and medical genetics, derived from multiple large-scale validated datasets and scientific publications. GP-GPT demonstrates proficiency in accurately retrieving medical genetics information and performing common genomics analysis tasks, such as genomics information retrieval and relationship determination. Comparative experiments across domain-specific tasks reveal that GP-GPT outperforms state-of-the-art LLMs, including Llama2, Llama3, and GPT-4. These results highlight GP-GPT's potential to enhance genetic disease relation research and to facilitate accurate and efficient analysis in genomics and medical genetics. Our investigation also reveals subtle changes in the representations of bio-factor entities within GP-GPT, which suggest opportunities for applying LLMs to advance gene-phenotype research.