Advancing bioinformatics with large language models: components, applications and perspectives

📅 2024-01-08
📈 Citations: 7
Influential: 0
🤖 AI Summary
This review surveys the application of large language models (LLMs) across bioinformatics, spanning genomics, transcriptomics, proteomics, single-cell biology, and drug discovery, where the field remains fragmented and lacks a unified methodological overview. It walks through the essential components of biomolecular LLMs: tokenization schemes for diverse biological data types, the Transformer architecture and its attention mechanism, and self-supervised pretraining paradigms. Key contributions: (1) a systematic mapping of LLM-driven bioinformatics applications to currently available foundation models (e.g., ESM, DNABERT, scGPT); (2) analysis of critical bottlenecks, including cross-omics transferability and data sparsity; and (3) practical, experience-based guidance for both LLM users and developers. The review aims to support reproducible, scalable deployment of AI in precision medicine and systems biology.

📝 Abstract
Large language models (LLMs) are a class of artificial intelligence models based on deep learning that achieve strong performance on a wide range of tasks, especially in natural language processing (NLP). Large language models typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled data using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we provide a comprehensive overview of the essential components of large language models (LLMs) in bioinformatics, spanning genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis. Key aspects covered include tokenization methods for diverse data types, the architecture of Transformer models, the core attention mechanism, and the pre-training processes underlying these models. Additionally, we introduce currently available foundation models and highlight their downstream applications across various bioinformatics domains. Finally, drawing from our experience, we offer practical guidance for both LLM users and developers, emphasizing strategies to optimize their use and foster further innovation in the field.
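Two of the components the abstract names, tokenization of biomolecular sequences and the Transformer's attention mechanism, can be sketched minimally. The sketch below uses overlapping k-mer tokenization (the scheme popularized by DNABERT-style models for DNA) and scaled dot-product attention over random stand-in embeddings; the function names, toy sequence, and dimensions are illustrative, not taken from the paper.

```python
import numpy as np

def kmer_tokenize(seq, k=3):
    """Split a nucleotide sequence into overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Core Transformer operation: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # pairwise token-token similarities
    return softmax(scores) @ V        # weighted mix of value vectors

# Tokenize a toy DNA fragment into 3-mers.
tokens = kmer_tokenize("ATGCGTA", k=3)
print(tokens)  # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA']

# Self-attention over stand-in embeddings (one vector per token).
rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (5, 8): one contextualized vector per token
```

In a real biomolecular LLM, the k-mers would be mapped to learned embeddings and the attention block stacked with multi-head projections, residual connections, and feed-forward layers; the pretraining objective (e.g., masked-token prediction) then drives the representation learning the abstract describes.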
Problem
Research questions and friction points this paper is trying to address.
Tags: Large Language Models, Bioinformatics, Scientific Research Advancement

Innovation
Methods, ideas, or system contributions that make the work stand out.
Tags: Large Language Models, Bioinformatics, Drug Discovery
👥 Authors
Jiajia Liu, Ant Group
Mengyuan Yang, Zhejiang University
Yankai Yu, School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
Haixia Xu, The Center of Gerontology and Geriatrics, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China; West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
Kang Li, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
Xiaobo Zhou, Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA; McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA; School of Dentistry, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA