Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models

📅 2024-12-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) exhibit limited understanding and reasoning capabilities over multi-omics sequences (DNA, RNA, proteins, and multi-molecule inputs), hindering their application in computational biology. Method: We introduce Biology-Instructions, the first large-scale instruction-tuning dataset covering DNA, RNA, protein, and multi-molecule tasks, and propose ChatMultiOmics, a baseline trained with a three-stage paradigm comprising pretraining, biological alignment, and instruction fine-tuning. We further design a framework integrating biological sequence embedding modeling with LLM-adapted evaluation. Contribution/Results: Our evaluation shows that general-purpose LLMs, including state-of-the-art ones, perform poorly on multi-omics sequence tasks without domain-specific adaptation. Experiments demonstrate substantial improvements in accuracy and in cross-modal, cross-task generalization. The dataset and baseline model are publicly available as resources for multi-omics foundation model research.

📝 Abstract

Large language models have already demonstrated formidable capabilities in general domains, ushering in a revolutionary transformation. However, using the extensive knowledge of these models to comprehend multi-omics biology remains underexplored. To fill this research gap, we first introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, covering DNA, RNA, proteins, and multi-molecules, designed to bridge the gap between large language models (LLMs) and complex tasks on biological sequences. This dataset can enhance the versatility of LLMs by integrating diverse biological sequence-based prediction tasks with advanced reasoning capabilities, while maintaining conversational fluency. Additionally, we reveal significant performance limitations in even state-of-the-art LLMs on multi-omics biological sequence tasks when they lack specialized pre-training and instruction-tuning. We further develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline, demonstrating a powerful ability to understand biology when trained on Biology-Instructions. Biology-Instructions and ChatMultiOmics are publicly available and serve as crucial resources for enabling more effective integration of LLMs with multi-omics sequence analysis.
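To make the idea of an instruction-tuning record for sequence tasks concrete, here is a minimal sketch of how one prediction example could be wrapped as an instruction/response pair. The field names, the modality tags such as `<dna>`, and the helper `make_record` are all hypothetical illustrations; the abstract does not describe the actual schema of Biology-Instructions.

```python
import json

# Hypothetical modality tags; the real dataset's markup is not given in the abstract.
MODALITY_TAGS = {"dna": "<dna>", "rna": "<rna>", "protein": "<protein>"}


def make_record(modality, sequence, question, answer):
    """Wrap one sequence-prediction example as an instruction/response pair."""
    if modality not in MODALITY_TAGS:
        raise ValueError(f"unknown modality: {modality}")
    tag = MODALITY_TAGS[modality]
    closing = tag.replace("<", "</")  # "<dna>" becomes "</dna>"
    return {
        "modality": modality,
        "instruction": f"{question}\n{tag}{sequence}{closing}",
        "response": answer,
    }


record = make_record(
    "dna",
    "ATGCGTACGTTAGC",  # toy sequence, not taken from the dataset
    "Does this sequence contain a promoter region?",
    "No, this sequence is unlikely to contain a promoter.",  # placeholder label
)
print(json.dumps(record, indent=2))
```

Marking the sequence with explicit tags is one plausible way to let a model distinguish biological tokens from natural-language text; other encodings are equally possible.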

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Biological Sequence Understanding

Multi-omics Data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Biology-Instructions

ChatMultiOmics

Multi-omics Analysis
