FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics

📅 2024-02-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Metagenomic gene functional modeling faces several challenges: weak k-mer representational capacity, ambiguous gene–function mappings, and difficulty modeling one-to-many and many-to-one relationships. To address these, the authors propose FGBERT, the first protein-function-driven pre-trained gene language model. The method introduces a novel protein-aware gene tokenizer, Masked Gene Modeling (MGM), and Triplet Enhanced Metagenomic Contrastive Learning (TMC), enabling the first joint representation of gene sequences, structures, and functions. Built on the Transformer architecture, the model integrates bioinformatic priors and is pre-trained on large-scale metagenomic sequences. It achieves state-of-the-art performance across four hierarchical levels: gene, functional annotation, species classification, and environmental context prediction. Biologically, the model identifies ATP synthase complexes and resolves operon structures, demonstrating strong interpretability and practical utility in functional genomics.
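The TMC objective named above is a contrastive scheme that ties gene embeddings to function. The paper's exact formulation is not reproduced here; the sketch below shows the generic triplet margin loss that such schemes build on, where an anchor gene embedding is pulled toward a functionally similar gene (positive) and pushed away from a dissimilar one (negative). All names and values are illustrative assumptions, not FGBERT's actual implementation.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss over gene embeddings.

    Encourages d(anchor, positive) + margin <= d(anchor, negative),
    i.e. functionally similar genes end up closer in embedding space
    than dissimilar ones. Illustrative sketch only; FGBERT's TMC
    objective may differ in its exact formulation.
    """
    d_pos = np.linalg.norm(anchor - positive)  # distance to similar gene
    d_neg = np.linalg.norm(anchor - negative)  # distance to dissimilar gene
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings (hypothetical): the positive is near the anchor,
# the negative is far away, so the margin is already satisfied.
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.0])
loss = triplet_margin_loss(anchor, positive, negative)
```

When the triplet is well separated, as in the toy example, the loss is zero; swapping the positive and negative roles produces a large positive loss that would drive a gradient update.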

📝 Abstract
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer representations, which limit the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle with encoding biologically meaningful genes and fail to address the One-to-Many and Many-to-One relationships inherent in metagenomic data. To overcome these challenges, we introduce FGBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGBERT incorporates Masked Gene Modeling (MGM) to enhance the understanding of inter-gene contextual relationships and Triplet Enhanced Metagenomic Contrastive Learning (TMC) to elucidate gene sequence–function relationships. Pre-trained on over 100 million metagenomic sequences, FGBERT demonstrates superior performance on metagenomic datasets at four levels, spanning gene, functional, bacterial, and environmental levels and ranging from 1k to 213k input sequences. Case studies of ATP Synthase and Gene Operons highlight FGBERT's capability for functional recognition and its biological relevance in metagenomic research.
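MGM, as described in the abstract, applies BERT-style masking at the gene-token level rather than the nucleotide k-mer level. A minimal sketch of that masking step is shown below, assuming a BERT-like scheme with hypothetical token names and a 15% mask rate; the actual FGBERT procedure and hyperparameters may differ.

```python
import random

def mask_gene_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """BERT-style masking over a sequence of gene-level tokens.

    Randomly replaces a fraction of tokens with a mask symbol and
    records the originals as prediction targets, so a Transformer can
    be trained to recover each masked gene from its genomic context.
    Minimal sketch only; FGBERT's MGM may use a different recipe.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # original token to predict at position i
        else:
            masked.append(tok)
    return masked, targets

# Hypothetical gene-token sequence (names are placeholders).
tokens = [f"gene{i}" for i in range(20)]
masked, targets = mask_gene_tokens(tokens)
```

Training then minimizes the cross-entropy of predicting each entry of `targets` from the masked sequence, which is what lets the model learn inter-gene contextual relationships.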
Problem

Research questions and friction points this paper is trying to address.

Metagenomic Data Analysis
Gene Structure and Function
Inter-gene Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

FGBERT
Metagenomic Data
Masked Gene Modeling
Chenrui Duan
AI Lab, Research Center for Industries of the Future, Westlake University; Zhejiang University
Z. Zang
Zhejiang University
Yongjie Xu
AI Lab, Research Center for Industries of the Future, Westlake University; Zhejiang University
Hang He
East China Normal University
Zihan Liu
AI Lab, Research Center for Industries of the Future, Westlake University; Zhejiang University
Siyuan Li
AI Lab, Research Center for Industries of the Future, Westlake University; Zhejiang University
Zijia Song
National University of Defense Technology
Ju-Sheng Zheng
Zhejiang University
Stan Z. Li
Zhejiang University