🤖 AI Summary
Metagenomic gene functional modeling faces several challenges: the weak representational capacity of k-mers, ambiguous gene-function mappings, and difficulty modeling one-to-many and many-to-one relationships. To address these, we propose the first protein-function-driven pre-trained gene language model. Our method introduces a novel protein-aware gene tokenizer, masked gene modeling (MGM), and triplet-enhanced metagenomic contrastive learning (TMC), enabling the first joint representation of gene sequences, structures, and functions. Built on the Transformer architecture, the model integrates bioinformatic priors and is pre-trained on large-scale metagenomic sequences. It achieves state-of-the-art performance across four hierarchical task levels: gene, function, species, and environment. Biologically, the model successfully identifies ATP synthase complexes and resolves operon structures, demonstrating strong interpretability and practical utility in functional genomics.
📝 Abstract
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments such as oceans and soils, and significantly impact human health and ecological functions. However, current research relies on k-mer representations, which limit the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle to encode biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGBERT incorporates Masked Gene Modeling (MGM) to enhance the understanding of inter-gene contextual relationships and Triplet Enhanced Metagenomic Contrastive Learning (TMC) to elucidate gene sequence-function relationships. Pre-trained on over 100 million metagenomic sequences, FGBERT demonstrates superior performance on metagenomic datasets at four levels (gene, function, bacterium, and environment), with input scales ranging from 1k to 213k sequences. Case studies of ATP synthase and gene operons highlight FGBERT's capability for functional recognition and its biological relevance in metagenomic research.
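The two pre-training objectives can be sketched in miniature. The following is a hedged illustration, not the paper's implementation: the masking rate, margin value, and function names are assumptions following common BERT-style and triplet-loss conventions, and real gene-token embeddings are replaced by plain numpy vectors.

```python
import numpy as np

def mask_gene_tokens(tokens, mask_id=0, mask_rate=0.15, rng=None):
    """Masked Gene Modeling (MGM), BERT-style: randomly replace a fraction
    of gene tokens with a mask token; the model is then trained to recover
    the original tokens from their inter-gene context.
    (mask_rate=0.15 is a conventional choice, not taken from the paper.)"""
    rng = rng or np.random.default_rng(0)
    tokens = np.asarray(tokens)
    mask = rng.random(len(tokens)) < mask_rate
    masked = tokens.copy()
    masked[mask] = mask_id
    return masked, mask  # training target: predict tokens[mask]

def triplet_contrastive_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style contrastive objective (in the spirit of TMC): pull the
    anchor gene embedding toward a functionally similar gene (positive) and
    push it away from a dissimilar one (negative) by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

For example, an anchor embedding that already sits far from its negative and on top of its positive incurs zero loss, while a flipped triplet is penalized by the full distance gap plus the margin.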