🤖 AI Summary
Metagenomic gene functional modeling faces several challenges: the weak representational capacity of k-mers, ambiguous gene-function mappings, and difficulty modeling one-to-many and many-to-one relationships. To address these, we propose the first protein-function-driven pre-trained gene language model. Our method introduces a novel protein-aware gene tokenizer, masked gene modeling (MGM), and triplet-enhanced metagenomic contrastive learning (TMC), enabling the first joint representation of gene sequences, structures, and functions. Built on the Transformer architecture, the model integrates bioinformatic priors and is pre-trained on large-scale metagenomic sequences. It achieves state-of-the-art performance across four hierarchical task levels: gene, function, species, and environment. Biologically, the model successfully identifies ATP synthase complexes and resolves operon structures, demonstrating strong interpretability and practical utility in functional genomics.
📝 Abstract
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments such as oceans and soils, and significantly impact human health and ecological functions. However, current research relies on k-mer representations, which limit the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle to encode biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGBERT incorporates Masked Gene Modeling (MGM) to enhance the understanding of inter-gene contextual relationships and Triplet Enhanced Metagenomic Contrastive Learning (TMC) to elucidate gene sequence-function relationships. Pre-trained on over 100 million metagenomic sequences, FGBERT demonstrates superior performance on metagenomic datasets at four levels (gene, function, bacterium, and environment), with input scales ranging from 1k to 213k sequences. Case studies of ATP synthase and gene operons highlight FGBERT's capability for functional recognition and its biological relevance in metagenomic research.
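The two pre-training objectives can be sketched in miniature. The following is a hedged illustration, not the paper's implementation: the masking rate, margin value, and function names are assumptions following common BERT-style and triplet-loss conventions, and real gene-token embeddings are replaced by plain numpy vectors.

```python
import numpy as np

def mask_gene_tokens(tokens, mask_id=0, mask_rate=0.15, rng=None):
    """Masked Gene Modeling (MGM), BERT-style: randomly replace a fraction
    of gene tokens with a mask token; the model is then trained to recover
    the original tokens from their inter-gene context.
    (mask_rate=0.15 is a conventional choice, not taken from the paper.)"""
    rng = rng or np.random.default_rng(0)
    tokens = np.asarray(tokens)
    mask = rng.random(len(tokens)) < mask_rate
    masked = tokens.copy()
    masked[mask] = mask_id
    return masked, mask  # training target: predict tokens[mask]

def triplet_contrastive_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style contrastive objective (in the spirit of TMC): pull the
    anchor gene embedding toward a functionally similar gene (positive) and
    push it away from a dissimilar one (negative) by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

For example, an anchor embedding that already sits far from its negative and on top of its positive incurs zero loss, while a flipped triplet is penalized by the full distance gap plus the margin.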