AI Summary
Traditional sequence models struggle to capture long-range dependencies and biologically relevant structural features in DNA, limiting their effectiveness in gene function prediction and regulatory mechanism inference. To address this, we propose the first biology-informed foundation model for long DNA sequences. Our approach incorporates Groove Fusion to encode DNA's 3D groove geometry and gated reverse-complement (GRC) modeling to explicitly represent double-stranded symmetry, and it combines multi-scale attention with an evolutionary training strategy for unified prokaryotic and eukaryotic genome modeling. We concurrently release the first long-sequence DNA benchmark dataset specifically designed for coding sequence (CDS) annotation. On gene function prediction and regulatory element identification tasks, our model significantly outperforms state-of-the-art methods, achieving superior accuracy and generalization across diverse genomic contexts. This work advances the practical deployment of long-sequence genomic foundation models.
Abstract
The modeling of genomic sequences presents unique challenges due to their length and structural complexity. Traditional sequence models struggle to capture long-range dependencies and biological features inherent in DNA. In this work, we propose TrinityDNA, a novel DNA foundation model designed to address these challenges. The model integrates biologically informed components, including Groove Fusion for capturing DNA's structural features and Gated Reverse Complement (GRC) to handle the inherent symmetry of DNA sequences. Additionally, we introduce a multi-scale attention mechanism that allows the model to attend to varying levels of sequence dependencies, and an evolutionary training strategy that progressively adapts the model to both prokaryotic and eukaryotic genomes. TrinityDNA provides a more accurate and efficient approach to genomic sequence modeling, offering significant improvements in gene function prediction, regulatory mechanism discovery, and other genomics applications. Our model bridges the gap between machine learning techniques and biological insights, paving the way for more effective analysis of genomic data. We also introduce a new long-sequence DNA benchmark for coding sequence (CDS) annotation, to make evaluation more comprehensive and more oriented toward practical applications.
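The strand symmetry that GRC targets can be illustrated with a small sketch. The reverse-complement operation below is standard. The `gated_combine` function, the purine-indicator feature, and the scalar `gate_logit` are hypothetical placeholders chosen for illustration; the abstract does not specify how TrinityDNA's gate is actually built.

```python
import math

# Watson-Crick base pairing; N (unknown base) maps to itself.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def purine_features(seq: str) -> list:
    """Toy per-position feature (assumption): 1.0 for purines (A/G), else 0.0."""
    return [1.0 if base in "AG" else 0.0 for base in seq.upper()]


def gated_combine(forward: list, reverse: list, gate_logit: float) -> list:
    """Hypothetical gate: blend forward- and reverse-strand features.

    A sigmoid turns gate_logit into a weight g in (0, 1); each output
    position is g * forward + (1 - g) * reverse.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * f + (1.0 - g) * r for f, r in zip(forward, reverse)]


seq = "AACG"
rc = reverse_complement(seq)

fwd = purine_features(seq)
# Reverse the reverse-strand features so position i refers to the same base pair.
rev = purine_features(rc)[::-1]

blended = gated_combine(fwd, rev, gate_logit=0.0)
print(rc, fwd, rev, blended)
```

With a gate logit of 0 both strands are weighted equally. In a trained model the gate would presumably be learned and depend on the input, but that is an assumption rather than something stated here.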