TrinityDNA: A Bio-Inspired Foundational Model for Efficient Long-Sequence DNA Modeling

📅 2025-07-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional sequence models struggle to capture long-range dependencies and biologically relevant structural features in DNA, which limits their effectiveness in gene function prediction and regulatory mechanism inference. To address this, we propose TrinityDNA, a biology-informed foundation model for long DNA sequences. The model combines three components: Groove Fusion, which encodes structural features of DNA such as its groove geometry; Gated Reverse Complement (GRC) modeling, which explicitly represents double-stranded symmetry; and multi-scale attention, trained with an evolutionary strategy that progressively adapts the model to both prokaryotic and eukaryotic genomes. We also release a new long-sequence DNA benchmark designed for coding sequence (CDS) annotation. The model is reported to deliver significant improvements in gene function prediction and regulatory mechanism discovery, supporting the practical use of long-sequence genomic foundation models.

📝 Abstract

The modeling of genomic sequences presents unique challenges due to their length and structural complexity. Traditional sequence models struggle to capture long-range dependencies and biological features inherent in DNA. In this work, we propose TrinityDNA, a novel DNA foundational model designed to address these challenges. The model integrates biologically informed components, including Groove Fusion for capturing DNA's structural features and Gated Reverse Complement (GRC) to handle the inherent symmetry of DNA sequences. We also introduce a multi-scale attention mechanism that allows the model to attend to dependencies at varying sequence scales, and an evolutionary training strategy that progressively adapts the model to both prokaryotic and eukaryotic genomes. TrinityDNA provides a more accurate and efficient approach to genomic sequence modeling, offering significant improvements in gene function prediction, regulatory mechanism discovery, and other genomics applications. Our model bridges the gap between machine learning techniques and biological insights, paving the way for more effective analysis of genomic data. Finally, we introduce a new long-sequence DNA CDS annotation benchmark to make evaluation more comprehensive and more oriented toward practical applications.
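
The abstract does not spell out how GRC works, so the following is only a minimal sketch of the underlying idea, not the authors' implementation. It shows what the reverse-complement symmetry of DNA means and one simple way features from both strands might be blended through a gate. The helper names (`reverse_complement`, `gc_features`, `gated_fusion`) and the fixed scalar gate are assumptions made for illustration; a trained model would presumably learn its gate values.

```python
import math

# Watson-Crick pairing; "N" (unknown base) maps to itself.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def gc_features(seq: str) -> list:
    """Toy per-position feature: 1.0 for G or C, 0.0 otherwise."""
    return [1.0 if base in "GC" else 0.0 for base in seq]


def gated_fusion(forward, reverse, gate_logit: float) -> list:
    """Blend two equal-length feature vectors with a sigmoid gate.

    Hypothetical helper: the gate is a single fixed scalar here,
    purely to illustrate the blending step.
    """
    if len(forward) != len(reverse):
        raise ValueError("feature vectors must have the same length")
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * f + (1.0 - g) * r for f, r in zip(forward, reverse)]


seq = "ATGCGT"
rc = reverse_complement(seq)

# Reverse-strand features are flipped back so positions line up
# with the forward strand before blending.
fused = gated_fusion(gc_features(seq), gc_features(rc)[::-1], gate_logit=0.0)
```

Because G and C are complements of each other, this toy GC feature is identical on both strands once the positions are aligned, so the fused vector equals the forward features regardless of the gate. Features that do differ between strands are where a gate would actually matter.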

Problem

Research questions and friction points this paper is trying to address.

Modeling long genomic sequences with structural complexity

Capturing long-range dependencies and biological DNA features

Improving gene prediction and regulatory mechanism discovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bio-inspired model with Groove Fusion

Multi-scale attention for sequence dependencies

Evolutionary training for diverse genomes
