TrinityDNA: A Bio-Inspired Foundational Model for Efficient Long-Sequence DNA Modeling

📅 2025-07-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Traditional sequence models struggle to capture long-range dependencies and biologically relevant structural features in DNA, which limits their effectiveness in gene function prediction and regulatory mechanism inference. To address this, we propose TrinityDNA, a biology-informed foundation model for long DNA sequences. The model combines four components: Groove Fusion, which encodes DNA's 3D groove geometry; Gated Reverse Complement (GRC) modeling, which explicitly represents double-stranded symmetry; multi-scale attention; and an evolutionary training strategy for unified prokaryotic and eukaryotic genome modeling. We also release a new long-sequence DNA benchmark dataset designed for coding sequence (CDS) annotation. On gene function prediction and regulatory element identification tasks, the model reports significant improvements over existing methods, with better accuracy and generalization across diverse genomic contexts.

📝 Abstract
The modeling of genomic sequences presents unique challenges due to their length and structural complexity. Traditional sequence models struggle to capture long-range dependencies and biological features inherent in DNA. In this work, we propose TrinityDNA, a novel DNA foundational model designed to address these challenges. The model integrates biologically informed components, including Groove Fusion for capturing DNA's structural features and Gated Reverse Complement (GRC) to handle the inherent symmetry of DNA sequences. We also introduce a multi-scale attention mechanism that allows the model to attend to varying levels of sequence dependencies, and an evolutionary training strategy that progressively adapts the model to both prokaryotic and eukaryotic genomes. TrinityDNA provides a more accurate and efficient approach to genomic sequence modeling, offering significant improvements in gene function prediction, regulatory mechanism discovery, and other genomics applications. Our model bridges the gap between machine learning techniques and biological insights, paving the way for more effective analysis of genomic data. Finally, we introduce a new long-sequence DNA benchmark for CDS annotation to make evaluations more comprehensive and oriented toward practical applications.
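
To make the reverse-complement symmetry mentioned above concrete, here is a minimal Python sketch. It is not TrinityDNA's GRC module, whose internals are not described on this page. It computes a reverse complement and blends a strand-dependent score from both strands through a sigmoid gate. The names `toy_score`, `gated_rc_score`, and `gate_logit` are hypothetical, and the adenine-fraction score is only a stand-in for a learned encoder.

```python
import math

# Watson-Crick complement table (N stays N); case is preserved.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a learned encoder: fraction of adenines.

    The value is strand-dependent, because the reverse complement
    turns every T into an A and vice versa.
    """
    return seq.upper().count("A") / len(seq)


def gated_rc_score(seq: str, gate_logit: float = 0.0) -> float:
    """Blend forward and reverse-complement scores with a sigmoid gate.

    Assumption: a single scalar gate. A real model would likely learn
    the gate per position or per channel.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    forward = toy_score(seq)
    backward = toy_score(reverse_complement(seq))
    return gate * forward + (1.0 - gate) * backward
```

With `gate_logit=0.0` the gate equals 0.5, so a sequence and its reverse complement receive the same blended score. Any other gate value weights the two strands unequally.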

Problem

Research questions and friction points this paper is trying to address.

Modeling long genomic sequences with structural complexity
Capturing long-range dependencies and biological DNA features
Improving gene prediction and regulatory mechanism discovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bio-inspired model with Groove Fusion
Multi-scale attention for sequence dependencies
Evolutionary training for diverse genomes
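
As a rough illustration of what attending at several scales can mean, the sketch below runs a toy local attention over a list of scalar features with a few window sizes and averages the results. This is an assumption-laden example, not the paper's mechanism: the function names, the window sizes `(1, 4, 16)`, the product-based attention scores, and the equal-weight averaging across scales are all hypothetical choices.

```python
import math


def windowed_attention(values, window):
    """Toy local attention: each position attends to neighbours within `window`.

    Scores are products of scalar features (a hypothetical stand-in for
    query-key dot products), normalised with a softmax.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = [values[i] * values[j] for j in range(lo, hi)]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed = sum(w * values[j] for w, j in zip(weights, range(lo, hi)))
        out.append(mixed / total)
    return out


def multi_scale_attention(values, windows=(1, 4, 16)):
    """Average the outputs of local attention computed at several scales."""
    per_scale = [windowed_attention(values, w) for w in windows]
    return [sum(column) / len(windows) for column in zip(*per_scale)]
```

Small windows keep each output close to its immediate neighbours, while large windows let distant positions contribute. Because every output is a convex combination of the inputs, it always stays within the range of the input values.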

Authors

Qirong Yang (BioMap Research)
Yucheng Guo (Princeton University; Stochastic Analysis, Partial Differential Equations, Mathematical Finance)
Zicheng Liu (BioMap Research; AI Lab, Research Center for Industries of the Future, Westlake University)
Yujie Yang (BioMap Research)
Qijin Yin (BioMap Research)
Siyuan Li (BioMap Research; AI Lab, Research Center for Industries of the Future, Westlake University)
Shaomin Ji (BioMap Research)
Linlin Chao (BioMap Research)
Xiaoming Zhang (BioMap Research)
Stan Z. Li (BioMap Research; AI Lab, Research Center for Industries of the Future, Westlake University)