MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Genomic sequence modeling faces two fundamental challenges: highly non-uniform information density and the absence of natural minimal lexical units, rendering conventional single-base or static DNA tokenization approaches ill-suited to genomic structural complexity. To address this, we propose a unified framework integrating dynamic tokenization with context-aware pretraining. We introduce differentiable Token Merging—the first such application in genomics—to enable adaptive base-level aggregation. We further design a latent-variable Transformer with hierarchical attention, jointly enforcing local window constraints and global contextual modeling to support selective token identification and reconstruction. Our framework jointly optimizes tokenization and representation learning via two end-to-end objectives: merged-token reconstruction and adaptive masked modeling. Evaluated across three major DNA benchmarks and multi-omics tasks, our method consistently outperforms state-of-the-art tokenization strategies and large-scale DNA foundation models under both fine-tuning and zero-shot settings.
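The summary contrasts local window constraints in the merging layers with global full attention in the latent Transformer. As a rough illustration (not the paper's code), the local-window constraint can be expressed as a banded boolean attention mask; the global-context stage would correspond to an all-True mask. The function name and signature are illustrative only:

```python
import numpy as np

def local_window_mask(n, w):
    """Boolean attention mask of shape (n, n) where position i may
    attend only to positions j with |i - j| <= w. A full-attention
    (global-context) layer would instead use an all-True mask."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w
```

Such a mask would be passed to an attention layer to zero out (or -inf out) scores outside each token's local window.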

📝 Abstract
Modeling genomic sequences faces two unsolved challenges: information density varies widely across regions, and there is no clearly defined minimum vocabulary unit. Relying on either the four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. Architecturally, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of differentiable token merging blocks with local-window constraints; a Latent Encoder then captures the global context of these merged words with full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative content. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks under fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
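The abstract's "chunking adjacent bases into words" can be pictured, in a much-simplified form, as repeatedly averaging the most similar adjacent base embeddings. The sketch below uses hard greedy merging of adjacent pairs for readability; the paper's blocks are differentiable relaxations, and all names here are illustrative, not taken from the released code:

```python
import numpy as np

def merge_adjacent_tokens(x, r):
    """One hard merging step restricted to adjacent pairs.

    x : (n, d) array of token embeddings
    r : number of adjacent pairs to merge (sequence shrinks by r)
    Returns an (n - r, d) array of merged embeddings.
    """
    # Cosine similarity between each token and its right neighbour,
    # a crude stand-in for the paper's local-window constraint.
    norm = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = (norm[:-1] * norm[1:]).sum(axis=1)            # (n - 1,)

    # Greedily pick the r most similar non-overlapping adjacent pairs.
    taken = np.zeros(len(x), dtype=bool)
    merged = set()
    for i in np.argsort(-sim):
        if not taken[i] and not taken[i + 1]:
            merged.add(int(i))
            taken[i] = taken[i + 1] = True
            if len(merged) == r:
                break

    # Rebuild the sequence, averaging each merged pair.
    out, i = [], 0
    while i < len(x):
        if i in merged:
            out.append((x[i] + x[i + 1]) / 2)
            i += 2
        else:
            out.append(x[i])
            i += 1
    return np.stack(out)
```

Stacking several such steps shrinks a base-level sequence into progressively coarser "words", with low-information stretches (highly similar neighbours) merging first.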
Problem

Research questions and friction points this paper is trying to address.

Modeling genomic sequences with varying information density across regions
Addressing undefined minimum vocabulary units in DNA tokenization
Adapting to varying complexities beyond naive masked language modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic tokenization adapts to genomic information density
Hierarchical architecture jointly optimizes tokenizer and Transformers
Context-aware pre-training tasks filter and reconstruct tokens
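The filter-then-reconstruct idea in the last bullet suggests a simple selection rule: mask the tokens the model currently reconstructs worst, then train it to predict exactly those. The helper below is a hypothetical sketch of such adaptive selection, not the authors' implementation:

```python
import numpy as np

def select_mask_indices(recon_error, mask_ratio=0.3):
    """Pick the hardest-to-reconstruct tokens as masking targets.

    recon_error : (n,) per-token loss from the reconstruction objective
    mask_ratio  : fraction of tokens to mask (at least one)
    Returns the indices of the top `mask_ratio` fraction, which an
    adaptive masked-modeling task would then learn to predict.
    """
    k = max(1, int(len(recon_error) * mask_ratio))
    return np.argsort(-recon_error)[:k]
```

Compared with uniform random masking, this concentrates the masked-modeling signal on informative, high-error regions rather than low-complexity repeats.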
👥 Authors
Siyuan Li · Zhejiang University, Hangzhou, China
Kai Yu · AI Lab, Research Center for Industries of the Future, Westlake University, China
Anna Wang · AI Lab, Research Center for Industries of the Future, Westlake University, China
Zicheng Liu · Zhejiang University, Hangzhou, China
Chang Yu · AI Lab, Research Center for Industries of the Future, Westlake University, China
Jingbo Zhou · Zhejiang University, Hangzhou, China
Qirong Yang · BioMap Research, Beijing, China
Yucheng Guo · Princeton University (Stochastic Analysis, Partial Differential Equations, Mathematical Finance)
Xiaoming Zhang · BioMap Research, Beijing, China
Stan Z. Li · AI Lab, Research Center for Industries of the Future, Westlake University, China