EvoLen: Evolution-Guided Tokenization for DNA Language Model

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current DNA language models employ tokenization methods that overlook fundamental biological properties, limiting their ability to capture sequence patterns shaped by functional and evolutionary constraints. This work proposes EvoLen, the first approach to directly integrate cross-species evolutionary conservation into the tokenization process. EvoLen trains multiple BPE tokenizers in an evolutionarily stratified manner and merges their vocabularies based on preserved functional motifs. Coupled with length-aware dynamic programming decoding, this framework yields DNA representations with enhanced biological interpretability. Experimental results demonstrate that EvoLen matches or surpasses standard BPE across multiple benchmarks, significantly improving retention of functional sequences, discrimination of genomic context, and alignment with evolutionary constraints.

📝 Abstract
Tokens serve as the basic units of representation in DNA language models (DNALMs), yet their design remains underexplored. Unlike natural language, DNA lacks inherent token boundaries or predefined compositional rules, making tokenization a fundamental modeling decision rather than a naturally specified one. While existing approaches like byte-pair encoding (BPE) excel at capturing token structures that reflect human-generated linguistic regularities, DNA is organized by biological function and evolutionary constraint rather than linguistic convention. We argue that DNA tokenization should prioritize functional sequence patterns such as regulatory motifs: short, recurring segments under evolutionary constraint that are typically preserved across species. We incorporate evolutionary information directly into the tokenization process through EvoLen, a tokenizer that combines evolutionary stratification with length-aware decoding to better preserve motif-scale functional sequence units. EvoLen uses cross-species evolutionary signals to group DNA sequences, trains separate BPE tokenizers on each group, merges the resulting vocabularies via a rule prioritizing preserved patterns, and applies length-aware decoding with dynamic programming. Through controlled experiments, EvoLen improves the preservation of functional sequence patterns, differentiation across genomic contexts, and alignment with evolutionary constraint, while matching or outperforming standard BPE across diverse DNALM benchmarks. These results demonstrate that tokenization introduces a critical inductive bias and that incorporating evolutionary information yields more biologically meaningful and interpretable sequence representations.
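The abstract's final pipeline step, length-aware decoding with dynamic programming, can be illustrated with a minimal sketch. The paper does not give its exact objective, so the scoring rule below (rewarding longer tokens via squared length, so motif-scale units beat runs of single nucleotides) and the function name `dp_decode` are illustrative assumptions, not the authors' implementation.

```python
def dp_decode(seq, vocab, max_len=8):
    """Segment seq into vocabulary tokens, maximizing sum of len(token)**2.

    Favoring longer tokens keeps motif-scale units (e.g. a TATA box)
    intact instead of splitting them into single nucleotides.
    """
    n = len(seq)
    best = [float("-inf")] * (n + 1)  # best[i] = best score for seq[:i]
    best[0] = 0.0
    back = [None] * (n + 1)           # back[i] = start of the last token
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            tok = seq[j:i]
            if tok in vocab and best[j] + len(tok) ** 2 > best[i]:
                best[i] = best[j] + len(tok) ** 2
                back[i] = j
    if best[n] == float("-inf"):
        raise ValueError("sequence cannot be segmented with this vocabulary")
    # Reconstruct the tokenization by walking the back-pointers.
    tokens, i = [], n
    while i > 0:
        j = back[i]
        tokens.append(seq[j:i])
        i = j
    return tokens[::-1]

# Toy merged vocabulary: single nucleotides plus a few multi-base tokens.
vocab = {"A", "C", "G", "T", "GC", "CG", "TATA", "TATAAA"}
print(dp_decode("GCTATAAAGC", vocab))  # ['GC', 'TATAAA', 'GC']
```

Because single nucleotides are always in the vocabulary, every sequence remains decodable; the DP simply prefers segmentations that preserve longer, conserved patterns when they are available.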
Problem

Research questions and friction points this paper is trying to address.

DNA language model
tokenization
evolutionary constraint
regulatory motifs
functional sequence patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

evolution-guided tokenization
DNA language model
regulatory motifs
evolutionary constraint
length-aware decoding
Nan Huang (University of California, San Diego)
Xiaoxiao Zhou (Washington University in St. Louis)
Junxia Cui (Washington University in St. Louis)
Mario Tapia-Pacheco (University of California, San Diego)
Tiffany Amariuta (University of California, San Diego)
Yang Li (University of California, San Diego)
Jingbo Shang (Associate Professor, UC San Diego)
Natural Language Processing · Data Mining · Deep Learning · Information Extraction · Weak Supervision