🤖 AI Summary
This work addresses a fundamental trade-off in genomic foundation models: fixed-vocabulary tokenizers disrupt biologically meaningful motifs, while nucleotide-level models incur prohibitive computational costs. To resolve this, the authors propose dnaHNet, a tokenizer-free autoregressive model that employs a differentiable, recursive dynamic chunking mechanism to compress raw DNA sequences into latent tokens end-to-end. This approach preserves biological semantic integrity while substantially reducing computational complexity, enabling efficient long-context modeling. When pretrained on prokaryotic genomes, dnaHNet outperforms state-of-the-art architectures such as StripedHyena2 while delivering more than threefold faster inference. It also demonstrates strong zero-shot performance on protein variant fitness and gene essentiality prediction tasks.
📝 Abstract
Genomic foundation models have the potential to decode DNA syntax, yet face a fundamental trade-off in their input representation. Standard fixed-vocabulary tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotide-level models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end-to-end. Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively, balancing compression with predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency. This recursive chunking yields quadratic FLOP reductions, enabling $>3\times$ inference speedup over Transformers. On zero-shot tasks, dnaHNet achieves superior performance in predicting protein variant fitness and gene essentiality, while automatically discovering hierarchical biological structures without supervision. These results establish dnaHNet as a scalable, interpretable framework for next-generation genomic modeling.
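The abstract's differentiable dynamic chunking can be illustrated with a minimal sketch. The sketch below is an assumption about the mechanism's general shape (in the style of H-Net-like dynamic chunking, where a boundary probability is derived from the dissimilarity of adjacent projected hidden states), not the paper's actual implementation; the function name `dynamic_chunk`, the projections `w_q`/`w_k`, and mean pooling of chunk states are all hypothetical choices for illustration.

```python
import numpy as np

def dynamic_chunk(hidden, w_q, w_k, threshold=0.5):
    """Hypothetical sketch of one dynamic-chunking step.

    hidden: (T, d) per-nucleotide encoder states.
    A boundary probability at each position is derived from the cosine
    dissimilarity between that position's query projection and the
    previous position's key projection; positions whose probability
    exceeds `threshold` open a new latent chunk.
    """
    q = hidden @ w_q  # (T, d) query projections
    k = hidden @ w_k  # (T, d) key projections
    # Cosine similarity between each position and its predecessor.
    qn = q[1:] / np.linalg.norm(q[1:], axis=-1, keepdims=True)
    kn = k[:-1] / np.linalg.norm(k[:-1], axis=-1, keepdims=True)
    cos = (qn * kn).sum(-1)  # (T-1,)
    # Map dissimilarity to a probability in [0, 1]; position 0 is
    # always a boundary so every nucleotide belongs to some chunk.
    p = np.concatenate([[1.0], (1.0 - cos) / 2.0])
    boundaries = p >= threshold
    # Assign each position a chunk id, then pool each chunk's states
    # into a single latent token (mean pooling, for illustration).
    idx = np.cumsum(boundaries) - 1
    latents = np.stack(
        [hidden[idx == i].mean(axis=0) for i in range(idx.max() + 1)]
    )
    return latents, p, boundaries
```

Because the boundary probabilities are smooth functions of the hidden states, a mechanism of this shape can be trained end-to-end, trading sequence length (and hence FLOPs in the downstream model) against predictive accuracy, which is the balance the abstract describes.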