DNACHUNKER: Learnable Tokenization for DNA Language Models

📅 2026-01-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current DNA language models rely on fixed tokenization strategies, which limits their ability to capture the complex structural patterns of functional genomic regions. To address this, this work proposes DNACHUNKER, a learnable, dynamic DNA tokenization mechanism that adaptively segments sequences into variable-length chunks and is pre-trained with masked language modeling on the human reference genome (hg38). DNACHUNKER introduces an H-Net-based dynamic chunking strategy to DNA language modeling: token granularity follows the local sequence context, with finer-grained segmentation in functionally important regions such as promoters and exons to preserve biological detail, and longer chunks in repetitive regions. This adaptive approach improves robustness to mutations and positional shifts. On the Nucleotide Transformer and Genomic Benchmarks suites, DNACHUNKER consistently improves over strong fixed-tokenization baselines.
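The summary does not spell out the chunking rule itself, so the following is only a minimal sketch of how an H-Net-style dynamic chunker might place boundaries. It assumes a cosine-similarity rule in which a new chunk starts wherever adjacent position embeddings are dissimilar. The function names, the 0.5 threshold, and the hand-picked 2-D vectors in `TOY_EMBEDDING` are all illustrative assumptions, not the authors' implementation. A trained model would use learned, context-dependent hidden states instead.

```python
import math

# Hypothetical 2-D vectors standing in for learned, contextual encoder states.
TOY_EMBEDDING = {
    "A": (1.0, 0.0),
    "G": (0.0, 1.0),
    "C": (-1.0, 0.0),
    "T": (0.0, -1.0),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def boundary_probs(embeddings):
    """Assumed rule: p_t = 0.5 * (1 - cos(e_t, e_{t-1})), with p_0 = 1."""
    probs = [1.0]  # the first position always opens a chunk
    for prev, cur in zip(embeddings, embeddings[1:]):
        probs.append(0.5 * (1.0 - cosine(cur, prev)))
    return probs


def chunk(seq, threshold=0.5):
    """Split seq wherever the boundary probability reaches the threshold."""
    probs = boundary_probs([TOY_EMBEDDING[base] for base in seq])
    chunks, start = [], 0
    for t in range(1, len(seq)):
        if probs[t] >= threshold:
            chunks.append(seq[start:t])
            start = t
    chunks.append(seq[start:])
    return chunks


# A repetitive run collapses into one long chunk; a varied stretch is split finely.
chunks = chunk("AAAAGCTA")
print(chunks)  # ['AAAA', 'G', 'C', 'T', 'A']
```

In this toy setting the repeated `A` run produces identical embeddings and therefore no internal boundaries, while the varied tail is segmented base by base. This mirrors, in a very simplified way, the described behaviour of compressing repeats and keeping fine granularity elsewhere.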

📝 Abstract

DNA language models are increasingly used to represent genomic sequence, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical boundaries, making fixed tokenizations a brittle design choice under shifts, indels, and local repeats. We introduce DNACHUNKER, a masked DNA language model that incorporates a learnable adaptive segmentation module to produce context-dependent, variable-length units. Building on a dynamic segmentation procedure, DNACHUNKER learns to allocate finer granularity to functionally enriched regions while compressing repetitive or redundant sequence. We pre-train DNACHUNKER on the human reference genome (hg38) and evaluate it on the Nucleotide Transformer and Genomic Benchmarks, where it consistently improves over strong fixed-tokenization baselines. Further analyses and ablations indicate that the learned segmentation is structured rather than incidental: the model preferentially uses shorter units around promoters and exons, and longer units in repetitive regions, yielding representations that are both mutation-resilient and biologically informed.
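To make the brittleness of fixed tokenization under indels concrete, here is a small illustration using non-overlapping 3-mers, one common fixed scheme. The sequence, the inserted base, and the helper name `kmer_tokenize` are hypothetical and chosen only for this example. They are not taken from the paper.

```python
def kmer_tokenize(seq, k=3):
    """Split seq into non-overlapping k-mers; a trailing partial k-mer is kept."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


reference = "ATGGCCAAGTTC"
# Hypothetical single-base insertion of "T" after the fourth nucleotide.
mutated = reference[:4] + "T" + reference[4:]

ref_tokens = kmer_tokenize(reference)
mut_tokens = kmer_tokenize(mutated)

# Count tokens that are unchanged at the same token position.
shared = sum(a == b for a, b in zip(ref_tokens, mut_tokens))

print(ref_tokens)  # ['ATG', 'GCC', 'AAG', 'TTC']
print(mut_tokens)  # ['ATG', 'GTC', 'CAA', 'GTT', 'C']
print(shared)      # 1
```

A single inserted base shifts the reading frame, so every token downstream of the insertion changes even though the underlying sequence is almost identical. Context-dependent boundaries are one way to avoid tying token identity to absolute position.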

Problem

Research questions and friction points this paper is trying to address.

DNA language models

tokenization

sequence segmentation

functional elements

genomic sequences

Innovation

Methods, ideas, or system contributions that make the work stand out.

learnable tokenization

dynamic chunking

DNA language model