DNACHUNKER: Learnable Tokenization for DNA Language Models

📅 2026-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current DNA language models are constrained by fixed tokenization strategies, which limit their ability to capture the complex structural patterns of functional genomic regions. This work proposes DNACHUNKER, a learnable, dynamic DNA tokenization mechanism that adaptively segments sequences into variable-length semantic chunks, trained via masked language modeling on the human reference genome (hg38). DNACHUNKER introduces, for the first time, an H-Net–based dynamic chunking strategy that adjusts token granularity according to functional importance, applying finer-grained segmentation in critical regions such as promoters and exons to preserve biological detail. This adaptive approach also improves robustness to mutations and positional shifts. Experimental results show that DNACHUNKER significantly outperforms baseline methods on both the Nucleotide Transformer and Genomic Benchmarks suites.
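The dynamic chunking idea summarized above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes an H-Net-style router that places a chunk boundary wherever adjacent per-base hidden states diverge, with the function name `dynamic_chunk`, the cosine-based boundary probability, and the threshold all chosen for illustration.

```python
# Toy sketch of H-Net-style dynamic chunking (illustrative assumptions;
# the paper's actual routing module, projections, and thresholds may differ).
import numpy as np

def dynamic_chunk(embeddings: np.ndarray, threshold: float = 0.25):
    """Segment per-base embeddings into variable-length chunks.

    embeddings: (L, d) array of per-nucleotide hidden states.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    # Cosine similarity between each position and its predecessor.
    a, b = embeddings[:-1], embeddings[1:]
    sim = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )
    # Boundary probability rises as neighbouring states diverge,
    # so repetitive runs compress into long chunks and change-points
    # (e.g. around functional elements) start new, finer chunks.
    p_boundary = 0.5 * (1.0 - sim)
    cuts = [0] + [i + 1 for i, p in enumerate(p_boundary) if p > threshold]
    cuts.append(len(embeddings))
    return list(zip(cuts[:-1], cuts[1:]))

# Example: a repetitive run stays one chunk; a change-point splits it.
emb = np.array([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 3)
print(dynamic_chunk(emb))  # → [(0, 4), (4, 7)]
```

In a trained model the boundary probabilities would be produced by learned projections and supervised end-to-end through the masked language modeling objective, rather than computed directly from raw similarities as here.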

📝 Abstract
DNA language models are increasingly used to represent genomic sequence, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical boundaries, making fixed tokenizations a brittle design choice under shifts, indels, and local repeats. We introduce DNACHUNKER, a masked DNA language model that incorporates a learnable adaptive segmentation module to produce context-dependent, variable-length units. Building on a dynamic segmentation procedure, DNACHUNKER learns to allocate finer granularity to functionally enriched regions while compressing repetitive or redundant sequence. We pre-train DNACHUNKER on the human reference genome (hg38) and evaluate it on the Nucleotide Transformer and Genomic Benchmarks, where it consistently improves over strong fixed-tokenization baselines. Further analyses and ablations indicate that the learned segmentation is structured rather than incidental: the model preferentially uses shorter units around promoters and exons, and longer units in repetitive regions, yielding representations that are both mutation-resilient and biologically informed.
Problem

Research questions and friction points this paper is trying to address.

DNA language models
tokenization
sequence segmentation
functional elements
genomic sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

learnable tokenization
dynamic chunking
DNA language model
biological grammar
masked language modeling
Taewon Kim
Korea Advanced Institute of Science and Technology (KAIST)
Jihwan Shin
Korea Advanced Institute of Science and Technology (KAIST)
Hyomin Kim
Korea Advanced Institute of Science and Technology (KAIST)
Youngmok Jung
INOCRAS
Jonghoon Lee
INOCRAS
Won-Chul Lee
Pfizer
Cancer Genomics · Bioinformatics · Next-Generation Sequencing
Insu Han
Assistant Professor, KAIST
machine learning · matrix analysis · probabilistic inference
Sungsoo Ahn
KAIST
Machine Learning