🤖 AI Summary
Current DNA language models rely on fixed tokenization strategies, which limits their ability to capture the complex structural patterns of functional genomic regions. To address this, this work proposes DNACHUNKER, a learnable, dynamic DNA tokenization mechanism that adaptively segments sequences into variable-length chunks and is pre-trained with masked language modeling on the human reference genome (hg38). DNACHUNKER introduces, for the first time, an H-Net–based dynamic chunking strategy that adjusts token granularity according to functional importance: it segments more finely in regions such as promoters and exons to preserve biological detail, and more coarsely in repetitive sequence. This adaptive approach improves robustness to mutations and positional shifts. In experiments, DNACHUNKER consistently improves over strong fixed-tokenization baselines on the Nucleotide Transformer and Genomic Benchmarks.
📝 Abstract
DNA language models are increasingly used to represent genomic sequences, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical boundaries, which makes fixed tokenization a brittle design choice under shifts, indels, and local repeats. We introduce DNACHUNKER, a masked DNA language model that incorporates a learnable adaptive segmentation module to produce context-dependent, variable-length units. Building on a dynamic segmentation procedure, DNACHUNKER learns to allocate finer granularity to functionally enriched regions while compressing repetitive or redundant sequence. We pre-train DNACHUNKER on the human reference genome (hg38) and evaluate it on the Nucleotide Transformer and Genomic Benchmarks, where it consistently improves over strong fixed-tokenization baselines. Further analyses and ablations indicate that the learned segmentation is structured rather than incidental: the model preferentially uses shorter units around promoters and exons and longer units in repetitive regions, yielding representations that are both mutation-resilient and biologically informed.
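The brittleness of fixed tokenization under indels can be illustrated with a small sketch. It is not the paper's method or any of its baselines: it assumes a simple non-overlapping 3-mer tokenizer, and the function name `kmer_tokenize` and both sequences are hypothetical, chosen only for illustration.

```python
def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Split a sequence into non-overlapping k-mers (illustrative fixed tokenizer).

    A trailing remainder shorter than k is kept as its own token.
    """
    return [seq[i:i + k] for i in range(0, len(seq), k)]


# Hypothetical reference sequence and a variant with one base inserted.
reference = "ATGGCCAAGTTC"
variant = reference[:1] + "T" + reference[1:]  # insert 'T' after the first base

ref_tokens = kmer_tokenize(reference)
var_tokens = kmer_tokenize(variant)

# Count positions where both tokenizations produce the same token.
shared = sum(a == b for a, b in zip(ref_tokens, var_tokens))

print(ref_tokens)
print(var_tokens)
print("tokens unchanged at the same position:", shared)
```

In this toy case the single inserted base moves every later token boundary, so the two token sequences have no positions in common even though the underlying sequences differ by one nucleotide. A segmentation that depends on local context rather than absolute offset is one way to avoid this frame-shift effect.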