Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing language models are constrained by hand-crafted preprocessing steps, such as rule-based or statistical tokenization, which hinder truly end-to-end sequence modeling. To address this, we propose the Hierarchical Network (H-Net), the first architecture enabling joint, end-to-end learning of data-driven, dynamic byte-level chunking together with hierarchical representation. H-Net employs differentiable pooling and multi-stage abstraction to automatically discover semantically coherent segments without relying on prior linguistic knowledge or fixed segmentation rules. Under matched computational budgets, H-Net consistently outperforms strong Byte-Pair Encoding (BPE)-based Transformer baselines, and it scales across diverse modalities, including English, Chinese, source code, and DNA sequences, where it improves data efficiency by nearly 4x. This work establishes a unified, learnable paradigm for raw-sequence modeling, offering a foundation for next-generation multilingual and multimodal foundation models.

📝 Abstract
Despite incredible progress in language models (LMs) in recent years, largely resulting from moving away from specialized models designed for specific tasks to general models based on powerful architectures (e.g., the Transformer) that learn everything from raw data, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism which automatically learns content- and context-dependent segmentation strategies jointly with the rest of the model. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end. When compute- and data-matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a strong Transformer language model operating over BPE tokens. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction, demonstrating significantly better scaling with data and matching a token-based Transformer of twice its size. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision. Finally, the H-Net's improvement over tokenized pipelines is further increased in languages and modalities with weaker tokenization heuristics, such as Chinese and code, or DNA sequences (nearly 4x improvement in data efficiency over baselines), showing the potential of true end-to-end models that learn and scale better from unprocessed data.
Problem

Research questions and friction points this paper is trying to address.

Eliminate tokenization barriers in end-to-end language models
Learn dynamic content-dependent chunking strategies automatically
Improve performance in languages with weak tokenization heuristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic chunking mechanism for content segmentation
Hierarchical network replaces tokenization-LM pipeline
Multi-stage hierarchy improves scaling and performance
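The dynamic chunking idea in the bullets above can be illustrated with a minimal sketch: place a chunk boundary wherever adjacent representations diverge, then pool each chunk into one higher-level vector. This is an illustrative approximation only, not the paper's learned routing module; the cosine-similarity heuristic and the fixed `threshold` stand in for parameters that H-Net learns end-to-end.

```python
# Hypothetical sketch of content-dependent chunking over a sequence of
# byte-level representation vectors (plain Python lists of floats).
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dynamic_chunk(vectors, threshold=0.5):
    """Start a new chunk where adjacent vectors' similarity drops below
    the threshold, then mean-pool each chunk into a single vector."""
    if not vectors:
        return []
    chunks, current = [], [vectors[0]]
    for prev, cur in zip(vectors, vectors[1:]):
        if cosine(prev, cur) < threshold:  # low similarity => boundary
            chunks.append(current)
            current = []
        current.append(cur)
    chunks.append(current)
    dim = len(vectors[0])
    return [[sum(v[d] for v in c) / len(c) for d in range(dim)]
            for c in chunks]
```

For example, `dynamic_chunk([[1, 0], [1, 0.1], [0, 1], [0, 1]])` groups the first two and last two vectors into separate chunks, compressing four low-level positions into two pooled ones. In H-Net itself the boundary decision is learned jointly with the rest of the model rather than fixed by a hand-set threshold.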
Sukjun Hwang
Carnegie Mellon University
Brandon Wang
Cartesia AI
Albert Gu
Carnegie Mellon University
Machine Learning