From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional language models rely on static tokenization schemes (e.g., BPE), which fix the granularity at which text is modeled and limit generalization across languages. Method: We propose a framework that internalizes tokenization as a learnable, hierarchical process: taking raw byte sequences as input, it employs an autoregressive U-Net architecture with multi-scale pooling and hierarchical future prediction to jointly model semantic units, from bytes to words to phrases, end to end. Contribution/Results: By eliminating fixed tokenization, the approach enables dynamic, end-to-end multi-scale representation learning. Experiments show that, under matched pretraining compute budgets, shallow hierarchies match strong BPE baselines, and deeper hierarchies show a promising trend. The model also natively supports low-resource languages and character-level tasks, sidestepping the rigid constraints of conventional tokenization.

📝 Abstract
Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future, anticipating the next few words rather than the next byte, so deeper stages focus on broader semantic patterns while earlier stages handle fine details. When pretraining compute is carefully tuned and controlled, shallow hierarchies tie strong BPE baselines, and deeper hierarchies show a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.
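The multi-scale view described in the abstract (bytes pooled into words, then pairs of words, then groups of four) can be sketched in plain Python. This is an illustrative sketch, not the paper's implementation: the function names are invented, and splitting on space bytes stands in for whatever learned or rule-based boundary predictor the model actually uses.

```python
def byte_segments(text: str) -> list[list[int]]:
    # Stage 1: segment the raw byte stream at word boundaries.
    # Splitting on the space byte is an illustrative assumption.
    segments, current = [], []
    for b in text.encode("utf-8"):
        current.append(b)
        if b == ord(" "):
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

def pool(segments: list[list[int]], group: int) -> list[list[int]]:
    # Merge `group` adjacent segments into one coarser segment,
    # giving the next level of the hierarchy.
    return [sum(segments[i:i + group], []) for i in range(0, len(segments), group)]

# Multi-scale views of one sequence: bytes -> words -> word pairs -> 4-word groups.
words = byte_segments("the quick brown fox jumps over")
pairs = pool(words, 2)
quads = pool(words, 4)
```

Each level halves (or quarters) the sequence length, which is why deeper stages can afford to reason over broader spans for the same compute.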
Problem

Research questions and friction points this paper is trying to address.

Overcoming fixed tokenization granularity in language models
Enabling multi-scale text representation via autoregressive U-Nets
Unifying character-level and cross-lingual tasks within one model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive U-Net for dynamic token embedding
Multi-scale sequence view via byte pooling
Hierarchical prediction for semantic patterns