Bolmo: Byteifying the Next Generation of Language Models

📅 2025-12-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Subword-based language models suffer from weak character-level understanding and from inefficiency caused by their fixed, large vocabularies. To address this, we introduce Bolmo, the first fully open, competitive family of byte-level language models at the 1B and 7B parameter scales. Our approach comprises three key innovations: (1) a lightweight, exact knowledge-distillation paradigm that transfers subword models to byte-level representations; (2) an architecture designed specifically for byte-level modeling that mitigates the expressivity mismatch with subword models; and (3) an efficient conversion procedure requiring less than 1% of the original pretraining token budget. Bolmo combines byte-level modeling, training at high token-compression ratios, and low-cost post-training that reuses the source subword model's existing ecosystem. It achieves significant gains on character-level and some programming tasks, comes close to matching its source subword models on broad benchmarks, maintains inference speed comparable to subword models, and substantially outperforms byte-level baselines of the same scale.

πŸ“ Abstract
We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization - such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary - while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.
Problem

Research questions and friction points this paper is trying to address.

Convert subword-level language models to byte-level without full retraining
Overcome subword tokenization limitations like character understanding
Achieve competitive performance and speed with minimal training cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Byteifies existing subword models for efficiency
Uses exact distillation to reduce training costs
Achieves competitive inference speeds via compression
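The exact-distillation idea above can be illustrated with a minimal sketch. This is a hypothetical toy example, not the paper's implementation: it assumes the byte-level student's scores have already been aggregated onto the teacher's subword vocabulary, so teacher and student both produce logits of shape (batch, vocab), and the objective is the mean KL divergence from teacher to student.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_kl(teacher_logits, student_logits):
    """Mean KL(teacher || student) between next-token distributions.

    Hypothetical sketch of a distillation objective: the byte-level
    student's scores are assumed to already be mapped onto the source
    subword model's vocabulary, so both arrays share the same shape.
    """
    log_p = log_softmax(np.asarray(teacher_logits, dtype=float))
    log_q = log_softmax(np.asarray(student_logits, dtype=float))
    p = np.exp(log_p)
    return float((p * (log_p - log_q)).sum(axis=-1).mean())
```

The loss is zero exactly when the student reproduces the teacher's distribution, which is why such an objective can transfer a subword model's behavior with a small token budget: every position supervises the full output distribution, not just the sampled token.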