Bolmo: Byteifying the Next Generation of Language Models

📅 2025-12-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Subword-based language models suffer from weak character-level understanding and from inefficiency caused by their fixed, large vocabularies. To address this, we introduce Bolmo, the first fully open, competitive family of byte-level language models at the 1B and 7B parameter scales. Our approach comprises three key innovations: (1) a lightweight, exact knowledge-distillation paradigm that transfers subword models to byte-level representations; (2) an architecture designed specifically for byte-level modeling that mitigates the expressivity mismatch with subword models; and (3) an efficient conversion procedure requiring less than 1% of the original pretraining token budget. Bolmo combines byte-level modeling, training at high token-compression ratios, and low-cost post-training that reuses the source subword model's existing ecosystem. It achieves significant gains on character-level and some programming tasks, comes close to matching its source subword models on broad benchmarks, maintains inference speed comparable to subword models, and substantially outperforms byte-level baselines of the same scale.

πŸ“ Abstract
We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization - such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary - while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.
Problem

Research questions and friction points this paper is trying to address.

Convert subword-level language models to byte-level without full retraining
Overcome subword tokenization limitations like character understanding
Achieve competitive performance and speed with minimal training cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Byteifies existing subword models for efficiency
Uses exact distillation to reduce training costs
Achieves competitive inference speeds via compression
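The exact-distillation idea above can be illustrated with a minimal sketch. This is a hypothetical toy example, not the paper's implementation: it assumes the byte-level student's scores have already been aggregated onto the teacher's subword vocabulary, so teacher and student both produce logits of shape (batch, vocab), and the objective is the mean KL divergence from teacher to student.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_kl(teacher_logits, student_logits):
    """Mean KL(teacher || student) between next-token distributions.

    Hypothetical sketch of a distillation objective: the byte-level
    student's scores are assumed to already be mapped onto the source
    subword model's vocabulary, so both arrays share the same shape.
    """
    log_p = log_softmax(np.asarray(teacher_logits, dtype=float))
    log_q = log_softmax(np.asarray(student_logits, dtype=float))
    p = np.exp(log_p)
    return float((p * (log_p - log_q)).sum(axis=-1).mean())
```

The loss is zero exactly when the student reproduces the teacher's distribution, which is why such an objective can transfer a subword model's behavior with a small token budget: every position supervises the full output distribution, not just the sampled token.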