Tokenization with Split Trees

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets two limitations of conventional subword tokenizers: compression efficiency and use of the available context. It proposes Tokenization with Split Trees (ToaST), which greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts and tokenizes by recursively descending that tree. Vocabulary selection is cast as an integer program that minimizes the total token count under this inference procedure; its linear programming relaxation is near-integral in practice, so the resulting vocabularies are near-optimal. On English text, the method reduces token counts by more than 11% relative to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, and substantially improves Rényi efficiency. When training 1.5B-parameter language models, it achieves the highest CORE score, outperforming the baselines by 2.6%–7.6%, and scores best on 13 of 22 individual tasks.
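For readers unfamiliar with the Rényi efficiency metric mentioned above, the following is a minimal sketch of how it is commonly computed: the Rényi entropy of the unigram token distribution divided by the log of the vocabulary size. The function name `renyi_efficiency` and the default `alpha=2.5` are assumptions for illustration; the summary does not state which order α was used.

```python
import math
from collections import Counter


def renyi_efficiency(tokens, vocab_size, alpha=2.5):
    """Rényi entropy of the token unigram distribution, normalised by log(vocab_size).

    alpha=2.5 is an assumed default, not a value taken from the source.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if alpha == 1.0:
        # The limit alpha -> 1 recovers Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return entropy / math.log(vocab_size)


# A uniform distribution over the whole vocabulary is maximally efficient;
# a distribution dominated by one frequent token scores lower.
uniform = ["a", "b", "c", "d"] * 3
skewed = ["a"] * 9 + ["b", "c", "d"]
print(renyi_efficiency(uniform, vocab_size=4))
print(renyi_efficiency(skewed, vocab_size=4))
```

Under this definition, a tokenizer that leans less heavily on a few very frequent tokens (such as common single bytes) has a flatter unigram distribution and therefore a higher score.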

📝 Abstract

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above. Models using this tokenizer therefore process fewer tokens at inference, which extends their effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Rényi efficiency. In experiments training 1.5B-parameter language models, ToaST achieves the highest CORE score, outperforming baselines by 2.6%–7.6%, with statistically significant gains over two of the three baselines, and scoring best on 13 of 22 individual tasks.
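The split-and-descend procedure described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the greedy split criterion, so the rule used here (choose the split point whose rarer half has the highest n-gram count) is an assumption, as are the names `Node`, `ngram_counts`, `build_split_tree`, and `tokenize`. The sketch also assumes single bytes are always emitted as a fallback when no ancestor is in the vocabulary.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    text: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def ngram_counts(pretokens, max_n=16):
    """Count every byte n-gram (up to max_n bytes) across the pretokens."""
    counts = Counter()
    for tok in pretokens:
        for i in range(len(tok)):
            for j in range(i + 1, min(len(tok), i + max_n) + 1):
                counts[tok[i:j]] += 1
    return counts


def build_split_tree(text, counts):
    """Greedily split `text` into a full binary tree whose leaves are single bytes.

    ASSUMED criterion: pick the split whose less frequent half is most frequent.
    The tree depends only on the counts, not on any vocabulary.
    """
    node = Node(text)
    if len(text) > 1:
        k = max(range(1, len(text)),
                key=lambda i: min(counts[text[:i]], counts[text[i:]]))
        node.left = build_split_tree(text[:k], counts)
        node.right = build_split_tree(text[k:], counts)
    return node


def tokenize(node, vocab):
    """Descend the tree and emit the first in-vocabulary node on each path."""
    if node.text in vocab or node.left is None:
        return [node.text]  # in vocabulary, or a single-byte leaf (fallback)
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


corpus = [b"token", b"tokens", b"tokenize", b"broken"]
counts = ngram_counts(corpus)
tree = build_split_tree(b"tokens", counts)
print(tokenize(tree, {b"tokens"}))  # whole pretoken is in the vocabulary
print(tokenize(tree, set()))        # empty vocabulary: falls back to single bytes
```

Because the tree is fixed before any vocabulary is chosen, the tokens emitted for a given vocabulary are fully determined by which tree nodes it contains. That is what allows vocabulary selection to be posed as an optimization over node choices.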

Problem

Research questions and friction points this paper is trying to address.

subword tokenization

compression efficiency

vocabulary optimization

token count reduction

Rényi efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Split Trees

Subword Tokenization

Integer Programming