🤖 AI Summary
Existing tokenizers for language models have low average characters-per-token ratios, so they represent text with excessive token counts and suboptimal training and inference efficiency. To address this, we propose a novel tokenizer design method grounded in graph partitioning, which formulates vocabulary optimization as a length-weighted objective maximization problem. A greedy approximation algorithm is developed to solve it, significantly compressing token sequence lengths while preserving full lexical coverage. Our approach is fully compatible with standard Transformer architectures and serves as a drop-in replacement for conventional tokenizers (e.g., BPE), reducing the embedding layer parameter count and KV cache memory footprint. Experiments demonstrate that, compared to BPE, our method reduces token count by 14–18%, decreases training steps by up to 18.5%, lowers inference latency by 13.7%, improves throughput by 16%, and consistently enhances performance across multiple downstream tasks.
📝 Abstract
We introduce a new tokenizer for language models that minimizes the average number of tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting length-weighted objective maximization as a graph partitioning problem and solving it with a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14--18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% at a vocabulary size of 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at 124M, while consistently improving downstream performance, including an 11.7% reduction in LAMBADA perplexity and a 4.3% gain in HellaSwag accuracy. Moreover, the Length-MAX tokenizer achieves 99.62% vocabulary coverage, and the out-of-vocabulary rate remains low at 0.12% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing -- and often improving -- downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18% at inference.
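To make the core idea concrete, the following is a minimal illustrative sketch of length-weighted greedy vocabulary selection. The paper's actual graph-partitioning formulation and greedy approximation are not specified in this summary, so the scoring rule here (frequency times characters saved, i.e., `count * (len - 1)`) and the function `greedy_length_weighted_vocab` are assumptions for illustration, not the authors' algorithm; single characters are always included to guarantee full coverage, mirroring the coverage property claimed above.

```python
# Illustrative sketch only: the exact Length-MAX objective and graph
# construction are not given here, so this greedy loop is a simplified
# stand-in. Candidates are scored by frequency * (length - 1), the number
# of characters "saved" when the substring becomes a single token.
from collections import Counter

def greedy_length_weighted_vocab(corpus, vocab_size, max_len=8):
    # Count every substring up to max_len characters.
    counts = Counter()
    for text in corpus:
        for i in range(len(text)):
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                counts[text[i:j]] += 1
    # Base vocabulary: all single characters (guarantees full coverage,
    # so the OOV rate on seen character sets is zero).
    vocab = {s for s in counts if len(s) == 1}
    # Greedily add the candidates with the largest length-weighted score.
    candidates = sorted(
        (s for s in counts if len(s) > 1),
        key=lambda s: counts[s] * (len(s) - 1),
        reverse=True,
    )
    for s in candidates:
        if len(vocab) >= vocab_size:
            break
        vocab.add(s)
    return vocab

vocab = greedy_length_weighted_vocab(["the cat sat on the mat"], vocab_size=20)
```

Under this toy scoring, long frequent substrings are preferred over short ones of equal frequency, which is the length-bias that drives the token-count reductions reported above; a real implementation would also need a matching encoder (e.g., longest-match segmentation) to tokenize text with the resulting vocabulary.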