🤖 AI Summary
This work addresses the lack of efficient and precise language data-mixing strategies for building high-performance tokenizers in multilingual large language models. The authors propose TREX, a novel framework that, for the first time, applies regression modeling to optimize the data mixture for tokenizer training. TREX samples random language mixture ratios, trains small-scale proxy tokenizers on them, evaluates their compression performance, and fits a regression model that predicts the optimal mixture ratio, which then guides large-scale tokenizer training. Compared with LLaMA3's mixing strategy and uniform mixing, TREX improves both in-distribution and out-of-distribution compression efficiency by up to 12%, significantly outperforming conventional heuristic and brute-force search methods while remaining accurate, inexpensive, scalable, and robust.
📝 Abstract
Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer's compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TREX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TREX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.
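The pipeline described above can be sketched with synthetic stand-ins. Everything here is illustrative, not the paper's implementation: proxy tokenizer training is mocked by a synthetic scoring function, the learned regression model is replaced by a simple k-nearest-neighbour predictor, and names such as `proxy_compression` and the language list are assumptions.

```python
import random

LANGS = ["en", "zh", "de", "ko"]  # illustrative language set, not from the paper

def random_mixture(k):
    """Sample a random mixture ratio over k languages (non-negative, sums to 1)."""
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def proxy_compression(mix):
    """Mock of "train a small proxy tokenizer on `mix` and measure compression":
    a synthetic score peaked at a pretend optimal mixture."""
    target = [0.4, 0.3, 0.2, 0.1]  # invented optimum for the simulation
    return -sum((m - t) ** 2 for m, t in zip(mix, target))

def knn_predict(train, mix, k=10):
    """Predict compression for a candidate mixture from gathered
    (mixture, score) pairs; a k-NN stand-in for the regression model."""
    nearest = sorted(
        train,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], mix)),
    )[:k]
    return sum(score for _, score in nearest) / k

random.seed(0)

# 1) Evaluate proxy tokenizers on random mixtures, recording compression.
train = []
for _ in range(200):
    m = random_mixture(len(LANGS))
    train.append((m, proxy_compression(m)))

# 2) Use the fitted predictor to search many candidate mixtures cheaply,
#    without training a tokenizer for each one.
candidates = [random_mixture(len(LANGS)) for _ in range(2000)]
best = max(candidates, key=lambda m: knn_predict(train, m))

# 3) `best` would then guide full-scale tokenizer training.
print([round(x, 2) for x in best])
```

The design point is the cost asymmetry: step 1 pays for a modest number of cheap proxy-tokenizer runs, after which step 2 can score thousands of candidate mixtures at negligible cost through the learned predictor.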