🤖 AI Summary
This work addresses the lack of efficient and precise language data-mixing strategies for building high-performance tokenizers in multilingual large language models. The authors propose TREX, a novel framework that, for the first time, applies regression modeling to optimize the data mixture for tokenizer training. TREX samples random language mixture ratios, trains small-scale proxy tokenizers on them, evaluates their compression performance, and fits a regression model that predicts the optimal mixture ratio, which then guides large-scale tokenizer training. Compared with LLaMA3's mixing strategy and uniform mixing, TREX improves both in-distribution and out-of-distribution compression efficiency by up to 12%, significantly outperforming conventional heuristic and brute-force search methods while remaining accurate, inexpensive, scalable, and robust.
📝 Abstract
Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer's compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TREX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TREX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.
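The pipeline described above can be sketched with synthetic stand-ins. Everything here is illustrative, not the paper's implementation: proxy tokenizer training is mocked by a synthetic scoring function, the learned regression model is replaced by a simple k-nearest-neighbour predictor, and names such as `proxy_compression` and the language list are assumptions.

```python
import random

LANGS = ["en", "zh", "de", "ko"]  # illustrative language set, not from the paper

def random_mixture(k):
    """Sample a random mixture ratio over k languages (non-negative, sums to 1)."""
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def proxy_compression(mix):
    """Mock of "train a small proxy tokenizer on `mix` and measure compression":
    a synthetic score peaked at a pretend optimal mixture."""
    target = [0.4, 0.3, 0.2, 0.1]  # invented optimum for the simulation
    return -sum((m - t) ** 2 for m, t in zip(mix, target))

def knn_predict(train, mix, k=10):
    """Predict compression for a candidate mixture from gathered
    (mixture, score) pairs; a k-NN stand-in for the regression model."""
    nearest = sorted(
        train,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], mix)),
    )[:k]
    return sum(score for _, score in nearest) / k

random.seed(0)

# 1) Evaluate proxy tokenizers on random mixtures, recording compression.
train = []
for _ in range(200):
    m = random_mixture(len(LANGS))
    train.append((m, proxy_compression(m)))

# 2) Use the fitted predictor to search many candidate mixtures cheaply,
#    without training a tokenizer for each one.
candidates = [random_mixture(len(LANGS)) for _ in range(2000)]
best = max(candidates, key=lambda m: knn_predict(train, m))

# 3) `best` would then guide full-scale tokenizer training.
print([round(x, 2) for x in best])
```

The design point is the cost asymmetry: step 1 pays for a modest number of cheap proxy-tokenizer runs, after which step 2 can score thousands of candidate mixtures at negligible cost through the learned predictor.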