TREX: Tokenizer Regression for Optimal Data Mixture

📅 2026-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of efficient and precise data-mixing strategies for building high-performance tokenizers for multilingual large language models. To this end, the authors propose TREX, a framework that, for the first time, applies regression modeling to optimize tokenizer data mixing. TREX samples random language mixture ratios, trains small-scale proxy tokenizers on them, evaluates their compression performance, and fits a regression model that predicts the optimal mixture ratio, which then guides large-scale tokenizer training. Compared with LLaMA3's strategy and uniform mixing, TREX improves both in-distribution and out-of-distribution compression efficiency by up to 12%, significantly outperforming conventional heuristic and brute-force search methods while offering high accuracy, low cost, strong scalability, and robustness.

📝 Abstract
Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer's compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TREX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TREX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.
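The pipeline the abstract describes can be sketched end to end: sample mixture ratios on the language simplex, score each with a cheap proxy, fit a regression surrogate, and then search the surrogate for a predicted-best mixture. The sketch below is a minimal illustration, not the paper's implementation: `proxy_compression` is a hypothetical stand-in for training a small proxy tokenizer and measuring its compression, and the quadratic least-squares surrogate is one plausible choice of regression model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_langs = 4  # hypothetical number of languages in the corpus

def proxy_compression(mix):
    """Toy stand-in for training a small-scale proxy tokenizer on
    mixture `mix` and measuring its compression (higher = better).
    In TREX this step trains a real small-vocab tokenizer per sample."""
    # Hypothetical ground truth: languages contribute unequally,
    # with diminishing returns, plus a little evaluation noise.
    weights = np.array([0.5, 0.3, 0.15, 0.05])
    return float(np.sqrt(mix) @ weights + rng.normal(0.0, 0.005))

# 1) Sample random mixture ratios on the simplex; score each via the proxy.
X = rng.dirichlet(np.ones(n_langs), size=200)
y = np.array([proxy_compression(m) for m in X])

# 2) Fit a regression model mapping mixture -> compression
#    (quadratic features + least squares, as a simple surrogate).
def features(M):
    quad = np.stack([np.outer(m, m)[np.triu_indices(n_langs)] for m in M])
    return np.hstack([np.ones((len(M), 1)), M, quad])

coef, *_ = np.linalg.lstsq(features(X), y, rcond=None)

# 3) Search many candidate mixtures with the cheap regressor and keep
#    the predicted-best one to guide large-scale tokenizer training.
candidates = rng.dirichlet(np.ones(n_langs), size=20000)
pred = features(candidates) @ coef
best_mix = candidates[np.argmax(pred)]
print(best_mix.round(3))
```

The key cost saving is in step 3: once the regressor is fit from a few hundred proxy runs, evaluating tens of thousands of candidate mixtures is a matrix multiply rather than tens of thousands of tokenizer trainings.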
Problem

Research questions and friction points this paper is trying to address.

tokenizer
multilingual
data mixture
compression efficiency
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

tokenizer regression
optimal data mixture
multilingual LLMs
compression efficiency
proxy tokenizers
Inho Won (KAIST CT)
Hangyeol Yoo (Seoul National University of Science and Technology)
Minkyung Cho (KAIST CT)
Jungyeul Park (KAIST CT, Upstage AI)
Hoyun Song (Postdoctoral researcher, KAIST)
NLP · Knowledge Integration · Domain-Specific Modeling · LLM
Kyungtae Lim (KAIST CT, KAIST InnoCORE PRISM-AI Center)