One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient language coverage in multilingual large language model (LLM) pretraining and the difficulty of extending post-trained models to new languages, this paper proposes a low-cost intervention early in pretraining: a universal multilingual tokenizer trained on significantly more languages than those present in the pretraining corpus, which enhances the model's "language plasticity." The authors show empirically that such a unified tokenizer substantially improves adaptation to new languages without degrading performance on the primary pretraining languages: win rates for newly added languages improve by up to 20.2%, and even for languages absent from both the tokenizer training data and the pretraining corpus, performance increases by up to 5%. The core contribution is identifying and validating tokenizer generalizability, established early in pretraining, as a decisive factor in a multilingual model's downstream capacity for language expansion.

📝 Abstract
Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve "language plasticity", or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.
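The intuition behind the universal tokenizer can be illustrated with a minimal word-level sketch: a vocabulary built only from the pretraining languages fragments text in an unseen language, while a vocabulary trained over a broader language set covers it. This is a toy stand-in for the paper's approach (real systems train subword/BPE tokenizers, not word lists), and all corpora, language codes, and sizes below are illustrative assumptions, not the paper's data.

```python
from collections import Counter

def train_vocab(corpora, vocab_size=50):
    """Build a word-level vocabulary from the most frequent tokens
    across all supplied corpora (a stand-in for real BPE training)."""
    counts = Counter()
    for sentences in corpora.values():
        for s in sentences:
            counts.update(s.split())
    return {w for w, _ in counts.most_common(vocab_size)}

def fragmentation(vocab, text):
    """Fraction of words outside the vocabulary, i.e. words a real
    subword tokenizer would have to split into many small pieces."""
    words = text.split()
    return sum(w not in vocab for w in words) / len(words)

# Toy corpora: pretraining covers only a few languages, but the
# universal tokenizer is trained on an expanded language set.
pretraining = {
    "en": ["the quick brown fox jumps over the lazy dog"],
    "de": ["der schnelle braune fuchs springt ueber den faulen hund"],
}
extra = {
    "sw": ["mbweha mwepesi wa kahawia anaruka juu ya mbwa mvivu"],
    "tr": ["hizli kahverengi tilki tembel kopegin uzerinden atlar"],
}

specific = train_vocab(pretraining)                 # pretraining languages only
universal = train_vocab({**pretraining, **extra})   # broader language coverage

unseen = "mbweha mwepesi anaruka juu ya mbwa"       # Swahili, unseen by `specific`
print(fragmentation(specific, unseen))   # 1.0 -> every word out-of-vocabulary
print(fragmentation(universal, unseen))  # 0.0 -> fully covered
```

Lower fragmentation for the expanded-coverage vocabulary is the mechanism the paper exploits: tokens for new languages already exist in the vocabulary at pretraining time, so post-training adaptation does not have to work against a hostile segmentation.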
Problem

Research questions and friction points this paper is trying to address.

Improving multilingual LLM adaptation via universal tokenizer design
Addressing limited language coverage in tokenizers post-training
Enhancing model plasticity for unseen languages efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal tokenizer for multilingual pretraining adaptability
Enhanced language plasticity via expanded tokenizer coverage
Improved performance on unseen languages post-training