TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

career value

156K/year

🤖 AI Summary

This work addresses the inefficiencies in tokenization and challenges in knowledge transfer caused by vocabulary mismatch in large language models. It proposes a novel vocabulary adaptation paradigm based on bilingual token alignment, treating source and target vocabularies as two distinct “languages.” The approach learns an alignment dictionary from monolingual token representations, uses it to reparameterize the model, and combines progressive fine-tuning with token-level distillation. Evaluated across 15 languages, the method recovers original model performance in only 1,000 training steps. When further integrated with vocabulary unification, it substantially enhances base model performance using just 235 million tokens of distillation data, while simultaneously improving multilingual text compression rates and preserving the model’s original capabilities.

📝 Abstract

Tokenization is a foundational step in the text process of Large Language Models (LLMs). Texts must be first tokenized into token IDs, which are then input to LLMs. Inefficient tokenization results in long token-ID sequences and will slow down the training and inference of LLMs. The fine-grained knowledge transfer between LLMs, like token-level distillation, is also impeded by the mismatch in vocabulary. To bridge this gap, we introduce a method named TokAlign++ to improve vocabulary adaptation performance by learning better token alignment lexicon. The source and target vocabularies are taken as two different languages, and the bilingual token alignment lexicon is learned from monolingual token representations. Model parameters are rearranged following this bilingual lexicon for new vocabulary, and progressively fine-tuned for adaptation. Experimental results on 15 languages show that our method boosts the multilingual text compression rates and preserves most of the multilingual ability of vanilla models. It costs as few as 1k steps to restore the performance of the vanilla model. After unifying vocabularies between vanilla models, token-level distillation remarkably improves the base model with only 235M tokens.

Problem

Research questions and friction points this paper is trying to address.

tokenization

vocabulary mismatch

token-level distillation

large language models

text compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

token alignment

vocabulary adaptation

multilingual compression