🤖 AI Summary
This work addresses an inefficiency of large language models on non-Latin scripts, where excessive token fragmentation raises the cost and latency of inference. The authors study interpretability-based vocabulary expansion along two decisions: which items to add and how to initialize their input and output embeddings. They show that interpretability-based selection of vocabulary items offers a better trade-off between performance and token efficiency than conventional frequency-based selection, and that interpretability-based embedding initialization yields gains of roughly 20 points over baseline initialization methods for several languages written in non-Latin scripts. Building on an analysis of "subword detokenization", in which models progressively merge fragmented subword tokens into larger subwords across layers, they propose FragMend to push the efficiency of interpretability-based expansion further, and validate it against strong baselines.
📝 Abstract
All languages are equal; when it comes to tokenization, some are more equal than others. Tokens are the hidden currency that dictates the cost and latency of access to contemporary LLMs. However, many languages written in non-Latin scripts suffer a poor exchange rate: LLMs need several times as many tokens to encode the same information in these languages as they do in English. Our analysis reveals that this issue, known as "token over-fragmentation", persists in modern open-weight LLMs. The standard remedy is vocabulary expansion, which adds target-language items missing from the model's vocabulary. In this work, we comprehensively study and advance interpretability-based vocabulary expansion, a new research direction. We focus on two core decisions in the vocabulary expansion process: which items should we add, and how should we initialize their corresponding input and output embeddings? First, we question the conventional use of frequency-based methods to choose candidate vocabulary items (a decision long treated as settled), and show that interpretability-based methods offer a superior trade-off between performance and token efficiency. Next, we strengthen the case for interpretability-based embedding initialization by showing large gains (~20 pts) over baseline initialization methods for several languages written in non-Latin scripts. We identify the phenomenon of "subword detokenization", where models progressively merge fragmented subword tokens into larger subwords across layers. Grounded in our analysis of this phenomenon, we propose FragMend to further push the efficiency ceiling of interpretability-based expansion. We validate the effectiveness of FragMend through comparison against strong baselines, and we present extensive analysis of its design choices.
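
To make the "exchange rate" concrete, the following is a minimal sketch of how token over-fragmentation and vocabulary expansion can be measured on a toy example. It is not FragMend and not the paper's selection or initialization method, neither of which the abstract specifies. The tokenizer (greedy longest match with UTF-8 byte fallback), the two-sentence "parallel corpus", the toy embedding table, and the helper names `tokenize` and `mean_init` are all assumptions made for illustration. The initialization shown is the common mean-of-subtokens baseline, standing in for the kind of baseline initialization the abstract compares against.

```python
from typing import Dict, List, Set


def tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Toy tokenizer (an assumption, not a real LLM tokenizer):
    greedy longest match against `vocab`, with UTF-8 byte fallback."""
    max_len = max((len(item) for item in vocab), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary item matches: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def mean_init(pieces: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Baseline initialization: average the embeddings of the old tokens
    that the new vocabulary item replaces."""
    dim = len(table[pieces[0]])
    return [sum(table[p][d] for p in pieces) / len(pieces) for d in range(dim)]


# Hypothetical English-centric vocabulary and a tiny parallel pair.
base_vocab = {"hello", "world", " "}
english = "hello world"
hindi = "नमस्ते दुनिया"

english_tokens = tokenize(english, base_vocab)
before = tokenize(hindi, base_vocab)
ratio_before = len(before) / len(english_tokens)

# Vocabulary expansion: add the missing target-language items.
new_items = {"नमस्ते", "दुनिया"}
expanded_vocab = base_vocab | new_items
after = tokenize(hindi, expanded_vocab)
ratio_after = len(after) / len(english_tokens)

# Toy 2-dimensional embedding table for the byte tokens (illustrative values only).
table = {
    tok: [int(tok[3:5], 16) / 255.0, 1.0]
    for tok in set(before)
    if tok.startswith("<0x")
}
new_vector = mean_init(tokenize("नमस्ते", base_vocab), table)

print(f"tokens per English token before expansion: {ratio_before:.2f}")
print(f"tokens per English token after expansion:  {ratio_after:.2f}")
print(f"mean-initialized embedding for a new item: {new_vector}")
```

In this toy setting every Devanagari character falls back to its UTF-8 bytes before expansion, so the Hindi sentence costs many more tokens than its English counterpart; after the two items are added, both sentences cost the same. Real tokenizers and real embedding matrices behave less cleanly, which is where the choice of items and their initialization starts to matter.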