🤖 AI Summary
This work addresses the inefficiency of generic subword tokenizers in specialized domains or languages, where a mismatch between the fixed, general-purpose vocabulary and the target corpus reduces compression efficiency. The authors propose a post-training tokenizer adaptation strategy that, at a fixed vocabulary size, replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus, enabling lightweight vocabulary reconfiguration without retraining the tokenizer from scratch. The approach treats the tokenizer as a tunable component, improving domain adaptability. Experiments on generation and classification tasks across multiple languages show that the adapted tokenizers compress test corpora more effectively than baseline tokenizers at identical vocabulary sizes.
📝 Abstract
Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora more effectively than baselines using the same vocabulary size. This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks. Our code and data are available at https://github.com/vijini/Adapt-BPE.git.
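The core idea of the abstract — swap out low-utility tokens for ones that are frequent in the adaptation corpus, while keeping the vocabulary size fixed — can be illustrated with a minimal sketch. This is not the paper's exact algorithm (see the repository linked above for that); `adapt_vocab` is a hypothetical helper that simply ranks all candidate tokens by their frequency on the adaptation corpus and keeps the top ones, preferring already-in-vocabulary tokens on ties.

```python
from collections import Counter


def adapt_vocab(vocab, adaptation_counts, vocab_size):
    """Illustrative sketch, not the paper's exact method.

    vocab: the current token inventory (a set of strings).
    adaptation_counts: token -> frequency measured on the adaptation
        corpus (tokens absent from the corpus count as 0).
    vocab_size: target size of the adapted vocabulary.

    Candidates are the union of the current vocabulary and corpus
    tokens; we keep the `vocab_size` most frequent, so rare original
    tokens are replaced by frequent corpus tokens.
    """
    counts = Counter(adaptation_counts)
    candidates = set(vocab) | set(counts)
    # Sort by frequency (descending); break ties in favor of tokens
    # already in the vocabulary, then alphabetically for determinism.
    ranked = sorted(
        candidates,
        key=lambda t: (-counts[t], t not in vocab, t),
    )
    return set(ranked[:vocab_size])


# Toy example: "qx" and "zz" never occur in the adaptation corpus,
# so they are evicted in favor of the frequent "med" and "ical".
adapted = adapt_vocab(
    vocab={"the", "tion", "qx", "zz"},
    adaptation_counts={"the": 50, "tion": 30, "med": 20, "ical": 15},
    vocab_size=4,
)
```

In this toy run the adapted vocabulary becomes `{"the", "tion", "med", "ical"}`: the size stays fixed at four, but the two unused tokens have been replaced, which is the kind of lightweight "vocabulary fine-tuning" the abstract describes.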