🤖 AI Summary
Large language models (LLMs) suffer from excessive inference steps and high computational cost in low-resource language generation due to English-centric tokenizers ill-suited for morphologically diverse or data-scarce languages. Method: This paper proposes an efficient vocabulary expansion method tailored to extremely low-resource settings, requiring only 0.01 GB of text (~30K sentences). It features (i) a cross-lingual mapping and clustering-driven embedding initialization strategy, (ii) lightweight continual pretraining, and (iii) a multilingual typological validation framework to ensure cross-linguistic generalizability. Contribution/Results: Experiments across diverse multilingual benchmarks, tasks, and model architectures demonstrate a 20–40% reduction in inference steps while maintaining downstream performance competitive with baselines. To our knowledge, this is the first systematic study of vocabulary expansion under extreme data scarcity, eliminating reliance on large monolingual corpora and establishing a new paradigm for deploying multilingual LLMs in resource-constrained environments.
📝 Abstract
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs for non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in speeding up inference, previous work on vocabulary expansion has focused on high-resource settings, assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this paper, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks, and models, we establish a set of strategies for performing vocabulary expansion that yield faster inference while maintaining downstream performance competitive with baselines, using only 30K sentences ($\sim$0.01 GB of text data) from the target language.
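
To make the idea of vocabulary expansion concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy longest-match tokenizer and a mean-initialization heuristic, in which a new token's embedding is the average of the embeddings of the source-vocabulary pieces it replaces. The abstract does not say which initialization methods were compared, so this heuristic, the helper names `greedy_tokenize` and `mean_init`, the example word, and the 2-dimensional vectors are all illustrative assumptions.

```python
from statistics import fmean


def greedy_tokenize(text, vocab):
    """Segment text by longest match first over a collection of token strings."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


def mean_init(new_token, embeddings):
    """Hypothetical helper: average the vectors of the pieces that the
    source vocabulary currently uses to spell out new_token."""
    pieces = greedy_tokenize(new_token, embeddings.keys())
    dim = len(next(iter(embeddings.values())))
    return [fmean(embeddings[p][d] for p in pieces) for d in range(dim)]


# Toy source vocabulary with made-up 2-dimensional embeddings.
embeddings = {
    "ki": [1.0, 0.0],
    "ta": [0.0, 1.0],
    "bu": [2.0, 2.0],
}

word = "kitabu"

# Before expansion, the word costs one generation step per piece.
steps_before = len(greedy_tokenize(word, embeddings.keys()))

# Expand the vocabulary with one target-language token and initialize it.
new_vec = mean_init(word, embeddings)
embeddings[word] = new_vec

# After expansion, the same word is a single token.
steps_after = len(greedy_tokenize(word, embeddings.keys()))

print(steps_before, steps_after, new_vec)
```

In a real model the new rows would be appended to the input and output embedding matrices and then refined by continual pre-training on the target-language text; the sketch only illustrates why adding target-language tokens reduces the number of generation steps.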