How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?

📅 2024-06-17

📈 Citations: 16

✨ Influential: 1

🤖 AI Summary

Large language models (LLMs) suffer from excessive inference steps and high computational cost in low-resource language generation due to English-centric tokenizers ill-suited for morphologically diverse or data-scarce languages. Method: This paper proposes an efficient vocabulary expansion method tailored to extremely low-resource settings—requiring only 0.01 GB of text (~30K sentences)—featuring (i) a cross-lingual mapping and clustering-driven embedding initialization strategy, (ii) lightweight continual pretraining, and (iii) a multilingual typological validation framework to ensure cross-linguistic generalizability. Contribution/Results: Experiments across diverse multilingual benchmarks, tasks, and model architectures demonstrate a 20–40% reduction in inference steps while matching the downstream performance of high-resource baselines. To our knowledge, this is the first systematic study of vocabulary expansion under extreme data scarcity, eliminating reliance on large monolingual corpora and establishing a new paradigm for deploying multilingual LLMs in resource-constrained environments.
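As a rough illustration of what a reduction in inference steps means: in autoregressive decoding each generated token costs one forward pass, so the number of steps for a text equals its token count. The sketch below compares per-sentence token counts under the original and the expanded tokenizer. The function name `step_reduction` and all counts are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def step_reduction(base_counts: Sequence[int], expanded_counts: Sequence[int]) -> float:
    """Fraction of decoding steps saved when the same sentences are
    tokenized with the expanded vocabulary instead of the original one."""
    if len(base_counts) != len(expanded_counts):
        raise ValueError("both lists must cover the same sentences")
    base_total = sum(base_counts)
    if base_total <= 0:
        raise ValueError("base token count must be positive")
    return 1.0 - sum(expanded_counts) / base_total


# Made-up token counts for three sentences (illustrative only).
base = [12, 20, 8]
expanded = [8, 13, 7]
reduction = step_reduction(base, expanded)
print(f"{reduction:.0%} fewer inference steps")
```

Aggregating over the whole evaluation set, as done here, weights long sentences more heavily than averaging per-sentence ratios would; either choice is defensible, but it should be stated when reporting the number.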

📝 Abstract

Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, previous work on vocabulary expansion has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this paper, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks and models, we establish a set of strategies to perform vocabulary expansion for faster inference, maintaining competitive downstream performance to baselines with only 30K sentences ($\sim$0.01GB text data) from the target language.
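The abstract does not say which embedding initialization methods are compared. One common baseline, shown here purely as an assumption for illustration, is to initialize each new token's embedding as the mean of the embeddings of the pieces the original tokenizer would split it into. The toy vocabulary, the greedy segmenter, and every function name below are hypothetical stand-ins, not the paper's implementation.

```python
from typing import Dict, List, Tuple


def segment(text: str, vocab: Dict[str, int]) -> List[int]:
    """Greedy longest-match segmentation with a toy vocabulary
    (a stand-in for the model's real tokenizer)."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {text!r} at position {i}")
    return ids


def expand_vocab(
    new_tokens: List[str],
    vocab: Dict[str, int],
    embeddings: List[List[float]],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Append new tokens; each new embedding is the mean of the
    embeddings of its pieces under the ORIGINAL vocabulary."""
    new_vocab = dict(vocab)
    new_emb = [row[:] for row in embeddings]
    dim = len(embeddings[0])
    for tok in new_tokens:
        if tok in new_vocab:
            continue  # already present, nothing to add
        piece_ids = segment(tok, vocab)
        mean_vec = [
            sum(embeddings[p][d] for p in piece_ids) / len(piece_ids)
            for d in range(dim)
        ]
        new_vocab[tok] = len(new_emb)
        new_emb.append(mean_vec)
    return new_vocab, new_emb


# Toy data (illustrative only): four pieces with 2-dimensional embeddings.
vocab = {"k": 0, "a": 1, "t": 2, "ka": 3}
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]

new_vocab, new_emb = expand_vocab(["kata"], vocab, emb)
steps_before = len(segment("kata", vocab))      # pieces: "ka", "t", "a"
steps_after = len(segment("kata", new_vocab))   # single new token
```

In this toy case the word drops from three tokens to one, which is the mechanism behind the inference speedup. In practice the new embeddings would then be refined by continual pre-training on the target-language text.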

Problem

Research questions and friction points this paper is trying to address.

Expanding LLM vocabulary efficiently with minimal target language data

Addressing slow non-English inference from English-centric tokenizers

Maintaining performance while accelerating multilingual text generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vocabulary expansion with minimal target language data

Embedding initialization methods for low-resource settings

Continual pre-training strategies for multilingual adaptation

🔎 Similar Papers

Large Vocabulary Size Improves Large Language Models