🤖 AI Summary
To adapt pretrained language models to low-resource languages, where monolingual and parallel corpora are scarce, this paper proposes a dictionary-based cross-lingual vocabulary transfer method that requires neither kind of corpus. The core idea is to exploit, for the first time, the fallback mechanism inherent in Byte-Pair Encoding (BPE) tokenizers: removing a subword from the vocabulary causes the tokenizer to fall back to shorter subwords. Guided by a bilingual dictionary, the method effectively reverses BPE merge operations, iteratively removing target-language subwords and estimating their embeddings. This yields a lightweight way to construct subword vocabularies and corresponding embeddings for target languages without any raw text. Experiments across multiple low-resource languages show substantial improvements over existing vocabulary transfer approaches, validating bilingual dictionaries as an effective and widely available source of structured knowledge for vocabulary adaptation.
📝 Abstract
Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
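The fallback property the abstract relies on can be illustrated with a toy BPE tokenizer: deleting a subword's merge rule makes tokenization back off to the shorter pieces it was merged from. The embedding-averaging step below is an illustrative assumption to show how fallback enables embedding estimation, not the paper's exact estimator; all vocabulary items and vectors are made up.

```python
def bpe_tokenize(word, merges):
    """Greedy BPE: repeatedly apply the highest-priority merge rule
    (lowest rank) found among adjacent token pairs."""
    tokens = list(word)
    while True:
        candidates = [(merges[pair], i)
                      for i, pair in enumerate(zip(tokens, tokens[1:]))
                      if pair in merges]
        if not candidates:
            return tokens
        _, i = min(candidates)  # position of the best-ranked merge
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

# Toy merge rules with ranks (lower rank = applied first).
merges = {("l", "o"): 0, ("lo", "w"): 1}
print(bpe_tokenize("low", merges))   # ['low']

# Remove the subword "low" by deleting its merge rule: the tokenizer
# falls back to the shorter subwords it was built from.
reduced = {pair: rank for pair, rank in merges.items() if pair != ("lo", "w")}
print(bpe_tokenize("low", reduced))  # ['lo', 'w']

# Illustrative use of the fallback: estimate the embedding of the removed
# subword from the embeddings of its fallback pieces (here, their mean).
emb = {"lo": [1.0, 0.0], "w": [0.0, 1.0]}
pieces = bpe_tokenize("low", reduced)
estimate = [sum(dim) / len(pieces) for dim in zip(*(emb[t] for t in pieces))]
print(estimate)                      # [0.5, 0.5]
```

Iterating this step, i.e. removing one subword at a time and estimating its vector from the pieces it falls back to, is the mechanism the proposed method builds on.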