Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Adapting pretrained language models to low-resource languages is difficult because monolingual and parallel corpora are scarce. This paper proposes a cross-lingual vocabulary transfer method that relies on bilingual dictionaries instead of corpora. The method exploits the fallback behavior of Byte-Pair Encoding (BPE) tokenizers: when a subword is removed from the vocabulary, words that contained it are segmented into shorter subwords. Target subwords are removed from the tokenizer progressively, and their embeddings are estimated iteratively along the way. In experiments on low-resource languages, the approach outperforms existing vocabulary transfer methods, which supports bilingual dictionaries as an effective resource for vocabulary adaptation when raw text is limited.

📝 Abstract
Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
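The fallback property mentioned in the abstract can be illustrated with a toy BPE tokenizer. This is a minimal sketch, not the authors' code: the merge table, the word "lower", and the helper names `bpe_tokenize` and `remove_subword` are all hypothetical and chosen only for illustration.

```python
def bpe_tokenize(word, merges):
    """Segment `word` by applying BPE merges in priority order (lowest rank first)."""
    symbols = list(word)
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge is left
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def remove_subword(merges, subword):
    """Drop every merge that produces `subword` or uses it as an input (toy assumption)."""
    return [(a, b) for a, b in merges if a + b != subword and subword not in (a, b)]


# Hypothetical merge table, highest priority first.
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

full = bpe_tokenize("lower", merges)
without_lower = remove_subword(merges, "lower")
reduced = bpe_tokenize("lower", without_lower)
shorter = bpe_tokenize("lower", remove_subword(without_lower, "low"))

print(full)     # ['lower']
print(reduced)  # ['low', 'er']
print(shorter)  # ['lo', 'w', 'er']
```

Each removal leaves the word tokenizable, only with shorter pieces, which is the behavior the method builds on.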
Problem

Research questions and friction points this paper is trying to address.

Adapting pre-trained language models to low-resource languages
Existing vocabulary transfer methods depend on monolingual or parallel corpora, which are scarce for these languages
Estimating target subword embeddings when only a bilingual dictionary is available
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses bilingual dictionaries for vocabulary transfer
Leverages BPE tokenizer fallback property
Iteratively estimates target subword embeddings
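The points above can be sketched as a remove-and-re-tokenize loop. The summary does not specify the estimator, so everything here is an assumption for illustration: a greedy longest-match segmenter stands in for a real BPE tokenizer, a subword's embedding is taken as the mean of the source embeddings of the dictionary translations whose target word contains it, and the dictionary, vectors, and function names are made up.

```python
def tokenize(word, vocab):
    """Greedy longest-match segmentation (stand-in for BPE; single characters always allowed)."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def mean(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def estimate_embeddings(dictionary, source_emb, vocab):
    """Estimate one embedding per target subword, removing each subword after use.

    Assumed estimator: average the source embeddings of all dictionary entries
    whose target word currently tokenizes to a sequence containing the subword.
    """
    vocab = set(vocab)
    estimated = {}
    # Longest subwords first, so removing one exposes its shorter pieces.
    for subword in sorted(vocab, key=lambda s: (-len(s), s)):
        hits = [source_emb[src] for tgt, src in dictionary
                if src in source_emb and subword in tokenize(tgt, vocab)]
        if hits:
            estimated[subword] = mean(hits)
        vocab.discard(subword)  # later words fall back to shorter subwords
    return estimated


# Toy (target word, source word) dictionary and 2-d source embeddings.
dictionary = [("kato", "cat"), ("katoj", "cats"), ("hundo", "dog")]
source_emb = {"cat": [1.0, 0.0], "cats": [1.0, 1.0], "dog": [0.0, 1.0]}
target_vocab = {"kato", "kat", "hund", "o", "j"}

estimated = estimate_embeddings(dictionary, source_emb, target_vocab)
print(estimated["kato"])  # [1.0, 0.5]
print(estimated["kat"])   # [1.0, 0.5]
```

In this toy run, "kat" only receives an estimate after "kato" has been removed, because only then do the dictionary words fall back to it.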