🤖 AI Summary
To adapt pretrained language models to low-resource languages, where monolingual and parallel corpora are scarce, this paper proposes a dictionary-based cross-lingual vocabulary transfer method that requires neither kind of corpus. The core idea is to exploit, for the first time, the fallback mechanism inherent in Byte-Pair Encoding (BPE) tokenizers: removing a subword from the vocabulary causes the tokenizer to fall back to shorter subwords. Guided by a bilingual dictionary, the method effectively reverses BPE merge operations, iteratively removing target-language subwords and estimating their embeddings. This yields a lightweight way to construct subword vocabularies and corresponding embeddings for target languages without any raw text. Experiments across multiple low-resource languages show substantial improvements over existing vocabulary transfer approaches, validating bilingual dictionaries as an effective and widely available source of structured knowledge for vocabulary adaptation.
📝 Abstract
Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
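The fallback property the abstract relies on can be illustrated with a toy BPE tokenizer: deleting a subword's merge rule makes tokenization back off to the shorter pieces it was merged from. The embedding-averaging step below is an illustrative assumption to show how fallback enables embedding estimation, not the paper's exact estimator; all vocabulary items and vectors are made up.

```python
def bpe_tokenize(word, merges):
    """Greedy BPE: repeatedly apply the highest-priority merge rule
    (lowest rank) found among adjacent token pairs."""
    tokens = list(word)
    while True:
        candidates = [(merges[pair], i)
                      for i, pair in enumerate(zip(tokens, tokens[1:]))
                      if pair in merges]
        if not candidates:
            return tokens
        _, i = min(candidates)  # position of the best-ranked merge
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

# Toy merge rules with ranks (lower rank = applied first).
merges = {("l", "o"): 0, ("lo", "w"): 1}
print(bpe_tokenize("low", merges))   # ['low']

# Remove the subword "low" by deleting its merge rule: the tokenizer
# falls back to the shorter subwords it was built from.
reduced = {pair: rank for pair, rank in merges.items() if pair != ("lo", "w")}
print(bpe_tokenize("low", reduced))  # ['lo', 'w']

# Illustrative use of the fallback: estimate the embedding of the removed
# subword from the embeddings of its fallback pieces (here, their mean).
emb = {"lo": [1.0, 0.0], "w": [0.0, 1.0]}
pieces = bpe_tokenize("low", reduced)
estimate = [sum(dim) / len(pieces) for dim in zip(*(emb[t] for t in pieces))]
print(estimate)                      # [0.5, 0.5]
```

Iterating this step, i.e. removing one subword at a time and estimating its vector from the pieces it falls back to, is the mechanism the proposed method builds on.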