🤖 AI Summary
Current multilingual tokenizers suffer from misaligned cross-lingual vocabularies: semantically equivalent words, e.g., English "I eat rice" and Hausa "Ina cin shinkafa", are mapped to distinct embeddings, severely hindering cross-lingual transfer for low-resource languages. To address this, we propose a parallel tokenizer framework: first, monolingual tokenizers are trained independently; next, their vocabularies are aligned at the lexical level using bilingual dictionaries so that semantically equivalent subwords are shared consistently across languages; finally, a unified, frequency-balanced shared semantic space is constructed. This work systematically reformulates vocabulary design in cross-lingual pretraining while supporting end-to-end Transformer pretraining. Pretrained on 13 low-resource languages, our model significantly outperforms mBERT and XLM-R on downstream tasks, including sentiment analysis and hate speech detection, demonstrating that vocabulary alignment is fundamental to cross-lingual generalization.
📝 Abstract
Tokenization defines the foundation of multilingual language models by determining how words are represented and shared across languages. However, existing methods often fail to support effective cross-lingual transfer because semantically equivalent words are assigned distinct embeddings. For example, "I eat rice" in English and "Ina cin shinkafa" in Hausa are typically mapped to different vocabulary indices, preventing shared representations and limiting cross-lingual generalization. We introduce parallel tokenizers, a new framework that trains tokenizers monolingually and then aligns their vocabularies exhaustively using bilingual dictionaries or word-to-word translation, ensuring consistent indices for semantically equivalent words. This alignment enforces a shared semantic space across languages while naturally improving fertility balance. To assess their effectiveness, we pretrain a transformer encoder from scratch on thirteen low-resource languages and evaluate it on sentiment analysis, hate speech detection, emotion classification, and sentence embedding similarity. Across all tasks, models trained with parallel tokenizers outperform conventional multilingual baselines, confirming that rethinking tokenization is essential for advancing multilingual representation learning, especially in low-resource settings.
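The core alignment idea, assigning the same vocabulary index to translation-equivalent words so they share one embedding row, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `build_parallel_vocab` and the toy English-Hausa dictionary are hypothetical, and the real method operates over full tokenizer vocabularies with frequency balancing.

```python
def build_parallel_vocab(src_vocab, tgt_vocab, src_to_tgt):
    """Map two monolingual vocabularies to shared indices.

    Words linked by the bilingual dictionary receive the SAME index
    (hence the same embedding row); unmatched words get fresh indices.
    """
    src_index, tgt_index = {}, {}
    next_id = 0
    for src_word in src_vocab:
        src_index[src_word] = next_id
        tgt_word = src_to_tgt.get(src_word)
        if tgt_word in tgt_vocab:
            # Aligned pair: identical index in both tokenizers.
            tgt_index[tgt_word] = next_id
        next_id += 1
    for tgt_word in tgt_vocab:
        if tgt_word not in tgt_index:
            # Unaligned target word: gets its own index.
            tgt_index[tgt_word] = next_id
            next_id += 1
    return src_index, tgt_index


# Toy example from the abstract: "I eat rice" / "Ina cin shinkafa".
en_idx, ha_idx = build_parallel_vocab(
    ["I", "eat", "rice"],
    ["Ina", "cin", "shinkafa"],
    {"I": "Ina", "eat": "cin", "rice": "shinkafa"},
)
# Translation pairs now share indices, e.g. en_idx["rice"] == ha_idx["shinkafa"].
```

With both tokenizers emitting the same index for a translation pair, the encoder sees identical input IDs for semantically equivalent sentences, which is what enables the shared semantic space described above.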