False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The role of subword-level lexical overlap in cross-lingual transfer within multilingual models remains contested, with unresolved debates regarding its facilitative versus interfering effects. Method: We propose a semantic-similarity-based lexical overlap disentanglement framework, employing controllable subword tokenization to systematically modulate both the degree of lexical overlap and semantic consistency in bilingual autoregressive models. Experiments are conducted on multilingual understanding benchmarks including XNLI and XQuAD. Contribution/Results: We find that lexical overlap significantly enhances cross-lingual transfer performance, with gains monotonically increasing with overlap degree. Crucially, semantically similar shared subwords—not mere surface-form overlap—serve as the key driver for establishing a unified cross-lingual semantic space. This work provides the first empirical, semantics-grounded evidence of the constructive role of lexical overlap, offering both theoretical foundations and practical guidance for shared vocabulary design in multilingual pretraining.

Technology Category

Application Category

📝 Abstract
Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? Prior work offers mixed evidence, partly due to varied setups and confounders, such as token frequency or subword segmentation granularity. To address this question, we devise a controlled experiment where we train bilingual autoregressive models on multiple language pairs under systematically varied vocabulary overlap settings. Crucially, we explore a new dimension to understanding how overlap affects transfer: the semantic similarity of tokens shared across languages. We first analyze our models' hidden representations and find that overlap of any kind creates embedding spaces that capture cross-lingual semantic relationships, while this effect is much weaker in models with disjoint vocabularies. On XNLI and XQuAD, we find that models with overlap outperform models with disjoint vocabularies, and that transfer performance generally improves as overlap increases. Overall, our findings highlight the advantages of token overlap in multilingual models and show that substantial shared vocabulary remains a beneficial design choice for multilingual tokenizers.
Problem

Research questions and friction points this paper is trying to address.

Investigating whether token overlap facilitates cross-lingual transfer or causes interference
Exploring how semantic similarity of shared tokens affects cross-lingual relationships
Evaluating vocabulary overlap impact on multilingual model performance across tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled bilingual experiments with varied vocabulary overlap
Analyzing semantic similarity of shared tokens across languages
Overlapping tokens create cross-lingual semantic embedding spaces
🔎 Similar Papers
No similar papers found.