🤖 AI Summary
Whether subword-level lexical overlap facilitates or interferes with cross-lingual transfer in multilingual models remains contested.
Method: We propose a framework that disentangles the effects of lexical overlap from semantic similarity, using controllable subword tokenization to systematically vary both the degree of overlap and the semantic similarity of shared tokens in bilingual autoregressive models. Experiments are conducted on multilingual understanding benchmarks, including XNLI and XQuAD.
Contribution/Results: We find that lexical overlap significantly enhances cross-lingual transfer performance, with gains generally increasing with the degree of overlap. Crucially, semantically similar shared subwords, not mere surface-form overlap, are the key driver for establishing a unified cross-lingual semantic space. This work provides empirical, semantics-grounded evidence for the constructive role of lexical overlap, offering both theoretical grounding and practical guidance for shared vocabulary design in multilingual pretraining.
📝 Abstract
Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? Prior work offers mixed evidence, partly due to varied setups and confounders, such as token frequency or subword segmentation granularity. To address this question, we devise a controlled experiment where we train bilingual autoregressive models on multiple language pairs under systematically varied vocabulary overlap settings. Crucially, we explore a new dimension to understanding how overlap affects transfer: the semantic similarity of tokens shared across languages. We first analyze our models' hidden representations and find that overlap of any kind creates embedding spaces that capture cross-lingual semantic relationships, while this effect is much weaker in models with disjoint vocabularies. On XNLI and XQuAD, we find that models with overlap outperform models with disjoint vocabularies, and that transfer performance generally improves as overlap increases. Overall, our findings highlight the advantages of token overlap in multilingual models and show that substantial shared vocabulary remains a beneficial design choice for multilingual tokenizers.
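The representation analysis described above can be illustrated with a standard probe (a hedged sketch, not the paper's code; the toy embeddings and `retrieval_accuracy` helper are invented for illustration): if an embedding space captures cross-lingual semantic relationships, each word's nearest neighbor in the other language, by cosine similarity, should be its translation.

```python
# Probe cross-lingual alignment: for each source word, retrieve the nearest
# target-language embedding by cosine similarity and check it is the gold
# translation. Higher accuracy = better-aligned semantic space.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieval_accuracy(src_emb, tgt_emb, gold_pairs):
    """Fraction of source words whose nearest target embedding is its translation."""
    hits = 0
    for src_word, gold_target in gold_pairs:
        best = max(tgt_emb, key=lambda t: cosine(src_emb[src_word], tgt_emb[t]))
        hits += best == gold_target
    return hits / len(gold_pairs)

# Toy 2-d embeddings: a well-aligned space places translations close together.
src = {"dog": [1.0, 0.1], "cat": [0.1, 1.0]}
tgt = {"hund": [0.9, 0.2], "katze": [0.2, 0.9]}
print(retrieval_accuracy(src, tgt, [("dog", "hund"), ("cat", "katze")]))  # → 1.0
```

Under the paper's finding, models trained with token overlap would score high on such a probe, while models with disjoint vocabularies would score much lower.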