🤖 AI Summary
Whether subword-level lexical overlap facilitates or interferes with cross-lingual transfer in multilingual models remains contested.
Method: We propose a framework that disentangles the effects of lexical overlap from semantic similarity, using controllable subword tokenization to systematically vary both the degree of overlap and the semantic similarity of shared tokens in bilingual autoregressive models. Experiments are conducted on multilingual understanding benchmarks, including XNLI and XQuAD.
Contribution/Results: We find that lexical overlap significantly enhances cross-lingual transfer performance, with gains generally increasing with the degree of overlap. Crucially, semantically similar shared subwords, not mere surface-form overlap, are the key driver for establishing a unified cross-lingual semantic space. This work provides empirical, semantics-grounded evidence for the constructive role of lexical overlap, offering both theoretical grounding and practical guidance for shared vocabulary design in multilingual pretraining.
📝 Abstract
Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? Prior work offers mixed evidence, partly due to varied setups and confounders, such as token frequency or subword segmentation granularity. To address this question, we devise a controlled experiment where we train bilingual autoregressive models on multiple language pairs under systematically varied vocabulary overlap settings. Crucially, we explore a new dimension to understanding how overlap affects transfer: the semantic similarity of tokens shared across languages. We first analyze our models' hidden representations and find that overlap of any kind creates embedding spaces that capture cross-lingual semantic relationships, while this effect is much weaker in models with disjoint vocabularies. On XNLI and XQuAD, we find that models with overlap outperform models with disjoint vocabularies, and that transfer performance generally improves as overlap increases. Overall, our findings highlight the advantages of token overlap in multilingual models and show that substantial shared vocabulary remains a beneficial design choice for multilingual tokenizers.
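The representation analysis described above can be illustrated with a standard probe (a hedged sketch, not the paper's code; the toy embeddings and `retrieval_accuracy` helper are invented for illustration): if an embedding space captures cross-lingual semantic relationships, each word's nearest neighbor in the other language, by cosine similarity, should be its translation.

```python
# Probe cross-lingual alignment: for each source word, retrieve the nearest
# target-language embedding by cosine similarity and check it is the gold
# translation. Higher accuracy = better-aligned semantic space.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieval_accuracy(src_emb, tgt_emb, gold_pairs):
    """Fraction of source words whose nearest target embedding is its translation."""
    hits = 0
    for src_word, gold_target in gold_pairs:
        best = max(tgt_emb, key=lambda t: cosine(src_emb[src_word], tgt_emb[t]))
        hits += best == gold_target
    return hits / len(gold_pairs)

# Toy 2-d embeddings: a well-aligned space places translations close together.
src = {"dog": [1.0, 0.1], "cat": [0.1, 1.0]}
tgt = {"hund": [0.9, 0.2], "katze": [0.2, 0.9]}
print(retrieval_accuracy(src, tgt, [("dog", "hund"), ("cat", "katze")]))  # → 1.0
```

Under the paper's finding, models trained with token overlap would score high on such a probe, while models with disjoint vocabularies would score much lower.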