🤖 AI Summary
This work systematically investigates cross-lingual interference in Transformer encoders in multilingual settings. We construct a fine-grained cross-lingual interference matrix spanning 83 languages, quantifying the asymmetry of performance transfer between language pairs. Contrary to expectations, interference patterns correlate only weakly with conventional linguistic proxies such as language family or embedding similarity, and are instead strongly governed by writing system (script). Moreover, the interference matrix effectively predicts downstream task performance. Methodologically, we train and evaluate a large number of lightweight BERT-like models on all language-pair combinations and conduct a multi-dimensional linguistic correlation analysis. Our study is the first to reveal the fundamental script dependence of cross-lingual interference, providing interpretable, empirically grounded guidance for multilingual model architecture design, pretraining strategy optimization, and principled language selection.
📝 Abstract
In this paper, we present a comprehensive study of language interference in encoder-only Transformer models across 83 languages. We construct an interference matrix by training and evaluating small BERT-like models on all possible language pairs, providing a large-scale quantification of cross-lingual interference. Our analysis reveals that interference between languages is asymmetric and that its patterns align neither with traditional linguistic characteristics, such as language family, nor with proxies like embedding similarity, but are instead better explained by script. Finally, we demonstrate that the interference matrix effectively predicts performance on downstream tasks, making it a practical tool for designing multilingual models that achieve optimal performance.
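To make the notion of an asymmetric interference matrix concrete, the sketch below shows one plausible way such scores could be computed; this is an illustration, not the paper's actual implementation. The helpers `train_eval_monolingual` and `train_eval_bilingual` are hypothetical callbacks assumed to train a small BERT-like model on the given language(s) and return an evaluation loss for a target language.

```python
from itertools import permutations

def interference_matrix(languages, train_eval_monolingual, train_eval_bilingual):
    """Build an asymmetric interference matrix over ordered language pairs.

    interference[a][b] > 0 means co-training with language `b` hurts
    performance on language `a` relative to a monolingual baseline;
    negative values indicate positive transfer.
    """
    # Monolingual baselines: evaluation loss of a model trained only on `lang`.
    mono_loss = {lang: train_eval_monolingual(lang) for lang in languages}

    interference = {a: {} for a in languages}
    for a, b in permutations(languages, 2):
        # Loss on `a` after joint training on the ordered pair (a, b).
        joint_loss = train_eval_bilingual(a, b, eval_lang=a)
        # Relative degradation of `a` caused by adding `b` to the training mix.
        interference[a][b] = (joint_loss - mono_loss[a]) / mono_loss[a]
    return interference
```

Because the score is computed separately for each ordered pair, `interference[a][b]` need not equal `interference[b][a]`, which is exactly the asymmetry the abstract refers to.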