AI Summary
This work investigates the role of code-switching in the pretraining of multilingual large language models (MLLMs) for enhancing cross-lingual alignment and generalization. To address the lack of systematic understanding, we propose the first taxonomy of code-switching types and quantitatively characterize their language-transfer effects. We further design a proportionally scaled synthetic code-switching data generation strategy, enabling scalable and controllable pretraining augmentation. Through corpus analysis, representation-space evaluation, and ablation studies across multilingual benchmarks, we demonstrate that our approach significantly improves alignment for low-resource languages while preserving performance on medium- and high-resource languages, and that it remains robust across pretraining corpora of varying quality. Our core contributions are twofold: (1) revealing natural code-switching as a critical carrier of cross-lingual capability, and (2) establishing the first synthetic code-switching paradigm explicitly optimized for MLLM pretraining.
Abstract
Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in their pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the presence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We analyze code-switching in the pre-training corpus, measuring its prevalence and categorizing it into four types across two quadrants, and then assess its impact on multilingual performance. These code-switching types occur in unbalanced proportions and differ in how effectively they facilitate language transfer. To further exploit the power of code-switching for language alignment during pre-training, we investigate a synthetic code-switching strategy. As we progressively scale up the synthetic code-switching data, we observe remarkable improvements in both benchmark performance and the representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high-, medium-, and low-resource languages with pre-training corpora of varying qualities.
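To make the idea of synthetic code-switching concrete, here is a minimal, hypothetical sketch (not the paper's exact pipeline): a common way to synthesize code-switched text is dictionary-based lexical substitution, where a controllable fraction of source-language words is replaced with translations from a bilingual dictionary. The dictionary, function name, and `switch_ratio` parameter below are illustrative assumptions.

```python
import random

# Toy English -> Spanish dictionary; a real pipeline would use a large
# bilingual lexicon or an MT system (assumption, for illustration only).
TOY_DICT = {"cat": "gato", "sat": "se sentó", "mat": "alfombra"}

def synthesize_code_switch(sentence, bilingual_dict, switch_ratio=0.5, seed=0):
    """Replace roughly `switch_ratio` of translatable words with translations,
    producing an intra-sentential code-switched sentence."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    out = []
    for tok in sentence.split():
        key = tok.lower().strip(".,")
        if key in bilingual_dict and rng.random() < switch_ratio:
            out.append(bilingual_dict[key])
        else:
            out.append(tok)
    return " ".join(out)

# With switch_ratio=1.0 every dictionary word is replaced:
print(synthesize_code_switch("the cat sat on the mat", TOY_DICT, switch_ratio=1.0))
# -> the gato se sentó on the alfombra
```

Varying `switch_ratio` gives the kind of controllable scaling of code-switched data the abstract describes, since the proportion of mixed-language tokens in the corpus can be tuned directly.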