AI Summary
This work investigates the role of code-switching in the pretraining of multilingual large language models (MLLMs) for enhancing cross-lingual alignment and generalization. To address the lack of systematic understanding, we propose the first taxonomy of code-switching types and quantitatively characterize their language-transfer effects. We further design a proportionally scaled synthetic code-switching data generation strategy, enabling scalable and controllable pretraining augmentation. Through corpus analysis, representation-space evaluation, and ablation studies across multilingual benchmarks, we demonstrate that our approach significantly improves alignment for low-resource languages while preserving performance on medium- and high-resource languages, and that it remains robust across pretraining corpora of varying quality. Our core contributions are twofold: (1) revealing natural code-switching as a critical carrier of cross-lingual capability, and (2) establishing the first synthetic code-switching paradigm explicitly optimized for MLLM pretraining.
Abstract
Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in their pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the presence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We analyze code-switching in the pre-training corpus, measuring its prevalence and categorizing it into four types across two quadrants, and then assess its impact on multilingual performance. These code-switching types occur in unbalanced proportions and differ in how effectively they facilitate language transfer. To further exploit the power of code-switching for language alignment during pre-training, we investigate a synthetic code-switching strategy. As we progressively scale up the synthetic code-switching data, we observe remarkable improvements in both benchmark performance and the representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high-, medium-, and low-resource languages with pre-training corpora of varying qualities.
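To make the idea of synthetic code-switching concrete, here is a minimal, hypothetical sketch (not the paper's exact pipeline): a common way to synthesize code-switched text is dictionary-based lexical substitution, where a controllable fraction of source-language words is replaced with translations from a bilingual dictionary. The dictionary, function name, and `switch_ratio` parameter below are illustrative assumptions.

```python
import random

# Toy English -> Spanish dictionary; a real pipeline would use a large
# bilingual lexicon or an MT system (assumption, for illustration only).
TOY_DICT = {"cat": "gato", "sat": "se sentó", "mat": "alfombra"}

def synthesize_code_switch(sentence, bilingual_dict, switch_ratio=0.5, seed=0):
    """Replace roughly `switch_ratio` of translatable words with translations,
    producing an intra-sentential code-switched sentence."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    out = []
    for tok in sentence.split():
        key = tok.lower().strip(".,")
        if key in bilingual_dict and rng.random() < switch_ratio:
            out.append(bilingual_dict[key])
        else:
            out.append(tok)
    return " ".join(out)

# With switch_ratio=1.0 every dictionary word is replaced:
print(synthesize_code_switch("the cat sat on the mat", TOY_DICT, switch_ratio=1.0))
# -> the gato se sentó on the alfombra
```

Varying `switch_ratio` gives the kind of controllable scaling of code-switched data the abstract describes, since the proportion of mixed-language tokens in the corpus can be tuned directly.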