A Post-trainer's Guide to Multilingual Training Data: Uncovering Cross-lingual Transfer Dynamics

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the dynamic mechanisms of cross-lingual transfer (CLT) in large language models (up to 35B parameters) under realistic post-training scenarios, focusing on multilingual generation across summarization, instruction following, and mathematical reasoning. Method: We conduct systematic analysis under both single-task and multi-task instruction tuning regimes, employing controlled multilingual instruction data, cross-lingual performance attribution, and large-scale fine-tuning evaluation across the Qwen and LLaMA model families. Contribution/Results: We uncover, for the first time, that CLT exhibits nonlinear dependence on data mixing ratios, task complexity, and training paradigm combinations. We propose a reproducible efficacy criterion for CLT and identify optimal data proportioning and task-scheduling strategies that significantly enhance low-resource language performance—achieving a 27% absolute zero-shot cross-lingual accuracy gain in mathematical reasoning.
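The "controlled multilingual instruction data" in the summary refers to fine-tuning sets whose per-language proportions are fixed in advance. A minimal sketch of how such a controlled mixture might be sampled (illustrative only; `build_mixture`, the language codes, and the 75/25 ratio are assumptions, not the paper's actual pipeline):

```python
import random

def build_mixture(datasets, ratios, n_total, seed=0):
    """Sample a fixed-size instruction-tuning set with a controlled
    per-language ratio. `datasets` maps language code -> list of
    examples; `ratios` maps language code -> fraction of n_total."""
    rng = random.Random(seed)
    mixture = []
    for lang, ratio in ratios.items():
        k = round(n_total * ratio)  # examples drawn for this language
        mixture.extend(rng.sample(datasets[lang], k))
    rng.shuffle(mixture)  # interleave languages for training
    return mixture

# Hypothetical example: 75% English, 25% German instruction data
datasets = {"en": [f"en_{i}" for i in range(1000)],
            "de": [f"de_{i}" for i in range(1000)]}
mix = build_mixture(datasets, {"en": 0.75, "de": 0.25}, n_total=400)
```

Sweeping the ratio argument (e.g. 100/0, 75/25, 50/50) is what makes the nonlinear dependence on mixing ratios observable in the first place.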

📝 Abstract
In order for large language models to be useful across the globe, they are fine-tuned to follow instructions on multilingual data. Despite the ubiquity of such post-training, a clear understanding of the dynamics that enable cross-lingual transfer remains elusive. This study examines cross-lingual transfer (CLT) dynamics in realistic post-training settings. We study two model families of up to 35B parameters in size trained on carefully controlled mixtures of multilingual data on three generative tasks with varying levels of complexity (summarization, instruction following, and mathematical reasoning) in both single-task and multi-task instruction tuning settings. Overall, we find that the dynamics of cross-lingual transfer and multilingual performance cannot be explained by isolated variables; rather, they vary with the combination of post-training settings. Finally, we identify the conditions that lead to effective cross-lingual transfer in practice.
Problem

Research questions and friction points this paper is trying to address.

Understanding cross-lingual transfer dynamics in multilingual training
Analyzing performance across tasks and model sizes
Identifying conditions for effective cross-lingual transfer
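One way to operationalize "conditions for effective cross-lingual transfer" is to measure, per target language, the zero-shot accuracy gain from post-training and flag languages where the gain clears a threshold. A minimal sketch (the function name, the 0.05 threshold, and the language codes are hypothetical; the paper's actual efficacy criterion is not specified here):

```python
def clt_effective(results, threshold=0.05):
    """Flag target languages where zero-shot transfer is 'effective'.

    `results` maps language code -> (baseline_acc, post_training_acc),
    where accuracies are measured on held-out target-language data the
    model was never fine-tuned on. Returns (flags, gains)."""
    gains = {lang: after - before
             for lang, (before, after) in results.items()}
    flags = {lang: gain >= threshold for lang, gain in gains.items()}
    return flags, gains

# Hypothetical numbers: large gain on Swahili, negligible on German
results = {"sw": (0.12, 0.39), "de": (0.55, 0.58)}
flags, gains = clt_effective(results)
```

A threshold-based criterion like this is deliberately simple; the point is that "effective transfer" must be pinned to a reproducible measurement before mixing ratios or task schedules can be compared.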
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning on multilingual instruction data
Analyzing cross-lingual transfer dynamics
Controlled multilingual data mixtures