CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment

📅 2024-04-18
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
Multilingual proficiency remains a weak point for large language models: English-centric models typically underperform in other languages, especially those linguistically distant from English, largely because training data is imbalanced across languages during pre-training and instruction tuning. The paper proposes CrossIn, an instruction tuning approach built on a mixed composition of cross-lingual instruction data. By exploiting the compressed representation shared across languages, CrossIn improves task-solving capability and multilingual proficiency within a single tuning process. The authors also introduce a multi-task, multi-faceted benchmark for assessing cross-lingual knowledge alignment, and their experiments show substantial gains across tasks and languages, with analyses of cross-lingual data volume and translation-data integration indicating that both multilingual consistency and accuracy improve.
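
The page only describes the data-mixing idea at a high level. As a rough sketch of what "mixed composition of cross-lingual instruction tuning data" could look like in practice, the snippet below composes a toy training mix from monolingual, cross-lingual, and translation samples. The translate() helper, build_cross_lingual_mix() function, and the mixing ratios are all hypothetical placeholders, not the paper's actual pipeline.

```python
import random

# Hypothetical stand-in for a machine-translation call; it only tags text
# with the target language so the sketch stays self-contained.
def translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"

def build_cross_lingual_mix(english_pairs, target_langs,
                            cross_ratio=0.5, translation_ratio=0.2, seed=0):
    """Compose a mixed instruction-tuning set from English (instruction,
    response) pairs: some samples stay monolingual English, some become
    cross-lingual (instruction and response in different languages), and
    some become explicit translation tasks. Ratios are illustrative."""
    rng = random.Random(seed)
    mixed = []
    for instruction, response in english_pairs:
        r = rng.random()
        if r < cross_ratio:
            # Cross-lingual sample: pick two distinct languages for the
            # instruction and the response.
            src, tgt = rng.sample(["en"] + list(target_langs), 2)
            mixed.append((
                instruction if src == "en" else translate(instruction, src),
                response if tgt == "en" else translate(response, tgt),
            ))
        elif r < cross_ratio + translation_ratio:
            # Translation sample: the task itself is to translate the text.
            lang = rng.choice(list(target_langs))
            mixed.append((f"Translate to {lang}: {instruction}",
                          translate(instruction, lang)))
        else:
            # Monolingual English sample, kept unchanged.
            mixed.append((instruction, response))
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    data = [("Name the largest planet.", "Jupiter is the largest planet.")] * 5
    for pair in build_cross_lingual_mix(data, ["zh", "es", "de"]):
        print(pair)
```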

📝 Abstract
Multilingual proficiency presents a significant challenge for large language models (LLMs). English-centric models are usually suboptimal in other languages, particularly those that are linguistically distant from English. This performance discrepancy mainly stems from the imbalanced distribution of training data across languages during pre-training and instruction tuning stages. To address this problem, we propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction tuning data. Our method leverages the compressed representation shared by various languages to efficiently enhance the model's task-solving capabilities and multilingual proficiency within a single process. In addition, we introduce a multi-task and multi-faceted benchmark to evaluate the effectiveness of CrossIn. Experimental results demonstrate that our method substantially improves performance across tasks and languages, and we provide extensive insights into the impact of cross-lingual data volume and the integration of translation data on enhancing multilingual consistency and accuracy.
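
The abstract evaluates multilingual consistency alongside accuracy but does not spell out the scoring rule here. One plausible formulation is average pairwise agreement of a model's answers to the same questions posed in different languages, sketched below; the consistency() and normalize() helpers are illustrative assumptions, not the benchmark's actual metric.

```python
from itertools import combinations

def normalize(answer: str) -> str:
    # Crude normalization so string answers are comparable across runs.
    return answer.strip().lower()

def consistency(answers_by_lang: dict) -> float:
    """Average pairwise agreement: for each question, the fraction of
    language pairs whose normalized answers match, averaged over all
    questions. Assumes every language answers the same questions in
    the same order."""
    langs = list(answers_by_lang)
    pairs = list(combinations(langs, 2))
    n_questions = len(answers_by_lang[langs[0]])
    total = 0.0
    for i in range(n_questions):
        agree = sum(
            normalize(answers_by_lang[a][i]) == normalize(answers_by_lang[b][i])
            for a, b in pairs
        )
        total += agree / len(pairs)
    return total / n_questions

if __name__ == "__main__":
    runs = {
        "en": ["Jupiter", "Paris"],
        "zh": ["jupiter", "paris"],
        "de": ["Jupiter", "London"],
    }
    # Q1 agrees on all 3 pairs, Q2 on 1 of 3 -> (1.0 + 1/3) / 2 ≈ 0.67
    print(f"consistency = {consistency(runs):.2f}")
```
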
Problem

Research questions and friction points this paper is trying to address.

Multilingual Training
Data Imbalance
Knowledge Transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

CrossIn
Multilingual Training
Performance Enhancement