Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

📅 2026-04-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
Multilingual large language models struggle with cross-lingual tasks because of imbalanced pre-training data across high-resource and low-resource languages and because of monolingual bias in pre-training. To mitigate this, the study introduces a Cross-Lingual Mapping Task during pre-training that maps languages bi-directionally within the LLM embedding space. According to the authors, this improves cross-lingual comprehension and generation without compromising monolingual fluency. They also propose a Language Alignment Coefficient that quantifies cross-lingual consistency and remains robust in limited-data scenarios. Reported gains over strong multilingual baselines are up to 11.9 BLEU points in machine translation, 6.72 points in BERTScore-Precision for cross-lingual question answering, and more than 5% in cross-lingual natural language understanding accuracy.

📝 Abstract
Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages, as well as monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, can improve cross-lingual performance, but they often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task during the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM embedding space, improving both language generation and comprehension. We further propose a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves gains of up to 11.9 BLEU points in MT, 6.72 points in CLQA BERTScore-Precision, and more than 5% in CLNLU accuracy over strong multilingual baselines. These findings highlight the potential of incorporating cross-lingual objectives into pre-training to improve multilingual LLMs.
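One way to picture the bi-directional mapping described above is that each parallel sentence pair yields two training examples, one per direction. The abstract does not specify the data format or the loss, so the sketch below is only an illustration under stated assumptions: the `MappingExample` type, the `make_bidirectional_examples` helper, and the bracketed prompt template are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingExample:
    """One directional mapping example (hypothetical structure)."""
    source_lang: str
    target_lang: str
    prompt: str   # text the model would be conditioned on
    target: str   # text the model would be trained to produce


def make_bidirectional_examples(lang_a, text_a, lang_b, text_b):
    """Turn one parallel pair into two examples, A->B and B->A.

    The prompt template is an assumption made for illustration only;
    the paper's actual formatting is not given in the abstract.
    """
    def one_direction(src_lang, src_text, tgt_lang, tgt_text):
        prompt = f"[{src_lang}] {src_text} => [{tgt_lang}]"
        return MappingExample(src_lang, tgt_lang, prompt, tgt_text)

    return [
        one_direction(lang_a, text_a, lang_b, text_b),
        one_direction(lang_b, text_b, lang_a, text_a),
    ]


if __name__ == "__main__":
    for example in make_bidirectional_examples("en", "Hello", "fr", "Bonjour"):
        print(example.prompt, "->", example.target)
```

Emitting both directions from the same pair means neither language is treated only as a source or only as a target, which is the property the word "bi-directionally" suggests.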

Problem

Research questions and friction points this paper is trying to address.

cross-lingual
multilingual LLMs
data imbalance
monolingual bias
pre-training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Lingual Mapping
Pre-Training
Language Alignment Coefficient
Multilingual LLMs
Embedding Space Alignment