The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the role of bilingual documents in the pretraining of multilingual large language models and their contribution to cross-lingual capabilities. By training models from scratch under controlled conditions—comparing standard corpora against strictly monolingual corpora with all multilingual documents removed—the authors quantitatively demonstrate for the first time that bilingual documents, though constituting only 2% of the training data, are critical for translation performance: their removal causes a 56% drop in BLEU score, while other cross-lingual tasks remain largely unaffected. Further analysis reveals that parallel texts, through token-level alignment, recover 91% of translation performance, whereas code-switched texts contribute minimally. These findings indicate that high-quality translation relies heavily on explicit alignment signals, whereas cross-lingual understanding can emerge without direct exposure to bilingual data.

📝 Abstract
Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, the details of its contribution remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing a standard web corpus with a monolingual-only version from which all multilingual documents have been removed. Although bilingual documents constitute only 2% of the corpus, removing them causes translation performance to drop by 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous (14%) documents based on the semantic relevance of content across languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.
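The taxonomy above (parallel vs. code-switching vs. miscellaneous, decided by the semantic relevance of content across languages) can be sketched as a simple classifier. This is an illustrative assumption, not the authors' actual pipeline: the segmentation scheme, the `similarity` function, and both thresholds are hypothetical placeholders.

```python
def classify_bilingual_doc(segments, similarity, align_threshold=0.8):
    """Classify a document by cross-language semantic relevance.

    segments: list of (lang, text) chunks of one document.
    similarity: fn(text, text) -> score in [0, 1]; in practice this could
    be a cross-lingual sentence-embedding similarity (an assumption here).
    """
    langs = {lang for lang, _ in segments}
    if len(langs) < 2:
        return "monolingual"

    # For each segment, find its best semantic match in another language.
    best_scores = []
    for lang_i, text_i in segments:
        best = 0.0
        for lang_j, text_j in segments:
            if lang_i != lang_j:
                best = max(best, similarity(text_i, text_j))
        best_scores.append(best)

    aligned_frac = sum(s >= align_threshold for s in best_scores) / len(best_scores)
    if aligned_frac >= 0.5:
        return "parallel"        # content in different languages is mutual translation
    if aligned_frac <= 0.1:
        return "code-switching"  # languages interleave without translated content
    return "miscellaneous"       # partial or ambiguous cross-language relevance
```

With a cross-lingual similarity model plugged in, documents labeled "parallel" would carry the token-level alignment signal the paper finds critical for translation, while "code-switching" documents would not.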
Problem

Research questions and friction points this paper is trying to address.

multilingual large language models
bilingual data
cross-lingual performance
parallel data
code-switching
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual pretraining
bilingual data
parallel corpora
code-switching
cross-lingual transfer