Training Bilingual LMs with Data Constraints in the Targeted Language

📅 2024-11-20
🏛️ arXiv.org
🤖 AI Summary
Large language models (LLMs) for low-resource target languages suffer from performance degradation due to insufficient monolingual pretraining data. Method: This paper proposes a dynamic upsampling strategy for auxiliary data based on linguistic relatedness, leveraging high-resource language (e.g., English) pretraining corpora to enhance target-language model performance—without modifying model architecture or training objectives. The approach integrates multilingual mixed pretraining, cross-lingual performance attribution analysis, and translation system evaluation to empirically validate the cross-lingual transferability of English data’s quality gains. Contribution/Results: Experiments demonstrate that, for genetically close target languages, incorporating only a small amount of target-language data alongside upsampled English data achieves performance comparable to full monolingual training. This significantly alleviates the model scaling bottleneck under strict monolingual data constraints and establishes an efficient, practical paradigm for LLM pretraining in low-resource language settings.
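The upsampling strategy described above can be sketched as a sampling-weight computation over the two corpora: the auxiliary (English) corpus is counted more than once when forming the mixture, so its documents are drawn proportionally more often. The function names, the fixed `aux_upsample` factor, and the token-count weighting below are illustrative assumptions for a minimal sketch, not the paper's actual implementation.

```python
import random

def mixture_weights(n_target_tokens, n_aux_tokens, aux_upsample=2.0):
    """Sampling weights proportional to corpus size, with the auxiliary
    (e.g. English) corpus counted aux_upsample times.

    Returns (p_target, p_aux), which sum to 1.
    """
    w_target = n_target_tokens
    w_aux = n_aux_tokens * aux_upsample
    z = w_target + w_aux
    return w_target / z, w_aux / z

def sample_batch(target_docs, aux_docs, p_target, batch_size=8, seed=0):
    """Draw a pretraining batch: each slot picks the target-language pool
    with probability p_target, otherwise the upsampled auxiliary pool."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        pool = target_docs if rng.random() < p_target else aux_docs
        batch.append(rng.choice(pool))
    return batch
```

For equal-sized corpora with `aux_upsample=3.0`, `mixture_weights(100, 100, 3.0)` yields `(0.25, 0.75)`: three quarters of sampled documents come from the auxiliary corpus, matching the idea of relying on a small amount of target-language data alongside heavily upsampled English data.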

📝 Abstract
Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high-quality pretraining data is unavailable. In this work, we study how to boost pretrained model performance in a target language with insufficient pretraining data for training a high-performing language model, by enlisting data from an auxiliary language for which high-quality data is available. We study this by quantifying the performance gap between training with data in a data-rich auxiliary language compared with training in the target language, exploring the benefits of translation systems, studying the limitations of model scaling when data is limited in the target languages, and proposing new methods for upsampling data from the auxiliary language. Our results show that stronger auxiliary datasets result in performance gains without modification to the model or training objective for close languages, and, in particular, that performance gains due to the development of more information-rich English pretraining datasets can extend to targeted language settings with limited data.
Problem

Research questions and friction points this paper is trying to address.

Improving bilingual LM performance with scarce data
Utilizing auxiliary language data for target language enhancement
Addressing data limitations in non-English language model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic upsampling of auxiliary-language data guided by linguistic relatedness
Leverages high-quality English pretraining corpora to boost target-language performance without modifying the model architecture or training objective
Quantifies the auxiliary-versus-target performance gap, including translation-system baselines and model-scaling limits under data constraints