Assessing the Role of Data Quality in Training Bilingual Language Models

📅 2025-06-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the root cause of cross-lingual performance inconsistency in bilingual/multilingual models, identifying imbalanced data quality—not merely data quantity—as the key factor behind the anomalous degradation of high-resource language performance (e.g., English). It is the first systematic study to reveal the dominant role of data quality in bilingual pretraining. The authors propose a lightweight, confidence- and consistency-based data quality scoring and filtering method leveraging language model outputs, and establish a comparative analytical framework for monolingual versus bilingual models. Crucially, high-quality English monolingual data alone suffices to effectively identify high-quality bilingual corpora. Empirical evaluation on French, German, and Chinese shows that monolingual performance improves by 2–4%, while inter-lingual performance gaps in bilingual models shrink to ≤1%, substantially alleviating multilingual performance imbalance.
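The summary describes a lightweight scoring scheme that combines a model's confidence with cross-model consistency, then filters the corpus by score. The sketch below is a hypothetical illustration of that idea, not the paper's actual implementation: it assumes per-token log-probabilities from two language models are already available, uses mean log-probability as the confidence signal, agreement between the two models as the consistency signal, and a simple top-fraction cutoff for filtering. The function names, the blending weight `alpha`, and the toy data are all invented for illustration.

```python
# Hypothetical sketch of confidence- and consistency-based data
# filtering; the paper's actual scoring details may differ.
from statistics import mean

def confidence(token_logprobs):
    """Confidence: mean token log-probability under a language model.
    Values closer to 0 mean the model finds the text more fluent."""
    return mean(token_logprobs)

def consistency(logprobs_a, logprobs_b):
    """Consistency: negative absolute gap between two models' mean
    log-probabilities; values near 0 mean the models agree."""
    return -abs(mean(logprobs_a) - mean(logprobs_b))

def quality_score(logprobs_a, logprobs_b, alpha=0.5):
    """Blend confidence and consistency into a single quality score
    (alpha is an assumed mixing weight, not from the paper)."""
    return alpha * confidence(logprobs_a) + (1 - alpha) * consistency(
        logprobs_a, logprobs_b
    )

def filter_corpus(docs, keep_ratio=0.5):
    """Keep the top-scoring fraction of documents.
    Each doc: (text, logprobs_model_a, logprobs_model_b)."""
    scored = sorted(docs, key=lambda d: quality_score(d[1], d[2]), reverse=True)
    kept = scored[: max(1, int(len(scored) * keep_ratio))]
    return [text for text, _, _ in kept]

# Toy per-token log-probs: a fluent document vs. a noisy one on
# which the two models also disagree.
corpus = [
    ("clean doc", [-1.2, -0.8, -1.0], [-1.1, -0.9, -1.0]),
    ("noisy doc", [-5.0, -6.2, -4.8], [-2.0, -7.5, -3.0]),
]
print(filter_corpus(corpus, keep_ratio=0.5))  # → ['clean doc']
```

In this toy setup the fluent document scores higher on both signals and survives the 50% cutoff, mirroring the paper's claim that high-quality English data alone can drive selection of higher-quality bilingual training data.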

📝 Abstract
Bilingual and multilingual language models offer a promising path toward scaling NLP systems across diverse languages and users. However, their performance often varies widely between languages: prior work shows that adding more languages can degrade performance for some (such as English) while improving others (typically more data-constrained languages). In this work, we investigate the causes of these inconsistencies by comparing bilingual and monolingual language models. Our analysis reveals that unequal data quality, not just data quantity, is a major driver of performance degradation in bilingual settings. We propose a simple yet effective data filtering strategy that selects higher-quality bilingual training data using only high-quality English data. Applied to French, German, and Chinese, our approach improves monolingual performance by 2-4% and reduces bilingual model performance gaps to 1%. These results highlight the overlooked importance of data quality in multilingual pretraining and offer a practical recipe for balancing performance.
Problem

Research questions and friction points this paper is trying to address.

Investigates performance inconsistencies in bilingual language models
Identifies unequal data quality as a key performance degradation factor
Proposes data filtering to improve bilingual model performance gaps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data filtering strategy improves bilingual training
Equal data quality reduces performance gaps
High-quality English data enhances monolingual results