🤖 AI Summary
This work investigates the root cause of cross-lingual performance inconsistency in bilingual/multilingual models, identifying imbalanced data quality—not merely data quantity—as the key factor behind the anomalous degradation of high-resource language performance (e.g., English). It is the first systematic study to reveal the dominant role of data quality in bilingual pretraining. The authors propose a lightweight, confidence- and consistency-based data quality scoring and filtering method leveraging language model outputs, and establish a comparative analytical framework for monolingual versus bilingual models. Crucially, high-quality English monolingual data alone suffices to effectively identify high-quality bilingual corpora. Empirical evaluation on French, German, and Chinese shows that monolingual performance improves by 2–4%, while inter-lingual performance gaps in bilingual models shrink to ≤1%, substantially alleviating multilingual performance imbalance.
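The paper's exact scoring formula is not given in this summary, but the idea of ranking candidate bilingual documents by a language model's confidence and the consistency of that confidence across scorers can be sketched as follows. Everything here is illustrative: the function names, the combined score, and the toy log-probabilities are assumptions, not the authors' implementation; in practice the log-probabilities would come from a model pretrained on high-quality English data.

```python
# Hypothetical sketch of confidence + consistency data filtering.
# Not the paper's actual method: score names, weighting, and data are illustrative.

def confidence_score(token_logprobs):
    """Mean token log-probability as a proxy for model confidence.
    Values closer to 0 mean the model finds the text more predictable."""
    return sum(token_logprobs) / len(token_logprobs)

def consistency_score(scorer_means):
    """Agreement across several scorers (e.g., different checkpoints):
    negative variance, so tightly clustered scores rank higher."""
    mean = sum(scorer_means) / len(scorer_means)
    return -sum((s - mean) ** 2 for s in scorer_means) / len(scorer_means)

def filter_corpus(docs, keep_fraction=0.5):
    """Rank documents by combined confidence + consistency, keep the top slice."""
    ranked = sorted(
        docs,
        key=lambda d: confidence_score(d["logprobs"])
        + consistency_score(d["scorer_means"]),
        reverse=True,
    )
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

# Toy corpus: a clean document and a noisy one the model scores poorly.
corpus = [
    {"id": "clean", "logprobs": [-1.0, -1.2, -0.9], "scorer_means": [-1.0, -1.1]},
    {"id": "noisy", "logprobs": [-6.0, -7.5, -5.8], "scorer_means": [-6.0, -3.0]},
]
kept = filter_corpus(corpus, keep_fraction=0.5)
```

Under this sketch, the clean document survives filtering because it is both high-confidence and scored consistently, while the noisy one is dropped on both counts.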
📝 Abstract
Bilingual and multilingual language models offer a promising path toward scaling NLP systems across diverse languages and users. However, their performance often varies widely across languages: prior work shows that adding more languages can degrade performance for some (such as English) while improving others (typically more data-constrained languages). In this work, we investigate the causes of these inconsistencies by comparing bilingual and monolingual language models. Our analysis reveals that unequal data quality, not just data quantity, is a major driver of performance degradation in bilingual settings. We propose a simple yet effective data filtering strategy that selects higher-quality bilingual training data using only high-quality English data. Applied to French, German, and Chinese, our approach improves monolingual performance by 2–4% and reduces bilingual model performance gaps to 1%. These results highlight the overlooked importance of data quality in multilingual pretraining and offer a practical recipe for balancing performance.