🤖 AI Summary
Wikipedia articles in low-resource languages suffer from pervasive data quality issues, such as single-sentence entries, duplicated content, and structural incompleteness, that undermine their utility for multilingual NLP.

Method: We conduct a systematic audit of multilingual Wikipedia, with an emphasis on low-resource languages, and propose a language- and task-aware quality assessment framework. The framework integrates lightweight, multidimensional metrics (length, redundancy, and structural completeness) to prune data efficiently, and we measure the downstream impact of pruning using mBERT and XLM-R.

Contribution/Results: Pruning that retains only the top 30–50% of articles by quality maintains, and in some cases improves, performance on low-resource-language NLP tasks, demonstrating that quality-driven pruning benefits both training efficiency and model effectiveness. The work challenges the assumption of uniform data quality across languages and offers a transferable methodology for data curation in multilingual pretraining.
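As a concrete illustration of the pruning recipe the summary describes, the sketch below scores articles on length, redundancy, and structural completeness, then keeps only a top fraction of the ranked corpus. The metric definitions, equal weighting, and thresholds are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of multidimensional quality scoring and pruning.
# All weights and thresholds below are assumptions for illustration.
from dataclasses import dataclass, field


@dataclass
class Article:
    title: str
    text: str
    sections: list[str] = field(default_factory=list)  # section headings, if any


def length_score(article: Article, min_chars: int = 200) -> float:
    """Penalise very short (e.g. one-line) articles; saturates at 1.0."""
    return min(len(article.text) / min_chars, 1.0)


def redundancy_score(article: Article) -> float:
    """Fraction of unique sentences; repeated boilerplate lowers the score.
    Splitting on '.' is a deliberately crude sentence segmenter."""
    sentences = [s.strip() for s in article.text.split(".") if s.strip()]
    if not sentences:
        return 0.0
    return len(set(sentences)) / len(sentences)


def structure_score(article: Article, target_sections: int = 3) -> float:
    """Reward articles that have at least some section structure."""
    return min(len(article.sections) / target_sections, 1.0)


def quality_score(article: Article) -> float:
    # Equal weights are an assumption; a language- and task-aware setup
    # would tune these per language and per downstream task.
    return (length_score(article)
            + redundancy_score(article)
            + structure_score(article)) / 3


def prune(articles: list[Article], keep_fraction: float = 0.4) -> list[Article]:
    """Keep the top `keep_fraction` of articles by quality score
    (the paper's retained share is 30-50%)."""
    ranked = sorted(articles, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]


if __name__ == "__main__":
    corpus = [
        Article("Stub", "A town in X."),
        Article("Full", "A long article. With several sentences. And detail. " * 5,
                ["History", "Geography", "Culture"]),
    ]
    for a in prune(corpus, keep_fraction=0.5):
        print(a.title, round(quality_score(a), 2))
```

Averaging with equal weights is just the simplest aggregation; the framework's language- and task-aware angle would amount to tuning the weights and `keep_fraction` per language rather than applying one global threshold.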
📝 Abstract
Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high percentage of one-line and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means of resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
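The duplicate-article finding in the abstract suggests one simple filtering step; a minimal sketch is shown below, flagging duplicates by hashing normalised article text. Exact-match hashing is an assumption made here for illustration; the paper's own deduplication may use fuzzier near-duplicate detection.

```python
# Minimal sketch of duplicate-article detection via content hashing.
# Exact-match hashing after normalisation is an illustrative assumption.
import hashlib


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def find_duplicates(articles: dict[str, str]) -> dict[str, list[str]]:
    """Map each content hash to the article titles that share it,
    keeping only buckets with more than one title."""
    buckets: dict[str, list[str]] = {}
    for title, text in articles.items():
        digest = hashlib.sha1(normalise(text).encode("utf-8")).hexdigest()
        buckets.setdefault(digest, []).append(title)
    return {h: titles for h, titles in buckets.items() if len(titles) > 1}


articles = {
    "Village A": "A village in the north.",
    "Village B": "A village in the north.",  # duplicated content
    "Village C": "A distinct article with substantive content.",
}
print(find_duplicates(articles))  # one bucket: ['Village A', 'Village B']
```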