🤖 AI Summary
Low-resource languages suffer from a critical scarcity of high-quality multilingual textual data, severely constraining the development of large language models. To address this, we propose the first systematic, open-source web corpus construction framework specifically designed for low-resource languages. Our framework integrates multi-stage collaborative processing: adaptive web page extraction, language-aware cleaning, cross-document semantic deduplication, fine-grained safety filtering, multidimensional quality assessment, and topic-consistency classification, preserving linguistic diversity while improving data safety and reliability. We publicly release high-quality corpora covering five low-resource languages. Empirical evaluations demonstrate superior data quality, safety, and usability compared to existing benchmarks. The datasets and code are fully open-sourced on OpenDataLab and GitHub, providing a robust, scalable foundation for multilingual large model training and research.
📝 Abstract
This paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages, thereby advancing the research and development of multilingual models. To achieve this, we have developed a systematic data processing framework tailored for low-resource languages. This framework encompasses key stages such as data extraction, corpus cleaning, content deduplication, security filtering, quality evaluation, and theme classification. Through the implementation of this framework, we have significantly improved both the quality and security of the dataset while maintaining its linguistic diversity. As of now, data for all five languages have been fully open-sourced. The dataset can be accessed at https://opendatalab.com/applyMultilingualCorpus, and the GitHub repository is available at https://github.com/opendatalab/WanJuan3.0.
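To make the staged structure of such a framework concrete, the following is a minimal Python sketch of a web-corpus pipeline with extraction, cleaning, deduplication, and safety-filtering stages. All function names and heuristics here are hypothetical illustrations of the general approach, not the actual WanJuan3.0 implementation, which uses far more sophisticated methods (e.g., semantic deduplication and multidimensional quality scoring).

```python
# Hypothetical sketch of a multi-stage corpus pipeline; function names and
# heuristics are illustrative placeholders, not the WanJuan3.0 API.

def extract(pages):
    # Data extraction: pull raw text out of fetched web pages,
    # skipping pages with no usable content.
    return [p["text"].strip() for p in pages if p.get("text", "").strip()]

def clean(docs):
    # Corpus cleaning: drop documents that are too short to be useful
    # (a real pipeline would apply language-aware rules).
    return [d for d in docs if len(d.split()) >= 3]

def deduplicate(docs):
    # Content deduplication: remove exact duplicates while preserving order
    # (a real pipeline would also catch near-duplicates semantically).
    seen, unique = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            unique.append(d)
    return unique

def safety_filter(docs, blocklist):
    # Security filtering: discard documents containing blocked terms.
    return [d for d in docs if not any(t in d.lower() for t in blocklist)]

def pipeline(pages, blocklist=()):
    # Run the stages in sequence; each stage narrows the corpus.
    docs = extract(pages)
    docs = clean(docs)
    docs = deduplicate(docs)
    return safety_filter(docs, blocklist)
```

Each stage is a pure function over a list of documents, so stages can be reordered, replaced, or benchmarked independently, which is what makes this kind of framework adaptable across languages.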