WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-quality text data for low-resource languages is critically scarce, which severely constrains the development of multilingual large language models. To address this, we propose the first systematic, open-source web corpus construction framework specifically designed for low-resource languages. The framework chains several processing stages: adaptive web page extraction, language-aware cleaning, cross-document semantic deduplication, fine-grained safety filtering, multidimensional quality assessment, and topic classification. Together, these stages preserve linguistic diversity while improving data safety and reliability. We publicly release high-quality corpora covering five low-resource languages. Empirical evaluations demonstrate superior data quality, safety, and usability compared to existing benchmarks. The datasets and code are fully open-sourced on OpenDataLab and GitHub, providing a scalable foundation for multilingual large model training and research.

📝 Abstract
This paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages and thereby advance the research and development of multilingual models. To achieve this, we developed a systematic data processing framework tailored to low-resource languages. The framework covers six key stages: data extraction, corpus cleaning, content deduplication, security filtering, quality evaluation, and theme classification. Applying this framework significantly improved both the quality and the security of the dataset while maintaining its linguistic diversity. Data for all five languages have now been fully open-sourced. The dataset can be accessed at https://opendatalab.com/applyMultilingualCorpus, and the GitHub repository is available at https://github.com/opendatalab/WanJuan3.0
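To make the staged structure concrete, the following is a minimal sketch of how such a pipeline could be chained. It is not the authors' implementation: every name here (`Doc`, `clean`, `dedup`, `safety_filter`, `quality_filter`, `tag_topic`, `BLOCKLIST`, `TOPIC_KEYWORDS`) is hypothetical, the extraction stage is omitted, and each stage is a deliberately simple stand-in (for example, exact-hash deduplication and keyword matching rather than any learned model).

```python
import hashlib
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class Doc:
    """One extracted web document (hypothetical schema)."""
    text: str
    lang: str
    meta: dict = field(default_factory=dict)


def clean(docs: Iterable[Doc]) -> Iterator[Doc]:
    # Corpus cleaning stand-in: collapse whitespace, drop empty documents.
    for d in docs:
        text = re.sub(r"\s+", " ", d.text).strip()
        if text:
            yield Doc(text, d.lang, dict(d.meta))


def dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    # Deduplication stand-in: exact match on a hash of the lowercased text.
    seen = set()
    for d in docs:
        key = hashlib.sha1(d.text.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield d


BLOCKLIST = {"badword"}  # assumption: placeholder for a real safety lexicon


def safety_filter(docs: Iterable[Doc]) -> Iterator[Doc]:
    # Security filtering stand-in: drop documents containing blocked tokens.
    for d in docs:
        if not set(d.text.lower().split()) & BLOCKLIST:
            yield d


def quality_filter(docs: Iterable[Doc], min_words: int = 5) -> Iterator[Doc]:
    # Quality evaluation stand-in: keep documents with enough words.
    for d in docs:
        if len(d.text.split()) >= min_words:
            yield d


TOPIC_KEYWORDS = {  # assumption: toy keyword lists, not a trained classifier
    "sports": {"match", "team"},
    "finance": {"bank", "market"},
}


def tag_topic(docs: Iterable[Doc]) -> Iterator[Doc]:
    # Theme classification stand-in: first topic whose keywords appear.
    for d in docs:
        tokens = set(d.text.lower().split())
        d.meta["topic"] = next(
            (t for t, kws in TOPIC_KEYWORDS.items() if tokens & kws), "other"
        )
        yield d


def run_pipeline(docs: Iterable[Doc]) -> List[Doc]:
    # Chain the stages lazily; each stage consumes the previous one's output.
    stages: List[Callable[[Iterable[Doc]], Iterator[Doc]]] = [
        clean, dedup, safety_filter, quality_filter, tag_topic,
    ]
    stream: Iterable[Doc] = docs
    for stage in stages:
        stream = stage(stream)
    return list(stream)
```

Calling `run_pipeline` on a list of `Doc` objects returns only the documents that survive every filter, each tagged with a `topic` entry in `meta`. Writing the stages as generators keeps memory use flat on large crawls, and the stage order (cheap cleaning and deduplication before the filters) is one reasonable choice rather than the order the paper prescribes.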
Problem

Research questions and friction points this paper is trying to address.

Multi-language Datasets
Resource-poor Languages
Text Data Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Dataset
Quality Improvement
Resource-poor Languages
Authors
Jia Yu
Co-founder, Wherobots Inc.; Assistant Professor of Computer Science, Washington State University
Database systems, Data management, Geospatial databases, GIS
Fei Yuan
Minnesota State University, Mankato
remote sensing, GIS, environmental monitoring and assessment, natural resource mapping
Rui Min
Hong Kong University of Science and Technology
Machine Learning, Agent, Trustworthy AI
Jing Yu
Northwestern University
Sustainability, Life Cycle Analysis, Transportation Management, Operations Research
Pei Chu
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Jiayang Li
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Wei Li
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Ruijie Zhang
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Zhenxiang Li
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Zhifei Ren
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Dong Zheng
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Wenjian Zhang
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Yan Teng
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Lingyu Meng
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
ZhenJiang Jin
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Jiantao Qiu
EE department of Tsinghua University
ShaSha Wang
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Zhongying Tu
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Dahua Lin
The Chinese University of Hong Kong
computer vision, machine learning, probabilistic inference, bayesian nonparametrics
Yu Wang
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Yu Qiao
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Yanfeng Wang
Shanghai Jiao Tong University
Conghui He
Shanghai AI Laboratory
Data-centric AI, LLM, Document Intelligence