OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

📅 2025-01-14
🤖 AI Summary
Chinese large language models (LLMs) suffer from a scarcity of high-quality training data. To address this, we propose the first systematic, high-quality Chinese corpus framework covering pretraining, post-training, and fine-tuning stages. We introduce a three-category taxonomy—knowledge-intensive, education-filtered, and dialogue-style—and integrate heterogeneous, multi-source subsets including Fineweb-edu-chinese, Cosmopedia-chinese, and Smoltalk-chinese. Our methodology employs web deduplication and cleaning, synthetic textbook generation, dialogue-style modeling, and multi-tier quality evaluation, supported by automated provenance tracking and a scalable data pipeline. Empirical evaluation on benchmarks such as C-Eval demonstrates significant performance gains for small- and medium-sized models, validating corpus efficacy. All datasets are publicly released to advance standardized, reproducible research in Chinese LLM development.
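
The summary names web deduplication and cleaning as the first curation step but does not spell out an algorithm. The sketch below shows one common approach for Chinese web text, MinHash over character n-grams with pairwise signature comparison; every function name, the shingle size, and the 0.8 similarity threshold are illustrative assumptions, not the paper's actual pipeline.

```python
import hashlib
from typing import Iterable, List, Set

def char_ngrams(text: str, n: int = 5) -> Set[str]:
    # Character n-grams suit Chinese, which has no whitespace word boundaries.
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingles: Set[str], num_perm: int = 64) -> List[int]:
    # One seeded hash per signature slot; taking the minimum over all
    # shingles approximates the first element of a random permutation.
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles
        )
        for seed in range(num_perm)
    ]

def estimated_jaccard(a: List[int], b: List[int]) -> float:
    # The fraction of matching slots estimates Jaccard similarity
    # between the two documents' shingle sets.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def deduplicate(texts: Iterable[str], threshold: float = 0.8) -> List[str]:
    # Quadratic pairwise comparison, fine for a sketch; a production
    # pipeline would bucket signatures with LSH banding instead.
    kept: List[str] = []
    kept_sigs: List[List[int]] = []
    for text in texts:
        sig = minhash_signature(char_ngrams(text))
        if all(estimated_jaccard(sig, prev) < threshold for prev in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept
```

Character n-grams are used here instead of word shingles precisely because Chinese text is not whitespace-segmented; word-level shingling would require a tokenizer as an extra dependency.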

📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities, but their success heavily relies on the quality of pretraining corpora. For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge, often limiting their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistic and diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes. Additionally, we conducted extensive experimental analyses, including evaluations on smaller parameter models, which demonstrated significant performance improvements in tasks such as C-Eval, showcasing the effectiveness of the corpus for training Chinese LLMs.
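
To make "filtered, high-quality content" concrete, here is a minimal sketch of threshold-based educational filtering in the spirit of the English Fineweb-edu recipe, where a classifier scores each document on a 0 to 5 educational-value scale and low scorers are dropped before pretraining. The scorer below is a hypothetical length-based stand-in so the example runs end to end; the paper's actual classifier, scale, and cutoff are not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class WebDoc:
    url: str
    text: str

def score_educational_value(text: str) -> float:
    """Hypothetical stand-in for a learned quality classifier.

    A real pipeline would run a fine-tuned regressor (e.g. a BERT-style
    model) over each document; this length-based proxy exists only so
    the sketch is runnable.
    """
    return min(5.0, len(text) / 400.0)

def filter_educational(docs: Iterable[WebDoc],
                       threshold: float = 3.0) -> Iterator[WebDoc]:
    # Keep only documents whose educational score clears the cutoff;
    # everything else is excluded from the pretraining set.
    for doc in docs:
        if score_educational_value(doc.text) >= threshold:
            yield doc

if __name__ == "__main__":
    docs = [WebDoc("a", "短文"), WebDoc("b", "很长的教学文本" * 200)]
    print([d.url for d in filter_educational(docs)])  # -> ['b']
```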
❓ Problem
Research questions and friction points this paper is trying to address.

Chinese Corpus
Language Models
Training Quality

💡 Innovation
Methods, ideas, or system contributions that make the work stand out.

OpenCSG
Chinese Language Models
Comprehensive Corpus
👥 Authors

Yijiong Yu
Master's Student, Tsinghua University
Natural Language Processing, Machine Learning

Ziyun Dai
OpenCSG

Zekun Wang
OpenCSG

Wei Wang
OpenCSG

Ran Chen
OpenCSG

Ji Pei
OpenCSG