CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing bilingual pretraining data suffer from inconsistent quality, heterogeneous domain standards, high manual-annotation costs, insufficient chain-of-thought (CoT) diversity, and severe hallucination. Method: This work constructs CCI4.0, a 35 TB high-quality bilingual pretraining dataset comprising the M2-Base corpus and 4.5 billion structured CoT templates (M2-CoT). It proposes a two-stage model-driven deduplication scheme, multi-classifier collaborative quality scoring, domain-aware fluency filtering, and a staged CoT extraction methodology. Contribution/Results: These techniques significantly broaden reasoning coverage and suppress hallucination. Models pretrained on CCI4.0 achieve consistent gains on mathematical and code reflection benchmarks, empirically validating the critical role of high-quality data and human-like reasoning-pattern modeling in advancing the reasoning capabilities of large language models.
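The summary names a two-stage deduplication scheme but does not spell it out. As an illustrative sketch only (the function names, shingle size, and threshold below are assumptions, not the paper's actual pipeline), a common two-stage design runs an exact-hash pass first and a cheaper shingle-based near-duplicate pass on the survivors:

```python
import hashlib

def exact_dedup(docs):
    """Stage 1: drop documents whose normalized text hashes identically."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def shingles(text, n=3):
    """Word n-grams used as the comparison unit for near-duplicate detection."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def near_dedup(docs, threshold=0.5):
    """Stage 2: drop documents whose shingle Jaccard similarity to any
    already-kept document reaches the threshold."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(len(s & k) / len(s | k) < threshold for k in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = [
    "the cat sat on the mat",
    "The cat sat on the mat",          # exact duplicate after normalization
    "the cat sat on the mat today",    # near duplicate of the first
    "completely different sentence about code",
]
stage1 = exact_dedup(docs)             # removes the exact duplicate
stage2 = near_dedup(stage1)            # removes the near duplicate
print(len(stage1), len(stage2))        # → 3 2
```

Production pipelines typically replace the pairwise Jaccard comparison with MinHash/LSH so the near-duplicate stage scales beyond toy corpora.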

📝 Abstract
We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectories. CCI4.0 occupies roughly 35 TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a 5.2 TB carefully curated Chinese web corpus, a 22.5 TB English subset from Nemotron-CC, and diverse sources from math, wiki, arXiv, and code. Although these data are mostly sourced from well-processed datasets, the quality standards of the various domains are dynamic and require extensive expert experience and labor to process. We therefore propose a novel pipeline that assesses data quality primarily with models, through two-stage deduplication, multi-classifier quality scoring, and domain-aware fluency filtering. We extract 4.5 billion Chain-of-Thought (CoT) templates, named CCI4.0-M2-CoT. Unlike CoT distilled from larger models, our staged CoT extraction captures diverse reasoning patterns and significantly decreases the possibility of hallucination. Empirical evaluations demonstrate that LLMs pre-trained on CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements on downstream tasks, especially math and code reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, and shed some light on automatic processing of pretraining corpora.
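The abstract's domain-aware fluency filtering is described only at a high level. A minimal sketch of the idea, assuming perplexity-based filtering with per-domain thresholds (the thresholds, the `fake_ppl` stand-in, and the data layout are all hypothetical, not the paper's values), could look like:

```python
def domain_fluency_filter(docs, ppl_fn, domain_thresholds, default_cap=500.0):
    """Keep documents whose language-model perplexity is below a
    domain-specific cap. Domains like code naturally score higher
    perplexity under a prose LM, so a single global threshold would
    over-prune some domains and under-prune others."""
    kept = []
    for domain, text in docs:
        cap = domain_thresholds.get(domain, default_cap)
        if ppl_fn(text) <= cap:
            kept.append((domain, text))
    return kept

# Toy perplexity stand-in; a real pipeline would score with a trained
# language model (e.g. an n-gram LM such as KenLM).
fake_ppl = lambda text: 1000.0 if "@@" in text else 120.0

docs = [
    ("web", "a fluent web sentence"),
    ("web", "garbled @@ tokens @@"),     # high perplexity, dropped
    ("code", "def f(): return 1"),       # tolerated under the code cap
]
thresholds = {"web": 200.0, "code": 800.0}
print(domain_fluency_filter(docs, fake_ppl, thresholds))
```

The design choice worth noting is that the threshold table, not the scorer, encodes the domain awareness: the same perplexity function is reused everywhere, and only the acceptance cap varies by domain.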
Problem

Research questions and friction points this paper is trying to address.

Enhancing bilingual reasoning in large language models
Improving data quality via novel curation pipeline
Reducing hallucination with diverse reasoning templates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual dataset with human-like reasoning
Two-stage deduplication and quality scoring
Staged CoT extraction reduces hallucination
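The multi-classifier quality scoring listed above is not detailed on this page. One plausible shape for it, sketched here with hypothetical toy classifiers (the weighting scheme and threshold are assumptions, not the paper's method), is a weighted ensemble of per-aspect quality scorers:

```python
def ensemble_quality_score(doc, classifiers, weights=None):
    """Combine several quality classifiers into one score in [0, 1].
    Each classifier maps a document to a quality probability; averaging
    (optionally weighted) keeps any single noisy scorer from deciding
    alone whether a document is kept."""
    weights = weights or [1.0] * len(classifiers)
    total = sum(weights)
    return sum(w * clf(doc) for w, clf in zip(weights, classifiers)) / total

def filter_corpus(docs, classifiers, keep_threshold=0.6):
    """Keep only documents whose ensemble score clears the threshold."""
    return [d for d in docs
            if ensemble_quality_score(d, classifiers) >= keep_threshold]

# Toy stand-in classifiers; real ones would be trained quality models.
length_clf = lambda d: min(1.0, len(d.split()) / 10)            # favors longer docs
alpha_clf = lambda d: sum(c.isalpha() or c.isspace() for c in d) / max(1, len(d))

docs = ["short", "a reasonably long and clean sentence about language models"]
print(filter_corpus(docs, [length_clf, alpha_clf]))   # keeps only the second doc
```

In practice the individual classifiers would be fastText- or BERT-style quality models trained per domain, which is where the "collaborative" aspect of the scoring comes in.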
Guang Liu
BAAI
Liangdong Wang
Data Research Team, Beijing Academy of Artificial Intelligence
Jijie Li
Data Research Team, Beijing Academy of Artificial Intelligence
Yang Yu
Data Research Team, Beijing Academy of Artificial Intelligence
Yao Xu
Data Research Team, Beijing Academy of Artificial Intelligence
Jiabei Chen
Data Research Team, Beijing Academy of Artificial Intelligence
Yu Bai
Data Research Team, Beijing Academy of Artificial Intelligence
Feng Liao
Data Research Team, Beijing Academy of Artificial Intelligence
Yonghua Lin
Data Research Team, Beijing Academy of Artificial Intelligence