CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing bilingual pretraining data suffer from inconsistent quality, heterogeneous domain standards, high manual-annotation costs, insufficient chain-of-thought (CoT) diversity, and severe hallucination. Method: This work constructs CCI4.0, a 35 TB high-quality bilingual pretraining dataset comprising the M2-Base corpus and 4.5 billion structured CoT templates (M2-CoT). It proposes a two-stage model-driven deduplication scheme, multi-classifier collaborative quality scoring, domain-aware fluency filtering, and a staged CoT extraction methodology. Contribution/Results: These techniques significantly broaden reasoning coverage and suppress hallucination. Models pretrained on CCI4.0 achieve consistent gains on mathematical and code reflection benchmarks, empirically validating the critical role of high-quality data and human-like reasoning-pattern modeling in advancing the reasoning capabilities of large language models.
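The summary names a two-stage deduplication scheme but does not spell it out. As an illustrative sketch only (the function names, shingle size, and threshold below are assumptions, not the paper's actual pipeline), a common two-stage design runs an exact-hash pass first and a cheaper shingle-based near-duplicate pass on the survivors:

```python
import hashlib

def exact_dedup(docs):
    """Stage 1: drop documents whose normalized text hashes identically."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def shingles(text, n=3):
    """Word n-grams used as the comparison unit for near-duplicate detection."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def near_dedup(docs, threshold=0.5):
    """Stage 2: drop documents whose shingle Jaccard similarity to any
    already-kept document reaches the threshold."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(len(s & k) / len(s | k) < threshold for k in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = [
    "the cat sat on the mat",
    "The cat sat on the mat",          # exact duplicate after normalization
    "the cat sat on the mat today",    # near duplicate of the first
    "completely different sentence about code",
]
stage1 = exact_dedup(docs)             # removes the exact duplicate
stage2 = near_dedup(stage1)            # removes the near duplicate
print(len(stage1), len(stage2))        # → 3 2
```

Production pipelines typically replace the pairwise Jaccard comparison with MinHash/LSH so the near-duplicate stage scales beyond toy corpora.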

📝 Abstract
We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectories. CCI4.0 occupies roughly 35 TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a 5.2 TB carefully curated Chinese web corpus, a 22.5 TB English subset from Nemotron-CC, and diverse sources from math, wiki, arXiv, and code. Although these data are mostly sourced from well-processed datasets, the quality standards of the various domains are dynamic and require extensive expert experience and labor to process. We therefore propose a novel pipeline that assesses data quality primarily with models, through two-stage deduplication, multi-classifier quality scoring, and domain-aware fluency filtering. We extract 4.5 billion Chain-of-Thought (CoT) templates, named CCI4.0-M2-CoT. Unlike CoT distilled from larger models, our staged CoT extraction captures diverse reasoning patterns and significantly decreases the possibility of hallucination. Empirical evaluations demonstrate that LLMs pre-trained on CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements on downstream tasks, especially math and code reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, and shed some light on automatic processing of pretraining corpora.
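The abstract's domain-aware fluency filtering is described only at a high level. A minimal sketch of the idea, assuming perplexity-based filtering with per-domain thresholds (the thresholds, the `fake_ppl` stand-in, and the data layout are all hypothetical, not the paper's values), could look like:

```python
def domain_fluency_filter(docs, ppl_fn, domain_thresholds, default_cap=500.0):
    """Keep documents whose language-model perplexity is below a
    domain-specific cap. Domains like code naturally score higher
    perplexity under a prose LM, so a single global threshold would
    over-prune some domains and under-prune others."""
    kept = []
    for domain, text in docs:
        cap = domain_thresholds.get(domain, default_cap)
        if ppl_fn(text) <= cap:
            kept.append((domain, text))
    return kept

# Toy perplexity stand-in; a real pipeline would score with a trained
# language model (e.g. an n-gram LM such as KenLM).
fake_ppl = lambda text: 1000.0 if "@@" in text else 120.0

docs = [
    ("web", "a fluent web sentence"),
    ("web", "garbled @@ tokens @@"),     # high perplexity, dropped
    ("code", "def f(): return 1"),       # tolerated under the code cap
]
thresholds = {"web": 200.0, "code": 800.0}
print(domain_fluency_filter(docs, fake_ppl, thresholds))
```

The design choice worth noting is that the threshold table, not the scorer, encodes the domain awareness: the same perplexity function is reused everywhere, and only the acceptance cap varies by domain.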
Problem

Research questions and friction points this paper is trying to address.

Enhancing bilingual reasoning in large language models
Improving data quality via novel curation pipeline
Reducing hallucination with diverse reasoning templates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual dataset with human-like reasoning
Two-stage deduplication and quality scoring
Staged CoT extraction reduces hallucination
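The multi-classifier quality scoring listed above is not detailed on this page. One plausible shape for it, sketched here with hypothetical toy classifiers (the weighting scheme and threshold are assumptions, not the paper's method), is a weighted ensemble of per-aspect quality scorers:

```python
def ensemble_quality_score(doc, classifiers, weights=None):
    """Combine several quality classifiers into one score in [0, 1].
    Each classifier maps a document to a quality probability; averaging
    (optionally weighted) keeps any single noisy scorer from deciding
    alone whether a document is kept."""
    weights = weights or [1.0] * len(classifiers)
    total = sum(weights)
    return sum(w * clf(doc) for w, clf in zip(weights, classifiers)) / total

def filter_corpus(docs, classifiers, keep_threshold=0.6):
    """Keep only documents whose ensemble score clears the threshold."""
    return [d for d in docs
            if ensemble_quality_score(d, classifiers) >= keep_threshold]

# Toy stand-in classifiers; real ones would be trained quality models.
length_clf = lambda d: min(1.0, len(d.split()) / 10)            # favors longer docs
alpha_clf = lambda d: sum(c.isalpha() or c.isspace() for c in d) / max(1, len(d))

docs = ["short", "a reasonably long and clean sentence about language models"]
print(filter_corpus(docs, [length_clf, alpha_clf]))   # keeps only the second doc
```

In practice the individual classifiers would be fastText- or BERT-style quality models trained per domain, which is where the "collaborative" aspect of the scoring comes in.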
Guang Liu
BAAI
Liangdong Wang
Data Research Team, Beijing Academy of Artificial Intelligence
Jijie Li
Data Research Team, Beijing Academy of Artificial Intelligence
Yang Yu
Data Research Team, Beijing Academy of Artificial Intelligence
Yao Xu
Data Research Team, Beijing Academy of Artificial Intelligence
Jiabei Chen
Data Research Team, Beijing Academy of Artificial Intelligence
Yu Bai
Data Research Team, Beijing Academy of Artificial Intelligence
Feng Liao
Data Research Team, Beijing Academy of Artificial Intelligence
Yonghua Lin
Data Research Team, Beijing Academy of Artificial Intelligence