Data Science and Technology Towards AGI Part I: Tiered Data Management

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses growing bottlenecks in large language model (LLM) training caused by overreliance on scaling data volume, which faces limits in data availability, cost, and efficiency. The authors propose a hierarchical (L0–L4) data management framework spanning the entire training lifecycle, which integrates LLMs directly into the data processing pipeline. The five-tier hierarchy transforms raw corpora into verifiable knowledge, enabling model-guided data filtering, quality scoring, and dynamic allocation. It supports heterogeneous learning objectives across the pre-training, mid-training, and alignment stages, improving both training efficiency and model performance. To foster co-evolution of data and models, the authors release the tiered datasets and accompanying tools as open-source resources.

📝 Abstract
The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, increasingly encountering bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework, designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Specifically, we introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. Importantly, LLMs are fully used in data management processes, such as quality scoring and content editing, to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, enabling data to be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the effectiveness of the proposed framework through empirical studies, in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.
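As a rough illustration of the tier-aware data allocation the abstract describes, the sketch below buckets documents into L0–L4 tiers by a quality score and routes each tier to a training stage. Everything here is an illustrative assumption: the thresholds, the stage mapping, and the scalar `quality` field are stand-ins for the paper's LLM-based quality scoring and allocation strategy, not its actual implementation.

```python
# Hypothetical sketch of tiered data management (L0-L4).
# Thresholds, field names, and the tier-to-stage mapping are
# illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class Document:
    text: str
    quality: float  # 0.0 (raw, uncurated) .. 1.0 (verified knowledge)


# Illustrative boundaries between tiers L0..L4.
TIER_THRESHOLDS = [0.2, 0.4, 0.6, 0.8]


def assign_tier(doc: Document) -> int:
    """Map a quality score to a tier index L0..L4."""
    tier = 0
    for bound in TIER_THRESHOLDS:
        if doc.quality >= bound:
            tier += 1
    return tier


# Hypothetical allocation: lower tiers feed bulk pre-training,
# higher tiers feed mid-training and alignment.
STAGE_BY_TIER = {0: "pretraining", 1: "pretraining",
                 2: "mid-training", 3: "mid-training",
                 4: "alignment"}


def allocate(docs):
    """Group documents by the training stage their tier maps to."""
    buckets = {stage: [] for stage in set(STAGE_BY_TIER.values())}
    for doc in docs:
        buckets[STAGE_BY_TIER[assign_tier(doc)]].append(doc)
    return buckets


docs = [Document("web crawl page", 0.1),
        Document("cleaned article", 0.55),
        Document("verified QA pair", 0.9)]
stages = allocate(docs)
print({stage: len(bucket) for stage, bucket in stages.items()})
```

In the paper's framing the scoring step itself is performed by an LLM (quality scoring, content editing), so `quality` would be model-produced rather than a stored field; the bucketing and stage routing above only sketch the downstream allocation logic.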
Problem

Research questions and friction points this paper is trying to address.

data management
large language models
training efficiency
data scaling
AGI
Innovation

Methods, ideas, or system contributions that make the work stand out.

tiered data management
data-model co-evolution
large language models
data quality refinement
training efficiency
👥 Authors
Yudong Wang
Tsinghua University
Zixuan Fu
Nanyang Technological University
Image Restoration, Generative Models, Low-level Vision
Hengyu Zhao
ModelBest Inc.; Beijing Institute of Technology
Chen Zhao
ModelBest Inc.
Chuyue Zhou
ModelBest Inc.
Xinle Lin
ModelBest Inc.; South China Agricultural University
Hongya Lyu
ModelBest Inc.
Shuaikang Xue
ModelBest Inc.
Yi Yi
ModelBest Inc.
Yingjiao Wang
ModelBest Inc.
Zhi Zheng
ModelBest Inc.
Yuzhou Zhang
ModelBest Inc.
Jie Zhou
ModelBest Inc.
Chaojun Xiao
Postdoctoral Researcher, Tsinghua University
Large Language Model
Xu Han
Research Assistant Professor, Tsinghua University
Natural Language Processing, Large Language Model, Knowledge Graph, Information Extraction
Zhiyuan Liu
Tsinghua University
autonomous driving, traffic simulation
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing, Artificial Intelligence, Social Computing