Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the critical yet underexplored role of post-training data in shaping large language model capabilities, where datasets are treated as isolated artifacts and the absence of systematic lineage tracking leads to redundancy and contamination. The authors introduce the concept of data lineage to the LLM ecosystem and develop a multi-agent framework that automatically reconstructs the evolutionary graph of post-training datasets. Building on this graph, they propose a lineage-centric analytical paradigm grounded in topological analysis, presented as an efficient and robust alternative to conventional sample-level comparison. The analysis uncovers domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora, and identifies structural redundancy caused by implicit dataset intersections as well as the propagation of benchmark contamination along lineage paths. A lineage-aware dataset built by anchoring instruction sampling at upstream root sources yields a more diverse corpus and mitigates hidden redundancy.
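To make the idea of contamination propagating along lineage paths concrete, the following is a minimal sketch that treats the lineage graph as a directed graph and flags every dataset downstream of a contaminated one. The dataset names, the `DERIVED_FROM` mapping, and the `downstream_of` helper are hypothetical illustrations, not the paper's implementation.

```python
from collections import deque

# Hypothetical lineage edges: upstream dataset -> datasets derived from it.
DERIVED_FROM = {
    "root_math": ["math_refined_v1"],
    "math_refined_v1": ["math_refined_v2"],
    "root_chat": ["general_mix"],
    "math_refined_v2": ["general_mix"],
    "general_mix": [],
}

def downstream_of(graph, sources):
    """Return every dataset reachable from `sources` along lineage edges."""
    seen = set()
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Assume (for illustration) that benchmark leakage was found in math_refined_v1.
at_risk = downstream_of(DERIVED_FROM, ["math_refined_v1"])
print(sorted(at_risk))  # ['general_mix', 'math_refined_v2']
```

A breadth-first traversal like this only inspects graph edges, which is why a topological check can be much cheaper than comparing samples across every pair of datasets.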

📝 Abstract

Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including structural redundancy induced by implicit dataset intersections and the propagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a lineage-aware diversity-oriented dataset. By anchoring instruction sampling at upstream root sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post-training corpus. We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post-training data curation toward a more systematic and controllable paradigm.
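As a rough illustration of anchoring instruction sampling at upstream root sources, the sketch below resolves each dataset to its roots in a lineage graph and draws an equal budget from each root, so that a source reused by several downstream datasets is sampled only once. The `PARENTS` mapping, the `ROOT_POOLS` data, and both helper functions are hypothetical; the sketch assumes the lineage graph is acyclic and does not reproduce the paper's actual sampling procedure.

```python
import random

# Hypothetical lineage: dataset -> upstream parents (empty list = root source).
PARENTS = {
    "root_a": [],
    "root_b": [],
    "refined_a": ["root_a"],
    "mix_ab": ["refined_a", "root_b"],
}

# Hypothetical instruction pools held by each root source.
ROOT_POOLS = {
    "root_a": [f"a-{i}" for i in range(100)],
    "root_b": [f"b-{i}" for i in range(100)],
}

def roots_of(dataset, parents):
    """Collect the upstream root sources of `dataset` (assumes no cycles)."""
    ups = parents.get(dataset, [])
    if not ups:
        return {dataset}
    found = set()
    for parent in ups:
        found |= roots_of(parent, parents)
    return found

def sample_from_roots(datasets, parents, pools, per_root, seed=0):
    """Sample `per_root` instructions from each distinct root behind `datasets`."""
    rng = random.Random(seed)
    roots = set()
    for dataset in datasets:
        roots |= roots_of(dataset, parents)
    sample = []
    for root in sorted(roots):  # sorted for a reproducible order
        pool = pools[root]
        sample.extend(rng.sample(pool, min(per_root, len(pool))))
    return sample

# refined_a and mix_ab both trace back to root_a, which is sampled only once.
picked = sample_from_roots(["refined_a", "mix_ab"], PARENTS, ROOT_POOLS, per_root=5)
print(len(picked))  # 10
```

Sampling per root rather than per downstream dataset is the design choice that keeps overlapping derivatives from being counted twice.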

Problem

Research questions and friction points this paper is trying to address.

data lineage

post-training LLMs

dataset redundancy

benchmark contamination

data curation

Innovation

Methods, ideas, or system contributions that make the work stand out.

data lineage

multi-agent framework

post-training data curation

structural redundancy

lineage-aware dataset

🔎 Similar Papers

A Comprehensive Survey of Contamination Detection Methods in Large Language Models

2024-03-31, Citations: 6

The Emerged Security and Privacy of LLM Agent: A Survey with Case Studies

2024-07-28, arXiv.org, Citations: 62