AI Summary
Existing Data Shapley methods require repeatedly retraining models on different data subsets, incurring prohibitive computational overhead, and they yield generic contribution scores that cannot be tailored to a specific target model.
Method: We propose In-Run Data Shapley, the first framework enabling efficient, model-specific data contribution attribution within a single training run, including for large language models. It embeds Shapley value computation directly into the dynamic parameter update process via gradient tracing and stochastic linear approximation, eliminating the need for auxiliary retraining.
Contribution/Results: The method incurs negligible attribution overhead relative to standard training and supports fine-grained quantification of data value at the pretraining stage. Experiments demonstrate its interpretability and practical utility for copyright provenance and data curation. By bypassing iterative retraining, In-Run Data Shapley removes the computational bottleneck that has hindered data value assessment for large-scale models.
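The gradient-tracing idea above can be illustrated with a minimal first-order sketch: at each training step, a sample's contribution to a validation point is approximated by the learning rate times the dot product of that sample's gradient with the validation-loss gradient, accumulated over the run. This toy example (linear regression, full-batch gradient descent, an arbitrary validation point) is an illustrative assumption, not the paper's actual implementation, which targets large models and uses more refined approximations.

```python
import numpy as np

# Toy setup: linear regression trained by full-batch gradient descent.
rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)
x_val, y_val = rng.normal(size=d), 0.0  # hypothetical validation point

w = np.zeros(d)
lr = 0.05
scores = np.zeros(n)  # running first-order contribution per training sample

for _ in range(200):
    # Per-sample gradients of the averaged squared loss:
    # grad_i = (2 / n) * (x_i . w - y_i) * x_i
    residuals = X @ w - y
    per_sample_grads = 2.0 * residuals[:, None] * X / n
    # Validation-loss gradient at the current parameters
    val_grad = 2.0 * (x_val @ w - y_val) * x_val
    # First-order attribution: sample i's part of this update shifts the
    # validation loss by approximately lr * grad_i . val_grad.
    scores += lr * per_sample_grads @ val_grad
    # Standard gradient-descent step on the averaged training loss
    w -= lr * per_sample_grads.sum(axis=0)

# Samples with positive scores reduced the validation loss during training;
# negative scores flag samples that pushed the model away from it.
```

Because the attribution reuses the per-sample gradients already computed during training, the extra cost per step is one dot product per sample, which is what makes single-run attribution cheap relative to retraining-based Data Shapley.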
Abstract
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.