🤖 AI Summary
This work addresses a limitation of existing in-run data contribution evaluation methods: they rely on linearity assumptions tied to SGD and fail to generalize to adaptive optimizers like Adam, leading to inaccurate attribution. The study is the first to reveal the optimizer-dependence of data attribution and proposes a dynamic Shapley value estimation framework tailored to Adam. By introducing a fixed-state assumption to reformulate the utility function, and by combining a linearized ghost approximation with gradient inner products, the method enables efficient, scalable, and high-fidelity contribution estimation without model retraining or per-sample gradient storage. It achieves over 95% of the original training throughput while maintaining a correlation coefficient exceeding 0.99 with true marginal contributions, and it significantly outperforms SGD-based baselines on downstream tasks.
📝 Abstract
Reliable data attribution is essential for mitigating bias and reducing computational waste in modern machine learning, with the Shapley value serving as the theoretical gold standard. While recent "In-Run" methods bypass the prohibitive cost of retraining by estimating contributions dynamically, they rely heavily on the linear structure of Stochastic Gradient Descent (SGD) and fail to capture the complex dynamics of adaptive optimizers like Adam. In this work, we demonstrate that data attribution is inherently optimizer-dependent: SGD-based proxies diverge significantly from true contributions under Adam (Pearson $R \approx 0.11$), rendering them ineffective for modern training pipelines. To bridge this gap, we propose Adam-Aware In-Run Data Shapley. We derive a closed-form approximation that restores additivity by redefining utility under a fixed-state assumption, and we enable scalable computation via a novel Linearized Ghost Approximation. This technique linearizes the variance-dependent scaling term, allowing us to compute pairwise gradient dot-products without materializing per-sample gradients. Extensive experiments show that our method achieves near-perfect fidelity to ground-truth marginal contributions ($R > 0.99$) while retaining $\sim$95\% of standard training throughput. Furthermore, our Adam-aware attribution significantly outperforms SGD-based baselines on data attribution downstream tasks.
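The "ghost" trick of computing pairwise gradient dot-products without materializing per-sample gradients can be illustrated for a single linear layer: each per-sample weight gradient is a rank-1 outer product of the output gradient and the layer input, so the Frobenius inner product of two such gradients factors into two small vector dot products. Below is a minimal NumPy sketch of that identity only; the paper's Linearized Ghost Approximation additionally handles Adam's variance-dependent scaling, which is not shown here, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
a_i, a_j = rng.normal(size=d_in), rng.normal(size=d_in)    # layer inputs for samples i, j
g_i, g_j = rng.normal(size=d_out), rng.normal(size=d_out)  # backprop'd output gradients

# Per-sample weight gradients are rank-1 outer products: G = g a^T
G_i = np.outer(g_i, a_i)
G_j = np.outer(g_j, a_j)

# Naive inner product materializes both gradient matrices: O(d_in * d_out) memory
naive = np.sum(G_i * G_j)

# Ghost identity: <G_i, G_j>_F = (g_i . g_j) * (a_i . a_j), only O(d_in + d_out) memory
ghost = (g_i @ g_j) * (a_i @ a_j)

assert np.allclose(naive, ghost)
```

The same factorization applied across a batch yields all pairwise dot-products from quantities already available during a standard forward/backward pass, which is what keeps in-run attribution close to full training throughput.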