🤖 AI Summary
Conventional Data Shapley computation is prohibitively expensive on large-scale datasets, which hinders its adoption in trustworthy machine learning. To address this, the paper proposes the CHG (compound of Hardness and Gradient) utility function, which jointly incorporates data difficulty and gradient information. The authors derive a closed-form solution for the Shapley value of each data point under CHG, reducing the computational cost to that of a single model retraining, a quadratic improvement over existing marginal contribution-based methods. The method enables real-time data valuation and selection without repeated retraining or sampling. Experiments on standard datasets, datasets with label noise, and class-imbalanced datasets show that CHG Shapley is effective at identifying high-value samples and noisy instances. The work thereby offers an efficient, data-centric route to more trustworthy model training.
📝 Abstract
Understanding the decision-making process of machine learning models is crucial for ensuring trustworthy machine learning. Data Shapley, a landmark study on data valuation, advances this understanding by assessing the contribution of each datum to model performance. However, the repeated model retraining it requires is resource-intensive and time-consuming, which makes Data Shapley hard to apply to large datasets. To address this, we propose the CHG (compound of Hardness and Gradient) utility function, which approximates the utility of each data subset on model performance in every training epoch. By deriving the closed-form Shapley value for each data point using the CHG utility function, we reduce the computational complexity to that of a single model retraining, achieving a quadratic improvement over existing marginal contribution-based methods. We further leverage CHG Shapley for real-time data selection, conducting experiments across three settings: standard datasets, datasets with label noise, and class-imbalanced datasets. These experiments demonstrate its effectiveness in identifying high-value and noisy data. By enabling efficient data valuation, CHG Shapley promotes trustworthy model training through a novel data-centric perspective. Our code is available at https://github.com/caihuaiguang/CHG-Shapley-for-Data-Valuation and https://github.com/caihuaiguang/CHG-Shapley-for-Data-Selection.
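To make the cost of marginal contribution-based valuation concrete, the sketch below computes exact Shapley values by enumerating every subset of a tiny dataset. It is not the CHG utility or the authors' implementation: `exact_shapley`, `toy_utility`, and `QUALITY` are hypothetical names, and the additive toy utility merely stands in for "model performance after training on a subset", which in real Data Shapley would require one retraining per call.

```python
from itertools import combinations
from math import comb

# Hypothetical per-point quality scores; the negative entry mimics a noisy sample.
QUALITY = [0.5, 0.3, -0.2, 0.1]


def toy_utility(subset):
    """Stand-in for model performance when training on `subset` (an assumption)."""
    return sum(QUALITY[j] for j in subset)


def exact_shapley(n, utility):
    """Exact Shapley values for n data points by enumerating all subsets.

    Evaluates n * 2**(n - 1) marginal contributions, so it is only
    feasible for very small n.
    """
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of a subset S that excludes point i
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of point i when added to S
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values


values = exact_shapley(len(QUALITY), toy_utility)
print(values)
```

Because the toy utility is additive, each point's Shapley value matches its own quality score, so the point with the negative score is the one flagged as harmful. The loop structure also shows why this approach does not scale: the number of utility evaluations grows exponentially with the dataset size, which is the cost a closed-form Shapley value avoids.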