On the Usage of Gaussian Process for Efficient Data Valuation

📅 2025-06-04
🤖 AI Summary
This paper addresses data valuation: quantifying the influence of individual data points on model performance in machine learning. It proposes an efficient valuation framework based on Gaussian processes (GPs). Methodologically, GPs are used to model the utility of sub-models (models trained on a subset of the training set), and Bayesian inference is combined with a canonical decomposition of valuation methods to enable rapid, incremental estimation of sample contributions. The key contributions are: (1) a canonical decomposition that lets any data valuation method be analyzed as a utility function combined with an aggregation procedure; and (2) the use of GPs to provide both theoretical grounding, via probabilistic modeling of the utility, and computational efficiency, via closed-form posterior updates. Experiments demonstrate that the approach accelerates valuation by over an order of magnitude compared to full retraining, while achieving high rank correlation with Leave-One-Out and other baselines in influence ranking. The framework thus enables near real-time, principled assessment of data value.

📝 Abstract
In machine learning, knowing the impact of a given datum on model training is a fundamental task referred to as Data Valuation. Building on prior work, we design a novel canonical decomposition that allows practitioners to analyze any data valuation method as the combination of two parts: a utility function that captures characteristics of a given model, and an aggregation procedure that merges this information. We also propose using Gaussian Processes to easily access the utility function on "sub-models", which are models trained on a subset of the training set. The strength of our approach stems from both its theoretical grounding in Bayesian theory and its practical reach: efficient update formulae enable fast estimation of valuations.
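The two-part view described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy utility (negative squared error of a mean estimator), the helper names `make_toy_utility`, `leave_one_out`, and `exact_shapley`, and the data are all assumptions chosen for illustration. The point is only that Leave-One-Out and Shapley-style values share the same utility function and differ in the aggregation procedure applied to it.

```python
from itertools import permutations
from typing import Callable, FrozenSet, List, Sequence

# A utility maps a subset of training indices to a score for the
# sub-model trained on that subset (hypothetical type alias).
Utility = Callable[[FrozenSet[int]], float]


def make_toy_utility(y: Sequence[float], target: float) -> Utility:
    """Toy utility (an assumption): the 'model' predicts the mean of the
    selected points, scored by negative squared error against `target`."""
    def utility(subset: FrozenSet[int]) -> float:
        if not subset:
            return -(target ** 2)  # empty sub-model predicts 0
        mean = sum(y[i] for i in subset) / len(subset)
        return -((mean - target) ** 2)
    return utility


def leave_one_out(utility: Utility, n: int) -> List[float]:
    """Aggregation 1: value of i = U(all points) - U(all points except i)."""
    full = frozenset(range(n))
    u_full = utility(full)
    return [u_full - utility(full - {i}) for i in range(n)]


def exact_shapley(utility: Utility, n: int) -> List[float]:
    """Aggregation 2: average marginal contribution over all orderings.
    Exponential cost, so only usable for very small n."""
    values = [0.0] * n
    orderings = list(permutations(range(n)))
    for ordering in orderings:
        subset: FrozenSet[int] = frozenset()
        previous = utility(subset)
        for i in ordering:
            subset = subset | {i}
            current = utility(subset)
            values[i] += current - previous
            previous = current
    return [v / len(orderings) for v in values]


if __name__ == "__main__":
    y = [1.0, 1.2, 0.8, 5.0]  # the last point is an outlier
    u = make_toy_utility(y, target=1.0)
    print("LOO:    ", leave_one_out(u, len(y)))
    print("Shapley:", exact_shapley(u, len(y)))
```

Both aggregations only ever call `utility` on subsets of indices, which is why a cheap surrogate for the utility on sub-models would speed up either one.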
Problem

Research questions and friction points this paper is trying to address.

Efficiently valuing data impact on model training
Decomposing data valuation into utility and aggregation
Using Gaussian Processes for fast valuation estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel canonical decomposition for data valuation
Gaussian Processes for utility function access
Efficient update formulae for fast estimation
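The paper's own update formulae are not reproduced on this page. As a rough illustration of why closed-form Gaussian updates are cheap, the sketch below applies the standard Gaussian conditioning identity one observation at a time, with no matrix inversion. Everything in it is an assumption made for illustration: the scalar inputs, the RBF kernel, the noise variance, and the helper names `rbf_kernel`, `gp_prior`, and `condition_on_observation`.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def rbf_kernel(a: float, b: float, length_scale: float = 1.0) -> float:
    """Squared-exponential kernel on scalar inputs (illustrative choice)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def gp_prior(points: Vector) -> Tuple[Vector, Matrix]:
    """Zero-mean GP prior evaluated on a fixed, finite set of points."""
    mean = [0.0 for _ in points]
    cov = [[rbf_kernel(p, q) for q in points] for p in points]
    return mean, cov


def condition_on_observation(
    mean: Vector, cov: Matrix, j: int, y: float, noise_var: float = 1e-2
) -> Tuple[Vector, Matrix]:
    """Condition the current Gaussian on a noisy observation y at point j.

    Standard rank-one update:
        mean' = mean + cov[:, j] * (y - mean[j]) / (cov[j, j] + noise_var)
        cov'  = cov  - cov[:, j] cov[j, :]       / (cov[j, j] + noise_var)
    Cost is O(n^2) per observation.
    """
    n = len(mean)
    denom = cov[j][j] + noise_var
    gain = [cov[a][j] / denom for a in range(n)]
    residual = y - mean[j]
    new_mean = [mean[a] + gain[a] * residual for a in range(n)]
    new_cov = [
        [cov[a][b] - gain[a] * cov[j][b] for b in range(n)] for a in range(n)
    ]
    return new_mean, new_cov


if __name__ == "__main__":
    points = [0.0, 0.5, 1.0]
    mean, cov = gp_prior(points)
    # Incorporate two observations sequentially.
    mean, cov = condition_on_observation(mean, cov, 0, 1.0)
    mean, cov = condition_on_observation(mean, cov, 2, -0.5)
    print("posterior mean:", mean)
    print("posterior variances:", [cov[i][i] for i in range(len(points))])
```

Because each update reuses the current mean and covariance, a new observation refines the estimate without refitting from scratch, and the result does not depend on the order in which observations arrive.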
Clément Bénesse
Opsci.ai, Paris, France
Patrick Mesana
Université du Québec à Montréal, Montréal, Québec, Canada; HEC Montréal, Montréal, Québec, Canada
Athénaïs Gautier
COSMO - Stochastic Mine Planning Laboratory, Department of Mining and Materials Engineering, McGill University, Montreal, Quebec, Canada
Sébastien Gambs
Université du Québec à Montréal (UQAM)
Privacy, Security, Ethics of AI, Machine Learning