🤖 AI Summary
This paper identifies a hidden cost in data valuation: when conventional methods (e.g., the Shapley value) are used directly as a payout mechanism, contributors whose data has near-zero marginal value receive nothing, even though that data still had to be collected and assessed. To formalize this cost, the paper proposes the *Information Disclosure Game* between a Data Union (DU), a member-run agent representing contributors, and a Data Consumer, and analyzes how progressive disclosure strategies affect cost allocation under differential privacy. Methodologically, it combines the Laplace noise mechanism, data Shapley value computation, and multi-armed bandit exploration to simulate disclosure strategies. Simulations on a Yelp review helpfulness prediction task show that data valuation inherently incurs an explicit acquisition cost and that the DU's collective disclosure policy changes how this cost is distributed across members.
📝 Abstract
Data valuation methods assign marginal utility to each data point that has contributed to the training of a machine learning model. If used directly as a payout mechanism, this creates a hidden cost of valuation: contributors with near-zero marginal value receive nothing, even though their data had to be collected and assessed. To better formalize this cost, we introduce a conceptual and game-theoretic model, the Information Disclosure Game, played between two parties: a Data Union (DU, sometimes also called a data trust), which is a member-run agent representing contributors, and a Data Consumer (e.g., a platform). After first aggregating members' data, the DU releases information progressively by adding Laplace noise under a differentially private mechanism. Through simulations with strategies guided by data Shapley values and multi-armed bandit exploration, we demonstrate on a Yelp review helpfulness prediction task that data valuation inherently incurs an explicit acquisition cost and that the DU's collective disclosure policy changes how this cost is distributed across members.
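The hidden cost described above can be illustrated with a minimal sketch. It is not the paper's implementation: the three contributors, the `INFO` table, and the coverage-based `utility` function are hypothetical stand-ins for a real model-performance metric, and the Shapley values are computed exactly by enumerating all orderings, which is only feasible for a handful of contributors.

```python
from itertools import permutations
from math import factorial

# Hypothetical toy data: each contributor supplies a set of distinct "facts".
# Contributor "c" was collected and evaluated but adds nothing new.
INFO = {"a": {1, 2}, "b": {3}, "c": set()}


def utility(coalition):
    """Assumed utility: number of distinct facts covered by the coalition."""
    covered = set().union(*(INFO[p] for p in coalition))
    return float(len(covered))


def shapley_values(players, utility_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += utility_fn(with_p) - utility_fn(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


if __name__ == "__main__":
    values = shapley_values(list(INFO), utility)
    for player, value in values.items():
        print(f"{player}: payout share {value:.2f}")
```

If these values were used directly as payouts, contributor `c` would receive nothing even though its data had to be gathered and scored to learn that its value is zero, which is the cost the abstract refers to.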