🤖 AI Summary
This work addresses the vulnerability of collaborative machine learning to strategic manipulation: participants may submit duplicated or noisy data to inflate their valuations and rewards, or to prevent others from benefiting, which undermines both fairness and truthfulness. The paper proposes an incentive mechanism for Bayesian models that is, according to the authors, the first to provably ensure collaborative fairness (F) and incentivize truthfulness (T) at equilibrium. The mechanism combines semivalues (such as the Shapley value), which ensure fairness, with a truthful data valuation function (DVF) computed on a validation set that is unknown to the participants. Because semivalues depend on other participants' data, an additional condition is introduced to prove that a participant maximizes its expected data values and semivalues by submitting a dataset that captures its true knowledge. The paper also discusses suitable relaxations of (F) and (T) for settings where the mediator lacks a validation set or has a limited reward budget. The theoretical findings are validated on synthetic and real-world datasets.
📝 Abstract
Collaborative machine learning involves training high-quality models using datasets from multiple sources. To incentivize sources to share data, existing data valuation methods fairly reward each source based on its data submitted as is. However, as these methods neither verify nor incentivize data truthfulness, the sources can manipulate their data (e.g., by submitting duplicated or noisy data) to artificially increase their valuations and rewards or to prevent others from benefiting. This paper presents the first mechanism that provably ensures (F) collaborative fairness and incentivizes (T) truthfulness at equilibrium for Bayesian models. Our mechanism combines semivalues (e.g., the Shapley value), which ensure fairness, and a truthful data valuation function (DVF) based on a validation set that is unknown to the sources. As semivalues are influenced by others' data, we introduce an additional condition to prove that a source can maximize its expected data values in coalitions and semivalues by submitting a dataset that captures its true knowledge. Additionally, we discuss the implications and suitable relaxations of (F) and (T) when the mediator has a limited budget for rewards or lacks a validation set. Our theoretical findings are validated on synthetic and real-world datasets.
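To make the manipulation problem concrete, the following is a minimal sketch of the exact Shapley value (one example of a semivalue) over a small set of sources. The valuation function `count_points` is a deliberately naive, hypothetical stand-in that rewards raw data volume; it is not the validation-set-based DVF described in the abstract, and the names `shapley_values`, `make_count_valuation`, and the toy datasets are assumptions made for illustration only.

```python
from itertools import combinations
from math import factorial


def shapley_values(sources, value_fn):
    """Exact Shapley value of each source under value_fn(coalition) -> float.

    Enumerates every coalition, so it is only practical for a handful of sources.
    """
    names = list(sources)
    n = len(names)
    phi = {}
    for i in names:
        others = [j for j in names if j != i]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude source i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                marginal = value_fn(coalition + (i,)) - value_fn(coalition)
                total += weight * marginal
        phi[i] = total
    return phi


def make_count_valuation(datasets):
    """Hypothetical naive valuation: a coalition is worth its number of data points."""
    def count_points(coalition):
        return float(sum(len(datasets[s]) for s in coalition))
    return count_points


# Toy datasets (illustrative only).
honest_data = {"A": [1, 2, 3], "B": [4, 5], "C": [6]}
# Source A submits every point twice instead of its true dataset.
manipulated_data = {**honest_data, "A": honest_data["A"] * 2}

honest = shapley_values(honest_data, make_count_valuation(honest_data))
manipulated = shapley_values(manipulated_data, make_count_valuation(manipulated_data))

print("honest:     ", honest)
print("manipulated:", manipulated)
```

Under this naive valuation, duplicating data raises source A's Shapley value even though no new information was contributed. This is the kind of manipulation that the abstract's truthful DVF, evaluated on a validation set unknown to the sources, is meant to discourage.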