🤖 AI Summary
Data markets that reward contributors by data volume are vulnerable to manipulation, inviting an influx of fabricated or low-quality submissions. Existing incentive schemes based on distributional comparison rely on strong parametric assumptions, such as Gaussianity, which limits their applicability. This paper proposes a reward mechanism built on a nonparametric two-sample test inspired by the Cramér–von Mises statistic: it requires no prior distributional assumptions and guarantees that (approximately) truthful reporting is a Nash equilibrium in both Bayesian and prior-free settings. The approach combines nonparametric statistical testing, game-theoretic modeling, and incentive mechanism design. The mechanism is instantiated with theoretical guarantees in three canonical data-sharing scenarios, and experiments on real-world text and image data show that it promotes genuine data submission and suppresses data forgery.
📝 Abstract
Modern data marketplaces and data-sharing consortia increasingly rely on incentive mechanisms to encourage agents to contribute data. However, schemes that reward agents based on the quantity of submitted data are vulnerable to manipulation, as agents may submit fabricated or low-quality data to inflate their rewards. Prior work has proposed comparing each agent's data against others' to promote honesty: when others contribute genuine data, the best way to minimize discrepancy is to do the same. Yet prior implementations of this idea rely on very strong assumptions about the data distribution (e.g., Gaussian), limiting their applicability. In this work, we develop reward mechanisms based on a novel two-sample test inspired by the Cramér–von Mises statistic. Our methods strictly incentivize agents to submit more genuine data, while disincentivizing data fabrication and other types of untruthful reporting. We establish that truthful reporting constitutes a (possibly approximate) Nash equilibrium in both Bayesian and prior-agnostic settings. We theoretically instantiate our method in three canonical data-sharing problems and show that it relaxes key assumptions made by prior work. Empirically, we demonstrate that our mechanism incentivizes truthful data sharing via simulations and on real-world language and image data.
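To make the core quantity concrete, here is a minimal sketch that computes the two-sample Cramér–von Mises statistic (in Anderson's 1962 form) and plugs it into a toy discrepancy-based payment. The function names, the exponential payment shape, and the simulated agents are illustrative assumptions for exposition; they are not the paper's actual mechanism or payment rule.

```python
import numpy as np

def cvm_two_sample(x, y):
    """Two-sample Cramér–von Mises statistic (Anderson, 1962).

    Sums the squared gap between the empirical CDFs of x and y over the
    pooled sample, scaled by n*m / (n+m)^2. Larger values mean a larger
    distributional discrepancy; no parametric assumptions are needed.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    n, m = len(x), len(y)
    pooled = np.concatenate([x, y])
    # Empirical CDFs of each sample, evaluated at every pooled point.
    F = np.searchsorted(x, pooled, side="right") / n
    G = np.searchsorted(y, pooled, side="right") / m
    return n * m / (n + m) ** 2 * np.sum((F - G) ** 2)

def reward(agent_data, peer_data, budget=1.0):
    """Hypothetical payment rule: pay more when the agent's report is
    distributionally close to the peers' pooled data. The exponential
    shape is an illustrative choice, not the paper's mechanism."""
    return budget * np.exp(-cvm_two_sample(agent_data, peer_data))

# Toy simulation (assumed setup): an honest report drawn from the same
# distribution as the peers' data earns more than a fabricated report
# drawn from a shifted distribution.
rng = np.random.default_rng(0)
peers = rng.normal(size=500)
honest = rng.normal(size=200)
forged = rng.normal(loc=3.0, size=200)
assert reward(honest, peers) > reward(forged, peers)
```

Because the statistic depends only on the ranks of the pooled sample, it is distribution-free in one dimension (SciPy's `scipy.stats.cramervonmises_2samp` computes the same statistic with p-values); how the paper extends the comparison to language and image data is not specified in the abstract.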