Adversarial Attacks on Data Attribution

📅 2024-09-09
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work exposes a serious vulnerability of data attribution methods in adversarial settings: attackers can maliciously inflate the attribution scores of specific training samples, thereby distorting data valuation and compensation mechanisms. The authors first formalize a threat model for data attribution, with explicit assumptions about the adversary's goal and capabilities, and then propose two attacks: the Shadow Attack, which leverages knowledge of the data distribution and derives adversarial perturbations via shadow training (a technique borrowed from membership inference attacks), and the fully black-box Outlier Attack, which exploits the inductive bias of many attribution methods toward outlier data points. In experiments on image classification and text generation tasks, the Shadow Attack inflates data-attribution-based compensation by at least 200%, while the Outlier Attack achieves inflation ranging from 185% to 643%. The code is publicly released.

๐Ÿ“ Abstract
Data attribution aims to quantify the contribution of individual training data points to the outputs of an AI model, which has been used to measure the value of training data and compensate data providers. Given the impact on financial decisions and compensation mechanisms, a critical question arises concerning the adversarial robustness of data attribution methods. However, there has been little to no systematic research addressing this issue. In this work, we aim to bridge this gap by detailing a threat model with clear assumptions about the adversary's goal and capabilities and proposing principled adversarial attack methods on data attribution. We present two methods, Shadow Attack and Outlier Attack, which generate manipulated datasets to inflate the compensation adversarially. The Shadow Attack leverages knowledge about the data distribution in the AI applications, and derives adversarial perturbations through "shadow training", a technique commonly used in membership inference attacks. In contrast, the Outlier Attack does not assume any knowledge about the data distribution and relies solely on black-box queries to the target model's predictions. It exploits an inductive bias present in many data attribution methods - outlier data points are more likely to be influential - and employs adversarial examples to generate manipulated datasets. Empirically, in image classification and text generation tasks, the Shadow Attack can inflate the data-attribution-based compensation by at least 200%, while the Outlier Attack achieves compensation inflation ranging from 185% to as much as 643%. Our implementation is ready at https://github.com/TRAIS-Lab/adversarial-attack-data-attribution.
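To make the Outlier Attack's intuition concrete, here is a minimal, illustrative sketch (not the paper's implementation): a gradient-similarity attribution score for a toy logistic model, and an FGSM-style input perturbation that pushes a training point in its own loss-increasing direction, i.e., toward being an outlier. All names, and the attribution estimator itself, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_w(w, x, y):
    # per-example gradient of the logistic loss w.r.t. the weights
    return (sigmoid(w @ x) - y) * x

def influence(w, x_tr, y_tr, x_te, y_te):
    # gradient-similarity attribution (a TracIn-style surrogate,
    # NOT the paper's exact estimator)
    return grad_w(w, x_tr, y_tr) @ grad_w(w, x_te, y_te)

d = 10
w = rng.normal(size=d)                      # stand-in for a trained model
x_train, y_train = rng.normal(size=d), 1.0  # training point to manipulate
x_test, y_test = rng.normal(size=d), 1.0    # point being attributed

# FGSM-style step on the *input*: move the training point in the direction
# that increases its own loss, making it more of an outlier
eps = 0.5
input_grad = (sigmoid(w @ x_train) - y_train) * w   # dL/dx
x_adv = x_train + eps * np.sign(input_grad)

base = influence(w, x_train, y_train, x_test, y_test)
inflated = influence(w, x_adv, y_train, x_test, y_test)
```

Whether the perturbed point's attribution actually rises depends on the model and the attribution estimator; note also that this sketch uses gradients for simplicity, whereas the paper's Outlier Attack is black-box and queries only the target model's predictions.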
Problem

Research questions and friction points this paper is trying to address.

Assessing adversarial robustness of data attribution methods
Developing attacks to manipulate data attribution outcomes
Evaluating impact on compensation mechanisms in AI models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the Shadow Attack, which derives adversarial perturbations via shadow training
Introduces the Outlier Attack, which relies only on black-box queries to the target model
Both attacks generate manipulated datasets that adversarially inflate compensation
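As a rough illustration of the shadow-training idea behind the first contribution (hypothetical names; the paper's actual procedure differs in detail): an adversary who can sample from the data distribution trains many small "shadow" models to estimate how an attribution score responds to candidate perturbations of a target point, then keeps the candidate the shadow models rate highest.

```python
import numpy as np

rng = np.random.default_rng(1)

def train(X, y, steps=200, lr=0.5):
    # tiny logistic-regression trainer used for each shadow model
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def shadow_attribution(point, label, X, y, X_val, y_val, runs=20):
    # estimate the point's attribution as the average drop in validation
    # loss when the point is added to a random half of the shadow data
    def val_loss(w):
        p = 1.0 / (1.0 + np.exp(-X_val @ w))
        return -np.mean(y_val * np.log(p + 1e-9) + (1 - y_val) * np.log(1 - p + 1e-9))
    deltas = []
    for _ in range(runs):
        idx = rng.random(len(y)) < 0.5
        w_with = train(np.vstack([X[idx], point]), np.append(y[idx], label))
        w_without = train(X[idx], y[idx])
        deltas.append(val_loss(w_without) - val_loss(w_with))
    return float(np.mean(deltas))

# toy distribution the adversary can sample from (illustrative stand-in)
d, n = 5, 200
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d)); y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)
X_val = rng.normal(size=(50, d)); y_val = (X_val @ w_true > 0).astype(float)

# score a few candidate perturbations of one target point and keep the best
target = X[0].copy()
candidates = [target + 0.3 * rng.normal(size=d) for _ in range(5)]
scores = [shadow_attribution(c, y[0], X[1:], y[1:], X_val, y_val) for c in candidates]
best = candidates[int(np.argmax(scores))]
```

This brute-force candidate search stands in for the paper's principled perturbation derivation; the point of the sketch is only that shadow models let the adversary evaluate attribution scores without touching the victim's training pipeline.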