Machine Unlearning Fails to Remove Data Poisoning Attacks

📅 2024-06-25

🏛️ arXiv.org

📈 Citations: 6

✨ Influential: 1

🤖 AI Summary

This work shows that existing approximate machine unlearning methods fail to remove the effects of several kinds of data poisoning attacks (indiscriminate, targeted, and a newly proposed Gaussian poisoning attack) in both image classifiers and large language models, even when given a relatively large compute budget. The authors evaluate practical unlearning methods across these attack types and model families, and introduce new poisoning-based metrics for measuring unlearning efficacy. The results indicate that unlearning methods without provable guarantees can create a false sense of confidence and currently offer only limited benefit over retraining. The paper argues that a broader and more varied set of evaluations is needed before machine unlearning can be trusted for deep learning.

📝 Abstract

We revisit the efficacy of several practical methods for approximate machine unlearning developed for large-scale deep learning. In addition to complying with data deletion requests, one often-cited potential application for unlearning methods is to remove the effects of poisoned data. We experimentally demonstrate that, while existing unlearning methods have been shown to be effective in a number of settings, they fail to remove the effects of data poisoning across a variety of types of poisoning attacks (indiscriminate, targeted, and a newly introduced Gaussian poisoning attack) and models (image classifiers and LLMs), even when granted a relatively large compute budget. In order to precisely characterize unlearning efficacy, we introduce new evaluation metrics for unlearning based on data poisoning. Our results suggest that a broader perspective, including a wider variety of evaluations, is required to avoid a false sense of confidence in machine unlearning procedures for deep learning without provable guarantees. Moreover, while unlearning methods show some signs of being useful to efficiently remove poisoned data without having to retrain, our work suggests that these methods are not yet "ready for prime time," and currently provide limited benefit over retraining.

Problem

Research questions and friction points this paper is trying to address.

Evaluate machine unlearning efficacy against data poisoning attacks

Test unlearning methods on various poisoning types and models

Propose new metrics for assessing unlearning performance in poisoning scenarios
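
The idea of a poisoning-based metric can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes each poisoned sample is a clean input plus independent Gaussian noise, and that one checks for leftover influence by correlating that noise with a gradient taken at the clean input. The function names (`poison_score`, `mean_score`), the `leak` parameter, and the synthetic "gradient" are all hypothetical stand-ins; no real model is trained or unlearned here.

```python
import math
import random


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def poison_score(noise, grad, sigma):
    """Normalized inner product <noise, grad> / (sigma * ||grad||).

    If grad is independent of the noise, this behaves like a standard
    normal draw; a systematic positive shift suggests the model still
    'remembers' the noise it was trained on.
    """
    return dot(noise, grad) / (sigma * math.sqrt(dot(grad, grad)))


def mean_score(leak, n_points=500, dim=32, sigma=0.1, seed=0):
    """Average score over synthetic points.

    leak is a hypothetical knob: 0.0 mimics a model with no trace of the
    noise, larger values mimic gradients that still align with it.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_points):
        # Gaussian perturbation that would have been added to a clean input.
        noise = [rng.gauss(0.0, sigma) for _ in range(dim)]
        # Synthetic stand-in for an input gradient (assumption, not a model).
        base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        grad = [b + leak * xi / sigma for b, xi in zip(base, noise)]
        total += poison_score(noise, grad, sigma)
    return total / n_points


clean = mean_score(leak=0.0)   # stays near zero
leaky = mean_score(leak=0.5)   # shifts clearly above zero
print(f"no residual influence: {clean:+.3f}")
print(f"residual influence:    {leaky:+.3f}")
```

In a real evaluation the synthetic gradient would be replaced by quantities computed from the model after unlearning, and the resulting scores would be compared against those of a model retrained without the poisoned samples.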

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating unlearning with poisoning attack metrics

Testing unlearning on various poisoning attack types

Assessing the efficacy of unlearning methods that lack provable guarantees
