🤖 AI Summary
This work shows that existing approximate machine unlearning methods fail to remove the effects of data poisoning across several attack types (indiscriminate, targeted, and a newly proposed Gaussian poisoning attack) and across both image classifiers and large language models. Even with a relatively large compute budget, these methods offer limited benefit over full retraining, which risks creating a false sense of security. To characterize unlearning efficacy precisely, the authors evaluate across multiple attack types and model architectures, propose new poisoning-based metrics for unlearning effectiveness, and empirically benchmark mainstream approaches (e.g., gradient updates, influence function approximation, subset retraining). The results indicate that current methods, which lack provable guarantees, behave unreliably in practice. The paper advocates a more rigorous, scenario-driven evaluation of unlearning and positions its benchmarks as a starting point for research on trustworthy machine unlearning.
📝 Abstract
We revisit the efficacy of several practical methods for approximate machine unlearning developed for large-scale deep learning. In addition to complying with data deletion requests, one often-cited potential application for unlearning methods is to remove the effects of poisoned data. We experimentally demonstrate that, while existing unlearning methods have been shown to be effective in a number of settings, they fail to remove the effects of data poisoning across a variety of poisoning attacks (indiscriminate, targeted, and a newly introduced Gaussian poisoning attack) and models (image classifiers and LLMs), even when granted a relatively large compute budget. To precisely characterize unlearning efficacy, we introduce new evaluation metrics for unlearning based on data poisoning. Our results suggest that a broader perspective, including a wider variety of evaluations, is required to avoid a false sense of confidence in machine unlearning procedures for deep learning without provable guarantees. Moreover, while unlearning methods show some signs of being useful for efficiently removing poisoned data without retraining, our work suggests that these methods are not yet "ready for prime time" and currently provide limited benefit over retraining.