🤖 AI Summary
Adversarial prompts can bypass concept deletion and reconstruct forgotten content in machine unlearning (MU) models, posing serious security risks. Method: This paper proposes a zero-shot, intent-aware adversarial attack method that requires neither iterative optimization nor additional training; instead, it leverages an intent-driven prompt generation mechanism to flexibly specify attack targets and efficiently trigger forgotten concepts. Contribution/Results: According to the authors, this is the first zero-shot, intent-customizable attack against MU models, significantly enhancing both attack flexibility and efficiency. Extensive experiments across diverse unlearning scenarios demonstrate that the proposed method achieves higher attack success rates than state-of-the-art approaches while substantially reducing attack latency, thereby establishing a novel evaluation paradigm for assessing the security of MU models.
📝 Abstract
Machine unlearning (MU) removes specific data points or concepts from deep learning models to enhance privacy and prevent sensitive content generation. Adversarial prompts can exploit unlearned models to generate content containing removed concepts, posing a significant security risk. However, existing adversarial attack methods struggle to generate content that aligns with an attacker's intent, and they incur high computational costs to identify successful prompts. To address these challenges, we propose ZIUM, a Zero-shot Intent-aware adversarial attack on Unlearned Models, which enables flexible customization of target attack images to reflect an attacker's intent. Additionally, ZIUM supports zero-shot adversarial attacks without requiring further optimization for previously attacked unlearned concepts. Evaluation across various MU scenarios demonstrated ZIUM's effectiveness in customizing content based on user-intent prompts while achieving a superior attack success rate compared to existing methods. Moreover, its zero-shot adversarial attack significantly reduces the attack time for previously attacked unlearned concepts.
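Neither the summary nor the abstract gives pseudocode, but the efficiency contrast the paper claims can be sketched abstractly: baseline attacks optimize a prompt per target concept, while a zero-shot attack conditions a single generation step on the attacker's intent. This is a toy illustration only; the function names (`iterative_attack`, `zero_shot_attack`) and the vector arithmetic are hypothetical placeholders, not the authors' actual method or API.

```python
import random

def iterative_attack(score, dim=8, steps=200):
    # Baseline-style attack (hypothetical sketch): repeatedly perturb a
    # prompt embedding and keep improvements until the forgotten concept
    # resurfaces. Cost scales with the number of optimization steps.
    prompt = [0.0] * dim
    for _ in range(steps):
        candidate = [p + random.gauss(0, 0.1) for p in prompt]
        if score(candidate) > score(prompt):
            prompt = candidate
    return prompt

def zero_shot_attack(intent_embedding, guidance):
    # ZIUM-style attack (hypothetical sketch): a single intent-conditioned
    # step, with no per-concept optimization loop, which is why attack
    # latency drops for previously attacked unlearned concepts.
    return [i + g for i, g in zip(intent_embedding, guidance)]
```

The sketch only captures the structural difference the abstract describes: an O(steps) search loop versus a constant-time, intent-conditioned generation.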