🤖 AI Summary
This work addresses the vulnerability of existing machine unlearning methods to black-box attackers who reconstruct deleted concepts via adversarial prompts. Current attacks either rely on model weights or produce unnatural, easily detectable prompts. To overcome these limitations, we propose BEAP, a black-box, embedding-aware adversarial prompting attack that requires no access to model weights. BEAP uses a large language model to iteratively optimize prompts in text space, combining an embedding-aware search with a multi-objective reward that jointly scores the presence of the unlearned concept, image-text alignment, and image quality. The resulting prompts are natural, readable, and capable of bypassing rule-based safety filters. Experiments demonstrate that BEAP improves the attack success rate by more than 60% over existing methods and needs only 15 queries on average per successful attack, improving both stealth and effectiveness.
📝 Abstract
Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced that make the model generate such unlearned concepts. These attacks, however, do not assume a realistic threat model: they either assume access to the model weights or produce gibberish adversarial prompts that even naive rule-based safeguards could easily detect. We aim to address this gap in this paper.
We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities.
BEAP performs an embedding-aware search in text space, combining multiple reward signals (unlearned concept presence, text-image alignment, and image quality) to refine the generated prompts.
Unlike previous attack methods, BEAP keeps its prompts undetectable by safety filters while producing high-quality images.
Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack.
Warning: This paper contains model outputs that may be offensive or upsetting in nature.
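The abstract names three reward signals and an iterative, query-limited search, but it does not say how the signals are combined or how candidates are proposed. The sketch below is therefore only an illustration under stated assumptions: a weighted sum is assumed as the combination rule, the weights and the stopping threshold are arbitrary, and `propose` and `score` are hypothetical placeholders for the LLM rewriter and the black-box scoring pipeline. None of these names or values come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class RewardWeights:
    # Illustrative weights only; the abstract does not give any values.
    concept: float = 0.5
    alignment: float = 0.3
    quality: float = 0.2


def combined_reward(concept: float, alignment: float, quality: float,
                    w: RewardWeights = RewardWeights()) -> float:
    """Assumed combination rule: a weighted sum of the three signals."""
    return w.concept * concept + w.alignment * alignment + w.quality * quality


History = List[Tuple[str, float]]


def search(seed: str,
           propose: Callable[[History], str],
           score: Callable[[str], Tuple[float, float, float]],
           max_queries: int = 15,
           threshold: float = 0.8) -> Tuple[str, float, int]:
    """Generic query-limited refinement loop over candidate strings.

    `propose` maps the scored history to a new candidate (hypothetical
    stand-in for the rewriter); `score` returns the three signals for a
    candidate (hypothetical stand-in for the black-box evaluation).
    """
    best, best_reward = seed, combined_reward(*score(seed))
    queries = 1
    history: History = [(best, best_reward)]
    while queries < max_queries and best_reward < threshold:
        candidate = propose(history)
        reward = combined_reward(*score(candidate))
        queries += 1
        history.append((candidate, reward))
        if reward > best_reward:
            best, best_reward = candidate, reward
    return best, best_reward, queries
```

The loop stops either when the combined reward crosses the threshold or when the query budget is exhausted, which is one simple way to account for a per-attack query count like the one reported above.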