Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of machine unlearning methods for text-to-image diffusion models to black-box attackers who recover erased concepts via adversarial prompts. Existing attacks either require access to model weights or produce unnatural, easily detectable prompts. To overcome these limitations, the authors propose BEAP, a black-box, embedding-aware adversarial prompting attack that requires no access to model weights. BEAP uses a large language model to iteratively refine prompts in text space, pairing an embedding-aware search with a multi-objective reward that jointly scores the presence of the unlearned concept, image-text alignment, and image quality. The resulting prompts are natural, readable, and able to bypass rule-based safety filters. In experiments, BEAP improves the attack success rate by more than 60% over existing methods and requires an average of only 15 prompts per successful attack.

📝 Abstract

Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model: they either assume access to the model weights or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities. BEAP performs an embedding-aware search in text space, combining multiple reward signals (unlearned concept presence, text-image alignment, and image quality) to refine generated prompts. Unlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images. Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature.
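
The abstract describes an iterative loop in which an LLM proposes prompts and three reward signals guide refinement. It does not give the algorithm itself, so the following is only a minimal sketch of what such a loop might look like, not the authors' implementation. Every name here (`RewardWeights`, `combined_reward`, `search_prompt`, the `propose`/`generate`/`score` callables), the weighted-sum combination, the weights, and the success threshold are assumptions made for illustration. The default budget of 15 merely echoes the reported average number of prompts.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Scores = Tuple[float, float, float]  # (concept presence, text-image alignment, image quality)


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract does not say how the signals are combined.
    concept: float = 0.5
    alignment: float = 0.3
    quality: float = 0.2


def combined_reward(scores: Scores, w: RewardWeights) -> float:
    """Assumed combination rule: a plain weighted sum of the three signals."""
    concept, alignment, quality = scores
    return w.concept * concept + w.alignment * alignment + w.quality * quality


def search_prompt(
    seed: str,
    propose: Callable[[str, List[Tuple[str, float]]], str],  # stand-in for the LLM rewriter
    generate: Callable[[str], Any],                          # stand-in for the black-box model
    score: Callable[[str, Any], Scores],                     # stand-in for the three scorers
    weights: RewardWeights = RewardWeights(),
    budget: int = 15,
    success_threshold: float = 0.8,
) -> Tuple[str, int, bool]:
    """Return (prompt, queries_used, succeeded) after at most `budget` model queries."""
    history: List[Tuple[str, float]] = []
    best_prompt, best_reward = seed, float("-inf")
    for queries_used in range(1, budget + 1):
        # The proposer sees the best prompt so far plus all past (prompt, reward) pairs.
        candidate = propose(best_prompt, history)
        image = generate(candidate)  # one black-box query; no weights or gradients needed
        scores = score(candidate, image)
        reward = combined_reward(scores, weights)
        history.append((candidate, reward))
        if reward > best_reward:
            best_prompt, best_reward = candidate, reward
        if scores[0] >= success_threshold:  # concept-presence score alone decides success here
            return candidate, queries_used, True
    return best_prompt, budget, False
```

In this sketch the only interaction with the target model is the `generate` call, which is what makes the setting black-box. Any real use would need to supply the three callables; how the paper's embedding-aware search shapes the proposals is not specified in this summary and is left entirely to `propose`.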

Problem

Research questions and friction points this paper is trying to address.

machine unlearning

text-to-image diffusion models

black-box attack

adversarial prompting

concept removal

Innovation

Methods, ideas, or system contributions that make the work stand out.

black-box attack

embedding-aware prompting

machine unlearning