M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

📅 2025-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models risk generating harmful or copyright-infringing content, yet existing concept erasure methods focus solely on textual prompts and neglect critical multimodal input scenarios, such as image editing and personalized generation, leading to severe defense failures under non-textual inputs. This work introduces the first multimodal concept erasure benchmark for diffusion models, covering three input modalities (text prompts, learned embeddings, and inverted latent variables) across five white-box and black-box attack settings. It proposes IRECE, a plug-and-play module that uses cross-attention to localize and perturb latent variables associated with the target concept, enabling robust inference-time defense, and introduces the Concept Reproduction Rate (CRR) as a quantitative evaluation metric. Experiments show that state-of-the-art erasure methods suffer CRR above 90% under latent-variable white-box attacks, whereas IRECE reduces CRR by up to 40% even in the most stringent setting, without compromising visual fidelity.
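The modality and access combinations above can be enumerated as a quick sketch: text prompts count as one setting, while each of the two non-text modalities is evaluated under both white-box and black-box access. The labels below are paraphrases for illustration, not names from the paper.

```python
# Sketch of the five evaluation scenarios: one text setting plus
# {learned embedding, inverted latent} x {white-box, black-box}.
non_text_modalities = ["learned embedding", "inverted latent"]
scenarios = ["text prompt"] + [
    f"{modality} ({access})"
    for modality in non_text_modalities
    for access in ("white-box", "black-box")
]
print(len(scenarios))  # five scenarios in total
```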

📝 Abstract
Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together, M-ErasureBench and IRECE offer practical safeguards for building safer and more reliable generative models.
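The CRR metric described in the abstract can be read as a simple hit rate over attack-generated images. A minimal sketch, assuming a boolean concept detector; the detector interface is our assumption, not something the paper specifies:

```python
from typing import Callable, Sequence

def concept_reproduction_rate(
    images: Sequence[object],
    detects_concept: Callable[[object], bool],
) -> float:
    """Fraction of generated images in which the erased concept is detected.

    A higher CRR means the erasure defense failed more often. The
    `detects_concept` callable is a hypothetical stand-in for a concept
    classifier run on each generated image.
    """
    if not images:
        return 0.0
    hits = sum(1 for image in images if detects_concept(image))
    return hits / len(images)

# Toy example: 9 of 10 attack outputs still show the concept.
outputs = list(range(10))
crr = concept_reproduction_rate(outputs, lambda i: i < 9)
```

Under this reading, the paper's ">90% CRR" finding means that more than nine in ten attack attempts reproduced the supposedly erased concept.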
Problem

Research questions and friction points this paper is trying to address.

Evaluates concept erasure methods across text, embeddings, and latents
Addresses vulnerabilities where erased concepts re-emerge via non-text inputs
Proposes a module to enhance robustness against multimodal attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal benchmark evaluates text, embeddings, and latents
Plug-and-play module localizes concepts via cross-attention
Perturbs latents during denoising to reduce concept reproduction
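The two IRECE bullets above can be sketched as one denoising-step operation: use a cross-attention map for the target concept's token to localize latent regions, then inject noise only there. This is an illustrative approximation assuming a precomputed attention map; the thresholding and Gaussian-noise scheme are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def perturb_concept_latents(latents, attn_map, threshold=0.5, noise_scale=0.3, seed=0):
    """Perturb latent regions where the concept token's attention is high.

    latents:  (C, H, W) latent tensor at the current denoising step.
    attn_map: (H, W) cross-attention weights for the target concept's token.
    """
    rng = np.random.default_rng(seed)
    # Normalize attention to [0, 1] and threshold it into a spatial mask.
    a = (attn_map - attn_map.min()) / (np.ptp(attn_map) + 1e-8)
    mask = (a >= threshold).astype(latents.dtype)  # (H, W)
    # Inject noise only at masked locations, across all channels.
    noise = rng.normal(scale=noise_scale, size=latents.shape)
    return latents + noise * mask[None, :, :]

latents = np.zeros((4, 8, 8))
attn = np.zeros((8, 8))
attn[2:4, 2:4] = 1.0  # concept localized to one spatial patch
out = perturb_concept_latents(latents, attn)
```

Leaving unmasked regions untouched is what preserves visual fidelity in this picture: only the locations the concept token attends to are disrupted.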