AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion Models

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work uncovers a previously overlooked security vulnerability in image-to-image (I2I) diffusion models: adversarial input images can induce the generation of NSFW content without modifying the text prompt, thereby evading text-based safety filters. The authors propose AdvI2I, an adversarial image attack framework tailored to I2I diffusion models that optimizes a generator to craft adversarial input images, circumventing existing defenses such as Safe Latent Diffusion (SLD) without altering the text prompt. An adaptive variant, AdvI2I-Adaptive, further minimizes the resemblance between adversarial images and NSFW concept embeddings, improving stealth and robustness against potential countermeasures. Experiments demonstrate over 92% attack success rate across mainstream I2I models; the generated perturbations are imperceptible to human vision and effectively bypass state-of-the-art safeguards such as SLD.

📝 Abstract
Recent advances in diffusion models have significantly enhanced the quality of image synthesis, yet they have also introduced serious safety concerns, particularly the generation of Not Safe for Work (NSFW) content. Previous research has demonstrated that adversarial prompts can be used to generate NSFW content. However, such adversarial text prompts are often easily detectable by text-based filters, limiting their efficacy. In this paper, we expose a previously overlooked vulnerability: adversarial image attacks targeting Image-to-Image (I2I) diffusion models. We propose AdvI2I, a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without altering the text prompts. Furthermore, we introduce AdvI2I-Adaptive, an enhanced version that adapts to potential countermeasures and minimizes the resemblance between adversarial images and NSFW concept embeddings, making the attack more resilient against defenses. Through extensive experiments, we demonstrate that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards, highlighting the urgent need for stronger security measures to address the misuse of I2I diffusion models.
Problem

Research questions and friction points this paper is trying to address.

Adversarial image attacks on Image-to-Image diffusion models
Bypassing safety filters to generate NSFW content
Circumventing existing defense mechanisms without text modification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial image attacks bypass text filters
Generator optimizes images to induce NSFW content
Adaptive version minimizes NSFW embedding resemblance
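The core idea listed above can be illustrated with a minimal sketch: bounded perturbations are optimized so that the perturbed image's encoding moves toward a target concept embedding while staying visually imperceptible. This is an illustrative toy, not the paper's method: a linear map stands in for the diffusion model's image encoder, a random vector stands in for an NSFW concept embedding, and all names (`W`, `concept`, `epsilon`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_emb = 64, 16
W = rng.standard_normal((d_emb, d_img)) / np.sqrt(d_img)  # toy stand-in encoder

def encode(x):
    # Stand-in for the I2I model's image/latent encoder.
    return W @ x

image = rng.standard_normal(d_img)    # benign input image (flattened)
concept = rng.standard_normal(d_emb)  # stand-in "concept" embedding to steer toward
epsilon = 0.05                        # L_inf bound keeps the perturbation small

delta = np.zeros(d_img)
for _ in range(200):
    # Loss: ||encode(image + delta) - concept||^2; gradient is exact for a linear encoder.
    residual = encode(image + delta) - concept
    grad = 2.0 * W.T @ residual
    delta -= 0.01 * grad                       # gradient step toward the concept
    delta = np.clip(delta, -epsilon, epsilon)  # project back into the L_inf ball

loss_before = np.sum((encode(image) - concept) ** 2)
loss_after = np.sum((encode(image + delta) - concept) ** 2)
```

In the paper's actual setting the encoder is a diffusion pipeline and a generator network (rather than per-image optimization) produces the perturbation, but the objective has this same shape: reduce the distance to a concept embedding under an imperceptibility constraint.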
Yaopei Zeng
College of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802
Yuanpu Cao
Penn State University
Bochuan Cao
PhD student, The Pennsylvania State University
Trustworthy ML
Yurui Chang
Ph.D. at The Pennsylvania State University
Jinghui Chen
Assistant Professor of Information Sciences and Technology, Penn State University
Machine Learning · Trustworthy Machine Learning · Large Language Models
Lu Lin
College of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802