🤖 AI Summary
This work addresses the challenge of generating adversarial images that fool multimodal large language models (MLLMs) while preserving semantic fidelity. We propose adversarial-guided diffusion (AGD), which injects target semantics into the noise prior of a diffusion model, yielding adversarial perturbations with full-spectrum characteristics that balance fine-grained controllability and robustness to high-frequency filtering. By integrating this adversarial-guided noise into the diffusion inversion process, our method efficiently generates semantically preserved adversarial examples. Experiments demonstrate that our approach achieves significantly higher attack success rates than state-of-the-art methods across multiple mainstream MLLMs. Moreover, it exhibits superior robustness against common defenses, including input transformations and feature purification. This work establishes a novel paradigm for the security evaluation of MLLMs, advancing both the effectiveness and resilience of adversarial attacks in multimodal settings.
📝 Abstract
This paper addresses the challenge of generating adversarial images with a diffusion model that deceive multimodal large language models (MLLMs) into producing targeted responses, while avoiding significant distortion of the clean image. To address these challenges, we propose an adversarial-guided diffusion (AGD) approach for adversarially attacking MLLMs. We introduce adversarial-guided noise to ensure attack efficacy. A key observation in our design is that, unlike most traditional adversarial attacks, which embed high-frequency perturbations directly into the clean image, AGD injects target semantics into the noise component of the reverse diffusion process. Since the noise added in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded within it inherits this full-spectrum property. Importantly, during reverse diffusion, the adversarial image is formed as a linear combination of the clean image and the noise. Consequently, when defenses such as simple low-pass filtering act independently on each component, the adversarial signal carried by the noise component is less likely to be suppressed, as it is not confined to the high-frequency band. This makes AGD inherently robust to a variety of defenses. Extensive experiments demonstrate that AGD outperforms state-of-the-art methods in attack performance as well as in robustness against common defenses.
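The frequency argument above can be illustrated with a minimal NumPy sketch (an illustrative toy, not the paper's method): a white-noise perturbation, like the diffusion noise prior, spreads its energy across all frequency bins, so a low-pass filter leaves a substantial fraction of it intact, whereas a perturbation confined to the high-frequency band, as in classic pixel-space attacks, is removed almost entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024  # 1-D signal length, standing in for a flattened image

def lowpass(x, keep):
    """Crude low-pass filter: zero out all rFFT bins at or above index `keep`."""
    X = np.fft.rfft(x)
    X[keep:] = 0
    return np.fft.irfft(X, n=n)

def retained_energy(x, keep):
    """Fraction of the signal's energy that survives the low-pass filter."""
    y = lowpass(x, keep)
    return float(np.sum(y**2) / np.sum(x**2))

# Full-spectrum perturbation: white Gaussian noise, like the diffusion noise prior.
white = rng.standard_normal(n)

# High-frequency perturbation: white noise with its low-frequency bins removed,
# mimicking perturbations embedded directly into the clean image.
X = np.fft.rfft(rng.standard_normal(n))
X[: n // 4] = 0
hi_freq = np.fft.irfft(X, n=n)

keep = n // 8  # low-pass cutoff, below the band where hi_freq lives
print(f"white-noise energy kept:          {retained_energy(white, keep):.2f}")
print(f"high-freq perturbation energy kept: {retained_energy(hi_freq, keep):.2e}")
```

In expectation the white-noise signal keeps roughly `keep / (n/2)` of its energy (about 0.25 here), while the band-limited high-frequency perturbation is suppressed to numerical zero, matching the intuition that a full-spectrum adversarial signal is harder to filter out.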