DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) are highly vulnerable to imperceptible adversarial perturbations, which cause semantic misinterpretation and undermine deployment reliability. To address this, we propose DiffCAP, a diffusion-based cumulative adversarial purification method. The approach cumulatively injects random Gaussian noise into the perturbed image and stops once the VLM embeddings of two consecutive noisy images reach a predefined similarity threshold, which reduces sensitivity to hyperparameters and the diffusion time required. A pretrained diffusion model then denoises the stabilized image before it is passed to the VLM. Evaluation across six datasets, three VLMs, and three task scenarios shows that the method consistently outperforms existing defenses by a substantial margin under varying attack strengths.

📝 Abstract
Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to perturbations poses a significant threat to their reliability in real-world applications. Despite often being imperceptible to humans, these perturbations can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs. We observe that adding minimal noise to an adversarially corrupted image significantly alters its latent embedding with respect to VLMs. Building on this insight, DiffCAP cumulatively injects random Gaussian noise into adversarially perturbed input data. This process continues until the embeddings of two consecutive noisy images reach a predefined similarity threshold, indicating a potential approach to neutralize the adversarial effect. Subsequently, a pretrained diffusion model is employed to denoise the stabilized image, recovering a clean representation suitable for the VLMs to produce an output. Through extensive experiments across six datasets with three VLMs under varying attack strengths in three task scenarios, we show that DiffCAP consistently outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Equipped with strong theoretical and empirical support, DiffCAP provides a robust and practical solution for securely deploying VLMs in adversarial environments.
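
The loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names accumulate_noise and purify, the additive noise update with a fixed sigma, the use of cosine similarity, and the default values are all assumptions made for the example. The embed and denoise arguments are hypothetical callables standing in for the VLM image encoder and the pretrained diffusion denoiser.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def accumulate_noise(image, embed, sigma=0.05, threshold=0.95, max_steps=50, seed=0):
    """Inject Gaussian noise step by step until consecutive embeddings agree.

    embed is a hypothetical callable mapping an image array to a 1-D vector.
    sigma, threshold, and max_steps are illustrative defaults, not values
    taken from the paper.
    Returns (noisy_image, number_of_injections, last_similarity).
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(image, dtype=np.float64)

    # First injection gives the first noisy image to compare against.
    noisy = noisy + sigma * rng.standard_normal(noisy.shape)
    prev_emb = embed(noisy)
    steps, sim = 1, -1.0

    while steps < max_steps:
        # Noise is added on top of the already-noisy image (cumulative).
        noisy = noisy + sigma * rng.standard_normal(noisy.shape)
        steps += 1
        emb = embed(noisy)
        sim = cosine_similarity(prev_emb, emb)
        if sim >= threshold:
            break  # consecutive embeddings are similar enough: stop
        prev_emb = emb

    return noisy, steps, sim


def purify(image, embed, denoise, **kwargs):
    """Noise until the embedding stabilizes, then hand off to a denoiser.

    denoise is a hypothetical callable (noisy_image, steps) -> clean_image
    that would wrap a pretrained diffusion model in practice.
    """
    noisy, steps, _ = accumulate_noise(image, embed, **kwargs)
    return denoise(noisy, steps)
```

In this sketch the number of injections is decided per image by the similarity check rather than fixed in advance, which is how the abstract motivates the reduced tuning effort and shorter diffusion time.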

Problem

Research questions and friction points this paper is trying to address.

Neutralizing adversarial corruptions in Vision Language Models
Reducing hyperparameter tuning complexity in defense techniques
Accelerating denoising process for secure VLM deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

DiffCAP uses diffusion-based purification for VLMs
Cumulative Gaussian noise neutralizes adversarial corruptions
Pretrained diffusion model denoises stabilized images
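
To make "cumulative" concrete, one way to write the noising step is shown below. The additive update with a fixed per-step scale \sigma, the similarity function \mathrm{sim}, the encoder f, and the threshold \tau are notation assumed for this illustration and may differ from the paper's own formulation. Under that assumption, independent Gaussian increments add in variance, so after k injections the total noise has standard deviation \sigma\sqrt{k}.

```latex
\begin{align*}
x_0 &= x_{\mathrm{adv}}, \qquad
x_k = x_{k-1} + \sigma\,\epsilon_k, \quad \epsilon_k \sim \mathcal{N}(0, I) \ \text{i.i.d.} \\
x_k &= x_{\mathrm{adv}} + \sigma \sum_{i=1}^{k} \epsilon_i
      \;\sim\; \mathcal{N}\!\left(x_{\mathrm{adv}},\, k\sigma^2 I\right) \\
k^\ast &= \min\left\{\, k \ge 2 \;:\; \mathrm{sim}\!\left(f(x_k),\, f(x_{k-1})\right) \ge \tau \,\right\}
\end{align*}
```

The stabilized image x_{k^\ast} is then the input to the pretrained diffusion model's denoising pass.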