🤖 AI Summary
This paper reveals a critical vulnerability of vision-language models (VLMs) to adversarial perturbations in the frequency domain, which undermine both authenticity detection and automatic image captioning. Method: We propose a novel frequency-domain attack that applies small, structured perturbations in Fourier space, preserving semantic content and spatial-domain imperceptibility while achieving black-box transferability across models (five mainstream VLMs) and datasets (ten diverse image benchmarks). Contribution/Results: Our systematic evaluation demonstrates that such perturbations consistently degrade VLM performance across tasks, exposing an overreliance on non-semantic, high-frequency cues. All tested VLMs exhibit significant accuracy drops, confirming a widespread lack of frequency-domain robustness. This work establishes frequency-space analysis as a new dimension for VLM security assessment and provides empirical grounding for developing defenses against spectral-domain threats.
📝 Abstract
Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning, including image captioning and DeepFake detection. In this work, we expose a critical vulnerability of VLMs to subtle, structured perturbations in the frequency domain. Specifically, we show how these feature transformations undermine authenticity/DeepFake detection and automated image captioning. We design targeted image transformations operating in the frequency domain that systematically alter VLM outputs on both real and synthetic images. We demonstrate that our perturbation-injection method generalizes across five state-of-the-art VLMs, including Qwen2/2.5 and BLIP models at different parameter scales. Experiments across ten real and generated image datasets reveal that VLM judgments are sensitive to frequency-based cues and do not fully align with semantic content. Crucially, we show that visually imperceptible spatial-frequency transformations expose the fragility of VLMs deployed for automated image captioning and authenticity detection. Our findings, obtained under realistic black-box constraints, challenge the reliability of VLMs and underscore the need for robust multimodal perception systems.
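To make the idea concrete, the following is a minimal sketch of a Fourier-space perturbation of the kind described above. The abstract does not specify the exact perturbation design, so every detail here (the perturbation budget `eps`, the low-frequency cutoff `radius_frac`, and the multiplicative-noise form) is an illustrative assumption, not the paper's method:

```python
import numpy as np

def frequency_perturb(image, eps=0.03, radius_frac=0.25, seed=0):
    """Perturb the high-frequency Fourier components of a grayscale image.

    NOTE: eps, radius_frac, and the multiplicative-noise form are
    illustrative assumptions; the paper's actual attack may differ.
    """
    h, w = image.shape
    # 2D FFT, shifted so low frequencies sit at the center of the spectrum.
    f = np.fft.fftshift(np.fft.fft2(image))

    # Mask selecting frequencies outside a central low-frequency disc,
    # so semantic (low-frequency) content is left untouched.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    high_freq = dist > radius_frac * min(h, w)

    # Structured perturbation: rescale high-frequency coefficients by
    # small random factors in [1 - eps, 1 + eps].
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-eps, eps, size=f.shape)
    f_adv = f * (1.0 + noise * high_freq)

    # Back to the spatial domain; clip to a valid pixel range.
    adv = np.fft.ifft2(np.fft.ifftshift(f_adv)).real
    return np.clip(adv, 0.0, 1.0)
```

Because only high-frequency coefficients are scaled, and only by a few percent, the spatial-domain change stays visually imperceptible while the spectral statistics that VLMs may latch onto are altered.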