🤖 AI Summary
This paper reveals a critical vulnerability of vision-language models (VLMs) to adversarial perturbations in the frequency domain, which undermine both authenticity detection and automatic image captioning. Method: We propose a novel frequency-domain attack that applies small, structured perturbations in Fourier space, preserving semantic content and spatial-domain imperceptibility while achieving black-box transferability across models (five mainstream VLMs) and datasets (ten diverse image benchmarks). Contribution/Results: Our systematic evaluation demonstrates that such perturbations consistently degrade VLM performance across tasks, exposing an overreliance on non-semantic, high-frequency cues. All tested VLMs exhibit significant accuracy drops, confirming a widespread lack of frequency-domain robustness. This work establishes frequency-space analysis as a new dimension for VLM security assessment and provides empirical grounding for developing defenses against spectral-domain threats.
📝 Abstract
Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning, including image captioning and DeepFake detection. In this work, we expose a critical vulnerability of VLMs to subtle, structured perturbations in the frequency domain. Specifically, we show how these feature transformations undermine authenticity/DeepFake detection and automated image captioning. We design targeted image transformations operating in the frequency domain that systematically alter VLM outputs on both real and synthetic images. We demonstrate that our perturbation-injection method generalizes across five state-of-the-art VLMs, including Qwen2/2.5 and BLIP models at different parameter scales. Experiments across ten real and generated image datasets reveal that VLM judgments are sensitive to frequency-based cues and do not fully align with semantic content. Crucially, we show that visually imperceptible spatial-frequency transformations expose the fragility of VLMs deployed for automated image captioning and authenticity detection. Our findings, obtained under realistic black-box constraints, challenge the reliability of VLMs and underscore the need for robust multimodal perception systems.
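To make the idea concrete, the following is a minimal sketch of a Fourier-space perturbation of the kind described above. The abstract does not specify the exact perturbation design, so every detail here (the perturbation budget `eps`, the low-frequency cutoff `radius_frac`, and the multiplicative-noise form) is an illustrative assumption, not the paper's method:

```python
import numpy as np

def frequency_perturb(image, eps=0.03, radius_frac=0.25, seed=0):
    """Perturb the high-frequency Fourier components of a grayscale image.

    NOTE: eps, radius_frac, and the multiplicative-noise form are
    illustrative assumptions; the paper's actual attack may differ.
    """
    h, w = image.shape
    # 2D FFT, shifted so low frequencies sit at the center of the spectrum.
    f = np.fft.fftshift(np.fft.fft2(image))

    # Mask selecting frequencies outside a central low-frequency disc,
    # so semantic (low-frequency) content is left untouched.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    high_freq = dist > radius_frac * min(h, w)

    # Structured perturbation: rescale high-frequency coefficients by
    # small random factors in [1 - eps, 1 + eps].
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-eps, eps, size=f.shape)
    f_adv = f * (1.0 + noise * high_freq)

    # Back to the spatial domain; clip to a valid pixel range.
    adv = np.fft.ifft2(np.fft.ifftshift(f_adv)).real
    return np.clip(adv, 0.0, 1.0)
```

Because only high-frequency coefficients are scaled, and only by a few percent, the spatial-domain change stays visually imperceptible while the spectral statistics that VLMs may latch onto are altered.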