🤖 AI Summary
Existing black-box video adversarial attacks heavily rely on numerous model queries, making them infeasible for large-scale Video Large Language Models (Video-LLMs); moreover, no prior work directly perturbs the video feature space at the feature map level. Method: We propose the first query-free, feature-map-driven stealthy black-box attack. Leveraging pre-trained models, our method transfers video feature maps and applies targeted perturbations directly in the feature space—without any interaction with the target model—to generate high-fidelity adversarial videos. It jointly optimizes feature transferability and spatiotemporal perceptual quality (measured via SSIM, PSNR, and temporal inconsistency). Contribution/Results: Our approach achieves >70% attack success rates on conventional video classifiers and, for the first time, effectively misleads Video-LLMs. By eliminating query dependency, it establishes a new benchmark for security evaluation of video foundation models.
📝 Abstract
The vulnerability of deep neural networks (DNNs) to adversarial examples has been well established. Existing black-box adversarial attacks typically require multi-round interaction with the target model and consume numerous queries, which is impractical in real-world settings and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the feature space of clean videos. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that alters the feature space of clean videos using information extracted from a DNN. Unlike query-based methods that rely on iterative interaction, FeatureFool mounts its attack directly on DNN-extracted features, an approach that is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70% against traditional video classifiers without a single query. Benefiting from the transferability of feature maps, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and temporal inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.
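To make the stealthiness metrics concrete, the sketch below computes two of the three quality measures the abstract names, PSNR and a frame-difference notion of temporal inconsistency, for a clean/adversarial video pair. This is an illustrative NumPy-only sketch: the exact metric definitions used by FeatureFool are not given here, so the `temporal_inconsistency` formulation (mean deviation between clean and adversarial frame-to-frame residuals) is an assumption, and the toy random videos are placeholders.

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two videos (higher = closer)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10 * np.log10(peak**2 / mse))

def temporal_inconsistency(clean: np.ndarray, adv: np.ndarray) -> float:
    """Hypothetical metric: how much the perturbation disturbs frame-to-frame
    motion, measured as the mean absolute deviation between the clean and
    adversarial temporal residuals (lower = more temporally consistent)."""
    d_clean = np.diff(clean.astype(np.float64), axis=0)  # frame t+1 - frame t
    d_adv = np.diff(adv.astype(np.float64), axis=0)
    return float(np.mean(np.abs(d_adv - d_clean)))

# Toy example: an 8-frame 32x32 grayscale video with a small perturbation.
rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(8, 32, 32)).astype(np.uint8)
noise = rng.integers(-2, 3, size=clean.shape)  # stand-in for an attack delta
adv = np.clip(clean.astype(int) + noise, 0, 255).astype(np.uint8)

print(f"PSNR: {psnr(clean, adv):.1f} dB")
print(f"Temporal inconsistency: {temporal_inconsistency(clean, adv):.3f}")
```

A perturbation this small yields a PSNR well above 40 dB, illustrating why such attacks are barely perceptible; SSIM would be computed analogously (e.g. via `skimage.metrics.structural_similarity`) per frame and averaged.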