🤖 AI Summary
Existing cross-modal adversarial attacks exhibit poor transferability when evaluating the robustness of vision-language pre-trained (VLP) models, primarily because they over-rely on model-specific features and local image regions. To address this, we propose the first model-agnostic, fine-grained cross-modal adversarial attack framework. Our method introduces two core innovations: (1) RScrop, a resizing-based sliding-crop mechanism that enhances the spatial diversity of perturbations; and (2) multi-granularity similarity disruption (MGSD), which jointly models and disrupts image-text semantic alignment at multiple granularities. The attack is gradient-based, generating pixel-level image perturbations without requiring access to the target model's architecture or internal features. Extensive experiments across diverse VLP models (e.g., CLIP, BLIP, ALPRO), benchmarks (Flickr30K, COCO), and downstream tasks (retrieval, VQA) demonstrate significant improvements in both attack success rate and cross-model transferability, achieving state-of-the-art performance. This work provides a more general and reliable tool for assessing the robustness of multimodal models.
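The summary describes RScrop only at a high level. Below is a minimal sketch of the resizing-plus-sliding-crop idea in PyTorch; the specific scales, crop size, and stride are illustrative assumptions rather than the paper's reported settings.

```python
# Minimal sketch of the RScrop idea: resize the adversarial image to several
# scales, then slide a fixed-size crop window over each scale to produce a set
# of spatially diverse views. Scales, crop size, and stride are assumptions.
import torch
import torch.nn.functional as F

def rscrop_views(image, scales=(1.0, 1.25, 1.5), crop=224, stride=112):
    """image: (1, 3, H, W) tensor with H, W >= crop; returns a list of views."""
    views = []
    for s in scales:
        # Bilinear resizing keeps the op differentiable, so attack gradients
        # can flow back to the pixel-level perturbation.
        resized = F.interpolate(image, scale_factor=s, mode="bilinear",
                                align_corners=False)
        _, _, rh, rw = resized.shape
        for top in range(0, rh - crop + 1, stride):
            for left in range(0, rw - crop + 1, stride):
                views.append(resized[:, :, top:top + crop, left:left + crop])
    return views

# Each view is scored by the surrogate VLP encoder, so the aggregated gradient
# draws on many regions and scales rather than a single model-favored crop.
x = torch.rand(1, 3, 280, 280, requires_grad=True)
print(len(rscrop_views(x)))  # 9 views for this input size and these settings
```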
📄 Abstract
Current adversarial attacks for evaluating the robustness of vision-language pre-trained (VLP) models in multi-modal tasks suffer from limited transferability: attacks crafted for a specific model often fail to generalize to other models, limiting their utility for assessing robustness more broadly. This is mainly attributed to over-reliance on model-specific features and regions, particularly in the image modality. In this paper, we propose an elegant yet highly effective method, termed Meticulous Adversarial Attack (MAA), that fully exploits model-independent characteristics and vulnerabilities of individual samples, achieving enhanced generalizability and reduced model dependence. MAA emphasizes fine-grained optimization of adversarial images by developing a novel resizing and sliding crop (RScrop) technique and incorporating a multi-granularity similarity disruption (MGSD) strategy. Extensive experiments across diverse VLP models, multiple benchmark datasets, and a variety of downstream tasks demonstrate that MAA significantly enhances the effectiveness and transferability of adversarial attacks. An extensive set of performance studies is conducted to generate insights into the effectiveness of different model configurations, guiding future advancements in this domain.
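To make the MGSD objective concrete, here is a minimal sketch of a multi-granularity similarity-disruption loss, assuming a CLIP-style surrogate that exposes both pooled embeddings and patch/token-level features. The feature shapes, the max-over-patches alignment score, and the equal weighting of the two terms are assumptions for illustration; the paper's exact formulation may differ.

```python
# Sketch of a multi-granularity similarity-disruption loss. Shapes and the
# equal weighting of the coarse and fine terms are illustrative assumptions.
import torch
import torch.nn.functional as F

def mgsd_loss(img_global, img_patches, txt_global, txt_tokens):
    """
    img_global:  (B, D)    pooled image embedding
    img_patches: (B, P, D) patch-level image features
    txt_global:  (B, D)    pooled text embedding
    txt_tokens:  (B, T, D) token-level text features
    Returns a scalar the attacker minimizes to push matched pairs apart.
    """
    # Coarse granularity: global image-text cosine similarity.
    coarse = F.cosine_similarity(img_global, txt_global, dim=-1)  # (B,)

    # Fine granularity: each text token's best-matching patch similarity,
    # averaged over tokens (a common fine-grained alignment score).
    p = F.normalize(img_patches, dim=-1)
    t = F.normalize(txt_tokens, dim=-1)
    sim = torch.bmm(t, p.transpose(1, 2))       # (B, T, P)
    fine = sim.max(dim=-1).values.mean(dim=-1)  # (B,)

    # Driving both terms down disrupts alignment at both granularities.
    return (coarse + fine).mean()
```

In a full attack loop, this loss would be minimized with respect to an L-infinity-bounded pixel perturbation via iterative gradient steps (e.g., PGD-style updates), aggregated over the RScrop views of the image.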