🤖 AI Summary
Existing cross-modal adversarial attacks exhibit poor transferability when evaluating the robustness of vision-language pre-trained (VLP) models, primarily because they over-rely on model-specific features and local image regions. To address this, we propose the first model-agnostic, fine-grained cross-modal adversarial attack framework. Our method introduces two core innovations: (1) RScrop, a resizing-based sliding-crop mechanism that enhances the spatial diversity of perturbations; and (2) multi-granularity similarity disruption (MGSD), which jointly models and disrupts image-text semantic alignment at multiple granularities. The attack is gradient-based, generating pixel-level image perturbations without requiring access to the target model's architecture or internal features. Extensive experiments across diverse VLP models (e.g., CLIP, BLIP, ALPRO), benchmarks (Flickr30K, COCO), and downstream tasks (retrieval, VQA) demonstrate significant improvements in both attack success rate and cross-model transferability, achieving state-of-the-art performance. This work provides a more general and reliable tool for assessing the robustness of multimodal models.
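The summary describes RScrop only at a high level. Below is a minimal sketch of the resizing-plus-sliding-crop idea in PyTorch; the specific scales, crop size, and stride are illustrative assumptions rather than the paper's reported settings.

```python
# Minimal sketch of the RScrop idea: resize the adversarial image to several
# scales, then slide a fixed-size crop window over each scale to produce a set
# of spatially diverse views. Scales, crop size, and stride are assumptions.
import torch
import torch.nn.functional as F

def rscrop_views(image, scales=(1.0, 1.25, 1.5), crop=224, stride=112):
    """image: (1, 3, H, W) tensor with H, W >= crop; returns a list of views."""
    views = []
    for s in scales:
        # Bilinear resizing keeps the op differentiable, so attack gradients
        # can flow back to the pixel-level perturbation.
        resized = F.interpolate(image, scale_factor=s, mode="bilinear",
                                align_corners=False)
        _, _, rh, rw = resized.shape
        for top in range(0, rh - crop + 1, stride):
            for left in range(0, rw - crop + 1, stride):
                views.append(resized[:, :, top:top + crop, left:left + crop])
    return views

# Each view is scored by the surrogate VLP encoder, so the aggregated gradient
# draws on many regions and scales rather than a single model-favored crop.
x = torch.rand(1, 3, 280, 280, requires_grad=True)
print(len(rscrop_views(x)))  # 9 views for this input size and these settings
```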
📄 Abstract
Current adversarial attacks for evaluating the robustness of vision-language pre-trained (VLP) models in multi-modal tasks suffer from limited transferability: attacks crafted for a specific model often fail to generalize to other models, limiting their utility for assessing robustness more broadly. This is mainly attributed to over-reliance on model-specific features and regions, particularly in the image modality. In this paper, we propose an elegant yet highly effective method, termed Meticulous Adversarial Attack (MAA), that fully exploits model-independent characteristics and vulnerabilities of individual samples, achieving enhanced generalizability and reduced model dependence. MAA emphasizes fine-grained optimization of adversarial images by developing a novel resizing and sliding crop (RScrop) technique and incorporating a multi-granularity similarity disruption (MGSD) strategy. Extensive experiments across diverse VLP models, multiple benchmark datasets, and a variety of downstream tasks demonstrate that MAA significantly enhances the effectiveness and transferability of adversarial attacks. An extensive set of performance studies is conducted to generate insights into the effectiveness of different model configurations, guiding future advancements in this domain.
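To make the MGSD objective concrete, here is a minimal sketch of a multi-granularity similarity-disruption loss, assuming a CLIP-style surrogate that exposes both pooled embeddings and patch/token-level features. The feature shapes, the max-over-patches alignment score, and the equal weighting of the two terms are assumptions for illustration; the paper's exact formulation may differ.

```python
# Sketch of a multi-granularity similarity-disruption loss. Shapes and the
# equal weighting of the coarse and fine terms are illustrative assumptions.
import torch
import torch.nn.functional as F

def mgsd_loss(img_global, img_patches, txt_global, txt_tokens):
    """
    img_global:  (B, D)    pooled image embedding
    img_patches: (B, P, D) patch-level image features
    txt_global:  (B, D)    pooled text embedding
    txt_tokens:  (B, T, D) token-level text features
    Returns a scalar the attacker minimizes to push matched pairs apart.
    """
    # Coarse granularity: global image-text cosine similarity.
    coarse = F.cosine_similarity(img_global, txt_global, dim=-1)  # (B,)

    # Fine granularity: each text token's best-matching patch similarity,
    # averaged over tokens (a common fine-grained alignment score).
    p = F.normalize(img_patches, dim=-1)
    t = F.normalize(txt_tokens, dim=-1)
    sim = torch.bmm(t, p.transpose(1, 2))       # (B, T, P)
    fine = sim.max(dim=-1).values.mean(dim=-1)  # (B,)

    # Driving both terms down disrupts alignment at both granularities.
    return (coarse + fine).mean()
```

In a full attack loop, this loss would be minimized with respect to an L-infinity-bounded pixel perturbation via iterative gradient steps (e.g., PGD-style updates), aggregated over the RScrop views of the image.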