🤖 AI Summary
This work addresses the limitations of existing adversarial attacks on vision-language pre-trained models, which rely on static cross-modal interactions and only disrupt positive sample pairs, resulting in weak perturbation efficacy and poor transferability. To overcome these issues, we propose Semantic-Augmented Dynamic Contrastive Attack (SADCA), the first method to incorporate dynamic image-text interactions and a semantic-guided perturbation mechanism. SADCA progressively disrupts cross-modal alignment through contrastive learning and enhances the diversity and generalization of adversarial examples via input transformations. Extensive experiments demonstrate that SADCA significantly improves cross-task and cross-model attack transferability across multiple datasets and architectures, consistently outperforming state-of-the-art methods.
📝 Abstract
With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. Adversarial examples can often be crafted to transfer, attacking not only different models but also diverse tasks. However, existing attacks on VLP models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. To this end, SADCA establishes a contrastive learning mechanism over adversarial, positive, and negative samples, reinforcing the semantic inconsistency induced by the perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit attacks on VLP models, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.
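To make the contrastive idea in the abstract concrete, here is a minimal NumPy sketch of an InfoNCE-style misalignment objective an attacker could ascend: it penalizes similarity between an adversarial image embedding and its matched (positive) text embedding relative to unmatched (negative) texts. All function names, the temperature value, and the loss form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def contrastive_misalignment_loss(adv_emb, pos_emb, neg_embs, tau=0.07):
    """Toy InfoNCE-style objective (illustrative, not SADCA's exact loss).

    An attacker maximizing this value pushes the adversarial image
    embedding away from the matched text (positive) and toward
    unmatched texts (negatives), breaking cross-modal alignment.
    """
    def cos(a, b):
        # Cosine similarity between two embedding vectors.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos = np.exp(cos(adv_emb, pos_emb) / tau)
    negs = sum(np.exp(cos(adv_emb, n) / tau) for n in neg_embs)
    # Negative log-alignment: large when the adversarial embedding no
    # longer prefers its positive text over the negatives.
    return -np.log(pos / (pos + negs))

# Usage sketch: a misaligned embedding yields a higher loss value,
# i.e. the perturbation has succeeded in disrupting alignment.
pos_text = np.array([1.0, 0.0])
neg_texts = [np.array([0.0, 1.0])]
aligned_adv = np.array([1.0, 0.0])     # still matches its caption
misaligned_adv = np.array([0.0, 1.0])  # pulled toward a negative caption
```

In a real attack, this loss would be computed on encoder outputs and its gradient back-propagated to the image pixels (e.g. via PGD-style updates); the sketch only shows the scoring step.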