🤖 AI Summary
Vision-language pretraining (VLP) models exhibit insufficient adversarial robustness on image-text joint tasks, particularly under cross-modal joint perturbations. This work proposes a novel gradient-based multimodal adversarial attack grounded in contrastive learning that simultaneously generates imperceptible adversarial images and texts. Crucially, it introduces a joint optimization framework that unifies cross-modal (image-text) and intra-modal contrastive losses, substantially enhancing the transferability of adversarial examples across model architectures in black-box settings. Evaluated on image-text retrieval and visual entailment tasks, the method significantly outperforms both unimodal and state-of-the-art multimodal attacks, achieving an average transfer success rate improvement of 27.6% over existing approaches and demonstrating strong cross-architecture generalization in practical adversarial scenarios.
📝 Abstract
The integration of visual and textual data in Vision-Language Pre-training (VLP) models is crucial for enhancing vision-language understanding. However, the adversarial robustness of these models, especially in the alignment of image-text features, has not yet been sufficiently explored. In this paper, we introduce a novel gradient-based multimodal adversarial attack method, underpinned by contrastive learning, to improve the transferability of multimodal adversarial examples in VLP models. The method concurrently generates adversarial texts and images under imperceptible perturbations, employing both image-text and intra-modal contrastive losses. We evaluate the effectiveness of our approach on image-text retrieval and visual entailment tasks, using publicly available datasets in a black-box setting. Extensive experiments show a significant improvement over both existing single-modal transfer-based adversarial attack methods and current multimodal adversarial attack approaches.
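To make the joint objective concrete, below is a minimal sketch of the image branch of such an attack: PGD-style gradient ascent on a combined cross-modal (image-text) and intra-modal (image-image) InfoNCE loss. This is an illustration under stated assumptions, not the paper's implementation: the encoder interfaces, the `8/255` budget, step size, temperature, and loss weight `lam` are all placeholder choices, and the discrete text-perturbation branch is omitted.

```python
# Minimal sketch (illustrative only; not the authors' code). Perturbs images
# with PGD so that a joint cross-modal + intra-modal contrastive loss is
# maximized, pushing adversarial image embeddings away from their paired
# text embeddings and from the clean image embeddings.
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE over a batch: each anchor's positive is the same-index row."""
    logits = anchors @ positives.t() / temperature        # (B, B) similarities
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, labels)

def pgd_multimodal_attack(image_encoder, text_encoder, images, token_ids,
                          eps=8 / 255, alpha=2 / 255, steps=10, lam=1.0):
    """Image branch of a contrastive multimodal attack (assumed interfaces)."""
    with torch.no_grad():
        txt = F.normalize(text_encoder(token_ids), dim=-1)   # paired text features
        clean = F.normalize(image_encoder(images), dim=-1)   # clean image features

    adv = (images + torch.empty_like(images).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        img = F.normalize(image_encoder(adv), dim=-1)
        # Maximizing InfoNCE breaks the image-text and image-image alignment
        # the model relies on; lam weights the intra-modal term.
        loss = info_nce(img, txt) + lam * info_nce(img, clean)
        grad, = torch.autograd.grad(loss, adv)
        adv = adv.detach() + alpha * grad.sign()              # gradient-ascent step
        adv = (images + (adv - images).clamp(-eps, eps)).clamp(0, 1)  # L_inf project
    return adv.detach()

if __name__ == "__main__":
    # Toy stand-in encoders; any CLIP-style image/text encoder pair would do.
    B, D = 4, 32
    img_enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, D))
    txt_enc = torch.nn.Embedding(1000, D)
    adv = pgd_multimodal_attack(img_enc, txt_enc,
                                torch.rand(B, 3, 32, 32),
                                torch.randint(0, 1000, (B,)))
    print(adv.shape)  # torch.Size([4, 3, 32, 32])
```

In the full method summarized above, an analogous gradient signal would also drive the adversarial text generation, and transferability in the black-box setting comes from crafting the perturbations on a surrogate VLP model and applying them to unseen architectures.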