HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models

📅 2026-04-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing black-box adversarial attacks on vision-language pre-trained models either rely on iterative cross-search strategies that consume many queries, or reduce the similarity of positive image-text pairs without accounting for negative pairs, which limits their effectiveness. This work proposes HQA-VLAttack, a two-stage framework that generates adversarial examples in a black-box setting where only model predictions are accessible. In the text stage, it builds the substitute word set from counter-fitting word vectors, so that substitutes stay semantically consistent with the original words. In the image stage, it initializes the adversarial image with a layer-importance guided strategy and then optimizes the perturbation with contrastive learning, decreasing the similarity of positive image-text pairs while increasing that of negative pairs. Experiments on three benchmark datasets show that HQA-VLAttack significantly outperforms strong baselines in attack success rate.
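
As a rough illustration of the text stage, the sketch below builds a substitute word set by nearest-neighbour lookup in an embedding space. It is not the paper's implementation: the vectors in `TOY_VECTORS` are made-up stand-ins for counter-fitted embeddings, and `substitute_set`, `k`, and `min_sim` are hypothetical names and thresholds chosen for this example.

```python
import math

# Hypothetical, hand-made vectors standing in for counter-fitted embeddings.
# In a counter-fitted space, synonyms sit close together and antonyms far apart.
TOY_VECTORS = {
    "happy": [0.9, 0.1, 0.0],
    "glad": [0.85, 0.15, 0.05],
    "joyful": [0.8, 0.2, 0.1],
    "sad": [-0.9, 0.1, 0.0],
    "table": [0.0, 0.1, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def substitute_set(word, vectors, k=2, min_sim=0.8):
    """Return up to k neighbours of `word` whose similarity is at least min_sim."""
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    # Drop candidates that are not close enough to count as substitutes.
    scored = [(sim, other) for sim, other in scored if sim >= min_sim]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


candidates = substitute_set("happy", TOY_VECTORS)
```

With these toy vectors, the antonym "sad" and the unrelated word "table" fall below the similarity threshold, which is the property that makes such a space useful for picking semantically consistent substitutes.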

📝 Abstract
Black-box adversarial attack on vision-language pre-trained models is a practical and challenging task, as text and image perturbations need to be considered simultaneously, and only the predicted results are accessible. Research on this problem is in its infancy, and only a handful of methods are available. Nevertheless, existing methods either rely on a complex iterative cross-search strategy, which inevitably consumes numerous queries, or only consider reducing the similarity of positive image-text pairs but ignore that of negative ones, which will also be implicitly diminished, thus inevitably affecting the attack performance. To alleviate the above issues, we propose a simple yet effective framework to generate high-quality adversarial examples on vision-language pre-trained models, named HQA-VLAttack, which consists of text and image attack stages. For text perturbation generation, it leverages the counter-fitting word vector to generate the substitute word set, thus guaranteeing the semantic consistency between the substitute word and the original word. For image perturbation generation, it first initializes the image adversarial example via the layer-importance guided strategy, and then utilizes contrastive learning to optimize the image adversarial perturbation, which ensures that the similarity of positive image-text pairs is decreased while that of negative image-text pairs is increased. In this way, the optimized adversarial images and texts are more likely to retrieve negative examples, thereby enhancing the attack success rate. Experimental results on three benchmark datasets demonstrate that HQA-VLAttack significantly outperforms strong baselines in terms of attack success rate.
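
The contrastive idea in the image stage can be sketched with a small numeric example. The abstract does not give the exact loss, so the InfoNCE-style form below is an assumption, as are the names `attack_objective` and `tau` and the two-dimensional toy embeddings. The value is the log-probability assigned to the positive texts; an attacker would try to lower it, which pushes similarity to positive texts down and similarity to negative texts up at the same time.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def attack_objective(image_emb, positive_embs, negative_embs, tau=0.1):
    """Assumed InfoNCE-style objective (lower is better for the attacker).

    Returns log(sum_pos / (sum_pos + sum_neg)), where each term is
    exp(cosine(image, text) / tau). It decreases when the image moves
    away from positive texts or towards negative texts.
    """
    pos = sum(math.exp(cosine(image_emb, t) / tau) for t in positive_embs)
    neg = sum(math.exp(cosine(image_emb, t) / tau) for t in negative_embs)
    return math.log(pos / (pos + neg))


# Toy embeddings (hypothetical): the clean image matches its caption,
# the perturbed image has drifted towards an unrelated caption.
positive_texts = [[1.0, 0.0]]
negative_texts = [[0.0, 1.0]]
clean_image = [1.0, 0.0]
adversarial_image = [0.2, 1.0]

clean_value = attack_objective(clean_image, positive_texts, negative_texts)
adv_value = attack_objective(adversarial_image, positive_texts, negative_texts)
```

In this toy setup the perturbed image yields a lower objective than the clean one, which corresponds to an image that is more likely to retrieve a negative caption.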
Problem

Research questions and friction points this paper is trying to address.

adversarial attack
vision-language pre-trained models
black-box
multimodal perturbation
attack success rate
Innovation

Methods, ideas, or system contributions that make the work stand out.

adversarial attack
vision-language models
counter-fitting word vectors
contrastive learning
black-box attack