🤖 AI Summary
This work addresses the limited generalizability of existing adversarial attacks on vision-language models, which typically rely on sample-specific perturbations and fail to transfer effectively to new data or scenarios. To overcome this, we propose HRA, a hierarchical optimization framework for universal multimodal adversarial attacks, and the first to achieve both high efficiency and strong generalization in vision-language settings. HRA jointly optimizes universal perturbations across the image and text modalities by decoupling perturbations from input images, leveraging ScMix data augmentation, applying temporally hierarchical gradient optimization, and generating text perturbations based on intra- and inter-sentence importance. Extensive experiments across multiple models, datasets, and downstream tasks demonstrate that HRA significantly outperforms current methods, achieving superior attack performance with strong transferability and efficiency.
📝 Abstract
Existing adversarial attacks for vision-language pre-training (VLP) models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose the Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. For the image modality, HRA refines the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, it hierarchically models textual importance by considering both intra- and inter-sentence contributions to identify globally influential words, which are then used as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superior transferability of the proposed universal multimodal attacks.
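To make the image-side idea concrete, the temporally hierarchical gradient step can be sketched as a universal-perturbation update that blends a momentum term (historical gradients) with a gradient probed at a lookahead point (an estimate of the future gradient). This is a minimal NumPy sketch under assumed hyperparameters; the function name, the blending scheme, and all constants are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def universal_perturbation_step(delta, images, grad_fn, velocity,
                                momentum=0.9, lookahead=0.5,
                                lr=0.01, eps=8 / 255):
    """One hierarchical update of a universal perturbation `delta`.

    Combines the historical gradient direction (the momentum buffer
    `velocity`) with an estimated future gradient, obtained by probing
    the loss gradient at a lookahead point along the current velocity.
    All names and hyperparameters are illustrative assumptions.
    """
    # Historical/present component: loss gradient averaged over the batch.
    grad_now = np.mean([grad_fn(img + delta) for img in images], axis=0)

    # Estimated future component: gradient at a lookahead point.
    probe = delta + lookahead * velocity
    grad_future = np.mean([grad_fn(img + probe) for img in images], axis=0)

    # Temporally hierarchical blend of past, present, and future signals.
    velocity = momentum * velocity + grad_now + grad_future

    # Ascend the loss and project back into the L-infinity epsilon-ball.
    delta = np.clip(delta + lr * np.sign(velocity), -eps, eps)
    return delta, velocity
```

In this sketch the same `delta` is shared across all inputs, which is what makes the perturbation universal; the lookahead probe plays the role of the "estimated future gradient" in the temporal hierarchy.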