Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models exhibit insufficient robustness against adversarial image perturbations. To address this, the paper proposes the “Double Visual Defense” framework, comprising two stages: (1) end-to-end adversarial vision-language joint pre-training, and (2) adversarial visual instruction tuning. The former is the first large-scale adversarial multimodal pre-training driven by web-scale data; the latter augments instruction tuning with adversarial samples to improve zero-shot generalization and multi-task robustness. The resulting ΔCLIP achieves roughly 20% higher adversarial accuracy on ImageNet-1k than the previous best models. Δ²LLaVA delivers robustness gains of ~30% on image captioning and ~20% on visual question answering, while simultaneously mitigating hallucination, strengthening reasoning capabilities, and improving zero-shot recognition performance.

📝 Abstract
This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel “double visual defense” to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, ΔCLIP and Δ²LLaVA, show substantially enhanced zero-shot robustness and set a new state of the art in adversarial defense for vision-language models. For example, the adversarial robustness of ΔCLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, Δ²LLaVA brings a ~30% robustness improvement to the image captioning task and a ~20% robustness improvement to the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is https://doublevisualdefense.github.io/.
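Both stages of the defense rely on generating adversarial image perturbations during training. The paper does not spell out its attack configuration here, so the following is only a minimal sketch of the standard primitive behind such pipelines: a PGD (projected gradient descent) L∞ attack, shown on a toy linear scorer rather than an actual CLIP or LLaVA model. All names and parameters are illustrative.

```python
import numpy as np

def pgd_attack(x, y, w, eps=0.03, alpha=0.01, steps=10):
    """L-infinity PGD on a toy linear scorer (illustrative stand-in
    for a vision-language loss; not the paper's actual setup).

    Loss ascended: L(x) = -y * (w @ x), i.e. we push the score of the
    correct label y in {-1, +1} down while staying in an eps-ball.
    """
    x_adv = x.copy()
    for _ in range(steps):
        grad = -y * w                             # dL/dx for the linear scorer
        x_adv = x_adv + alpha * np.sign(grad)     # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep valid pixel range
    return x_adv
```

In an adversarial training loop, each clean batch is replaced (or augmented) with `pgd_attack` outputs before the usual gradient update, which is the mechanism the "adversarial sample augmentation" stage builds on.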
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Robustness
Adversarial Image Perturbations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Double Visual Defense
Adversarial Robustness
Vision-Language Models