🤖 AI Summary
Vision-language models exhibit insufficient robustness against adversarial image perturbations. To address this, we propose the "Double Visual Defense" framework, comprising two stages: (1) end-to-end adversarial vision-language joint pre-training, and (2) adversarial visual instruction tuning. The former is the first large-scale adversarial multimodal pre-training driven by web-scale data; the latter augments instruction tuning with adversarial examples to improve zero-shot generalization and multi-task robustness. The resulting ΔCLIP achieves roughly 20% higher adversarial robustness on ImageNet-1k than the previous best models. Δ²LLaVA demonstrates robustness gains of ~30% on image captioning and ~20% on visual question answering, while simultaneously mitigating hallucination, strengthening reasoning capabilities, and improving zero-shot recognition performance.
📝 Abstract
This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, ΔCLIP and Δ²LLaVA, show substantially enhanced zero-shot robustness and set a new state of the art in adversarial defense for vision-language models. For example, the adversarial robustness of ΔCLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, Δ²LLaVA brings a ~30% robustness improvement to the image captioning task and a ~20% robustness improvement to the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is https://doublevisualdefense.github.io/.
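Both defense stages rely on generating adversarially perturbed images on the fly during training. The paper does not specify its attack implementation here, but a standard choice for this kind of training is projected gradient descent (PGD) under an L∞ bound. The following is a minimal, model-agnostic sketch on a toy linear logistic classifier (the gradient is computed analytically); the function name, step sizes, and model are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

def pgd_attack(x, y, w, eps=8/255, alpha=2/255, steps=10):
    """Illustrative PGD attack under an L-infinity bound.

    x : clean input in [0, 1] (flattened "image" vector)
    y : binary label (0.0 or 1.0)
    w : weights of a toy linear logistic model p = sigmoid(w @ x)
    """
    x_adv = x.copy()
    for _ in range(steps):
        # gradient of the binary cross-entropy loss w.r.t. the input
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv)))
        grad = (p - y) * w
        # gradient *ascent* step on the loss, sign-normalized
        x_adv = x_adv + alpha * np.sign(grad)
        # project back into the eps-ball around x, keep valid pixel range
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

In adversarial training, each clean batch would be replaced (or augmented) by such perturbed inputs before the usual loss is computed; in the contrastive pre-training stage the same idea applies with the CLIP image-text loss in place of the toy logistic loss.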