🤖 AI Summary
This work addresses the inherent trade-off in vision-language models where enhancing adversarial robustness often degrades clean accuracy. The study reveals that adversarial robustness primarily stems from shallow network layers, driven by low-frequency spectral bias and input-insensitive attention mechanisms, while fine-tuning deeper layers harms generalization. To reconcile this tension, the authors propose R-Adapt, a framework that introduces lightweight adaptation modules only in the initial layers, freezes all pre-trained weights, and jointly optimizes robustness and accuracy without end-to-end training. R-Adapt supports multiple paradigms—including training-free, model-guided, and data-driven approaches—and establishes state-of-the-art performance across 18 datasets, significantly improving the adversarial robustness of large vision-language models such as LLaVA and Qwen-VL while preserving high clean accuracy.
📝 Abstract
Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt extends efficiently to large vision-language models (e.g., LLaVA and Qwen-VL), enhancing their robustness. Our project page is available at https://summu77.github.io/R-Adapt.
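To make the freeze-and-adapt design concrete, here is a minimal, hypothetical sketch of the core idea: all pre-trained layer weights stay frozen, and small residual adapters are attached only to the first few (shallow) layers. All class and function names, shapes, and the low-rank adapter form are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

class Adapter:
    """Low-rank residual adapter: x + x @ A @ B (only A, B would be trainable).

    B is zero-initialized, so the adapter is an identity map at the start and
    the adapted model initially matches the frozen backbone exactly.
    """
    def __init__(self, dim, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.01, size=(dim, rank))  # trainable
        self.B = np.zeros((rank, dim))                     # trainable, zero-init

    def __call__(self, x):
        return x + x @ self.A @ self.B


class FrozenLayer:
    """Stand-in for a pre-trained block; its weights are never updated."""
    def __init__(self, dim, seed):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=1.0 / np.sqrt(dim), size=(dim, dim))

    def __call__(self, x):
        return np.tanh(x @ self.W)


def build_model(num_layers=12, dim=8, shallow_k=3):
    layers = [FrozenLayer(dim, seed=i) for i in range(num_layers)]
    # Adapters only in the shallow layers (indices < shallow_k), following
    # the paper's finding that robustness is localized in early layers.
    adapters = {i: Adapter(dim) for i in range(shallow_k)}
    return layers, adapters


def forward(x, layers, adapters):
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in adapters:  # shallow-layer adaptation only
            x = adapters[i](x)
    return x


layers, adapters = build_model()
x = np.ones((1, 8))
out = forward(x, layers, adapters)
print(out.shape)  # (1, 8)
```

Only the adapter matrices would receive gradients during adversarial training; with zero-initialized `B`, clean-data behavior is preserved at initialization and is only perturbed as far as the robustness objective requires.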