Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit severe robustness deficiencies against visual adversarial attacks, which can induce hallucinations, manipulate responses, and bypass safety mechanisms. This work proposes a plug-and-play robustification framework that requires no additional adversarial fine-tuning. It demonstrates, for the first time, that large-scale adversarially pre-trained vision encoders (such as robust ResNet and robust ViT backbones) can be integrated end-to-end into the LLaVA architecture while remaining fully compatible with standard alignment training. Crucially, the language module adapts to these robust visual representations, substantially enhancing complex multimodal reasoning. Experiments show average robustness gains of 2× on image captioning and 1.5× on visual question answering under adversarial perturbations, over 10% improvement in resistance to jailbreak attacks, and favorable performance on clean data.

📝 Abstract
Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data, ensuring their generalization ability is preserved. However, this limited adversarial training restricts robustness and broader generalization. In this work, we explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these models to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM integration with these robust models facilitates enhanced adaptation of language components to robust visual features, outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jailbreak attacks, we demonstrate that MLLMs trained with these robust models achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against jailbreak attacks. Code and pretrained models will be available at https://github.com/HashmatShadab/Robust-LLaVA.
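The plug-and-play integration the abstract describes follows the generic LLaVA-style wiring: a (frozen) vision encoder produces visual tokens, a small trainable projector maps them into the LLM embedding space, and the projected tokens are prepended to the text sequence. The numpy sketch below illustrates only this wiring pattern; the class and function names are hypothetical, and the toy encoder stands in for the paper's large adversarially pre-trained backbones.

```python
import numpy as np

rng = np.random.default_rng(0)

class VisionEncoder:
    """Stand-in for a frozen, adversarially pre-trained image encoder
    (hypothetical toy model; the paper uses large pre-trained backbones)."""
    def __init__(self, patch_tokens=16, dim=64):
        self.patch_tokens = patch_tokens
        # Linear patch embedding over flattened 8x8 RGB patches.
        self.W = rng.standard_normal((3 * 8 * 8, dim)) / np.sqrt(3 * 8 * 8)

    def __call__(self, image):  # image: (32, 32, 3)
        # Split the image into 16 non-overlapping 8x8 patches.
        patches = image.reshape(4, 8, 4, 8, 3).transpose(0, 2, 1, 3, 4)
        patches = patches.reshape(self.patch_tokens, -1)  # (16, 192)
        return patches @ self.W                           # (16, dim)

class Projector:
    """Trainable layer mapping visual tokens into the LLM embedding space;
    in LLaVA-style alignment this is the main trained vision-side component."""
    def __init__(self, in_dim=64, llm_dim=128):
        self.W = rng.standard_normal((in_dim, llm_dim)) / np.sqrt(in_dim)

    def __call__(self, tokens):
        return np.maximum(tokens @ self.W, 0.0)  # ReLU nonlinearity

def build_multimodal_input(image, text_embeds, encoder, projector):
    """Plug-and-play pattern: any encoder emitting token features can be
    swapped in; visual tokens are prepended to the text token embeddings."""
    visual = projector(encoder(image))
    return np.concatenate([visual, text_embeds], axis=0)

image = rng.random((32, 32, 3))
text = rng.standard_normal((5, 128))  # 5 text tokens, LLM dim 128
seq = build_multimodal_input(image, text, VisionEncoder(), Projector())
print(seq.shape)  # (21, 128): 16 visual tokens + 5 text tokens
```

Because the encoder is treated as an opaque token producer behind the projector, swapping CLIP for an adversarially pre-trained classifier backbone leaves the rest of the alignment pipeline unchanged, which is the sense in which the approach is plug-and-play.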
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Robustness
Adversarial Images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Robust-LLaVA
Pre-trained Image Recognition
Enhanced Multimodal Understanding