On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep neural network (DNN)-based visual perception systems in autonomous vehicles are vulnerable to adversarial attacks that cause misclassifications and safety risks; conventional defenses such as adversarial training often degrade clean-data accuracy and generalize poorly to unseen attacks. Method: The paper proposes Vehicle Vision Language Models (V2LMs), fine-tuned vision-language models specialized for AV perception, and empirically demonstrates that they inherently possess strong adversarial robustness across traffic sign recognition (TSR), automated lane centering (ALC), and vehicle detection (VD), without requiring adversarial training. Two deployment paradigms are introduced: Solo Mode (one task-specialized model per task) and Tandem Mode (a single model jointly fine-tuned on multiple tasks), with Tandem reducing memory footprint by 42% while preserving robustness. Contribution/Results: V2LMs suffer an accuracy drop of less than 8% under diverse adversarial attacks, substantially outperforming conventional DNNs (33–46% drop), and can serve as a plug-and-play module to enhance the security of existing autonomous driving systems.

📝 Abstract
Autonomous vehicles (AVs) rely on deep neural networks (DNNs) for critical tasks such as traffic sign recognition (TSR), automated lane centering (ALC), and vehicle detection (VD). However, these models are vulnerable to attacks that can cause misclassifications and compromise safety. Traditional defense mechanisms, including adversarial training, often degrade benign accuracy and fail to generalize against unseen attacks. In this work, we introduce Vehicle Vision Language Models (V2LMs), fine-tuned vision-language models specialized for AV perception. Our findings demonstrate that V2LMs inherently exhibit superior robustness against unseen attacks without requiring adversarial training, maintaining significantly higher accuracy than conventional DNNs under adversarial conditions. We evaluate two deployment strategies: Solo Mode, where individual V2LMs handle specific perception tasks, and Tandem Mode, where a single unified V2LM is fine-tuned for multiple tasks simultaneously. Experimental results reveal that DNNs suffer performance drops of 33% to 46% under attacks, whereas V2LMs maintain adversarial accuracy with reductions of less than 8% on average. The Tandem Mode further offers a memory-efficient alternative while achieving comparable robustness to Solo Mode. We also explore integrating V2LMs as parallel components to AV perception to enhance resilience against adversarial threats. Our results suggest that V2LMs offer a promising path toward more secure and resilient AV perception systems.
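The Solo vs. Tandem distinction from the abstract can be sketched in a few lines. This is a hypothetical illustration only: the class name, parameter counts, and task-dispatch interface below are assumptions for exposition, not the paper's implementation (only the three task names and the ~42% memory saving come from the paper).

```python
# Hypothetical sketch of Solo Mode (one specialized V2LM per perception task)
# vs. Tandem Mode (one jointly fine-tuned V2LM for all tasks).
# All sizes and names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class V2LM:
    """Stand-in for one fine-tuned vision-language model."""
    name: str
    params_millions: int
    tasks: tuple  # perception tasks this model is fine-tuned for

    def infer(self, task: str, image_id: str) -> str:
        if task not in self.tasks:
            raise ValueError(f"{self.name} is not fine-tuned for {task}")
        # A real V2LM would run the image plus a task prompt through the model.
        return f"{self.name} answered {task} for {image_id}"

# Solo Mode: three independent task-specialized models.
solo = [
    V2LM("tsr-model", params_millions=400, tasks=("TSR",)),
    V2LM("alc-model", params_millions=400, tasks=("ALC",)),
    V2LM("vd-model",  params_millions=400, tasks=("VD",)),
]

# Tandem Mode: a single model jointly fine-tuned on all three tasks.
tandem = V2LM("tandem-model", params_millions=700, tasks=("TSR", "ALC", "VD"))

solo_total = sum(m.params_millions for m in solo)  # 1200M parameters in total
saving = 1 - tandem.params_millions / solo_total   # ~42% with these toy sizes
print(f"Tandem saves {saving:.0%} memory vs. Solo")
```

With these illustrative sizes, the single Tandem model replaces three per-task models at roughly 42% lower memory, mirroring the reported footprint reduction while serving all three perception tasks through one interface.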
Problem

Research questions and friction points this paper is trying to address.

Enhancing robustness of AV perception against adversarial attacks
Reducing performance drop in DNNs under attack conditions
Exploring memory-efficient deployment strategies for vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned vision-language models for AV perception
Inherent robustness against unseen adversarial attacks
Memory-efficient Tandem Mode for multiple tasks