Analysing the Robustness of Vision-Language Models to Common Corruptions

📅 2025-04-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the robustness of vision-language models (VLMs) under 19 common image corruptions, focusing on their impact on scene text understanding and object reasoning. We extend the ImageNet-C corruption taxonomy—originally designed for unimodal classification—to multimodal tasks, introducing two dedicated benchmarks: TextVQA-C and GQA-C. Leveraging frequency-domain analysis, we reveal that Transformer-based VLMs exhibit task-specific sensitivity to corruptions due to an inherent low-frequency bias: text recognition degrades most severely under snow and motion blur, whereas object reasoning is most vulnerable to frost and impulse noise. Our work delivers the first comprehensive cross-corruption robustness assessment for VLMs, uncovering fundamental architectural limitations in handling degraded visual inputs. Beyond benchmark construction, we provide theoretical insights and practical guidelines for designing and optimizing robust VLMs for real-world deployment.

📝 Abstract
Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. However, their robustness to common image corruptions remains under-explored. In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark, spanning four categories: noise, blur, weather, and digital distortions. We introduce two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively. Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning shows higher sensitivity to corruptions such as frost and impulse noise. We connect these observations to the frequency-domain characteristics of different corruptions, revealing how transformers' inherent bias toward low-frequency processing explains their differential robustness patterns. Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.
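The evaluation protocol the abstract describes (corrupt each image at graded severities, re-run the model, and compare accuracy against the clean baseline) can be sketched as below. This is a minimal illustration: the severity constants and the `robustness_drop` helper are assumptions for exposition, not the exact ImageNet-C parameters or the paper's metric.

```python
import numpy as np

def gaussian_noise(img, severity=1, seed=0):
    """Additive Gaussian noise at one of five graded severities.
    Sigma values are illustrative, not the exact ImageNet-C constants."""
    sigma = [0.04, 0.06, 0.08, 0.10, 0.12][severity - 1]
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def impulse_noise(img, severity=1, seed=0):
    """Salt-and-pepper noise: a growing fraction of pixels is forced to 0 or 1."""
    rate = [0.01, 0.02, 0.04, 0.06, 0.10][severity - 1]
    rng = np.random.default_rng(seed)
    out = img.copy()
    mask = rng.random(img.shape) < rate
    out[mask] = rng.choice([0.0, 1.0], size=int(mask.sum()))
    return out

def robustness_drop(clean_acc, corrupted_accs):
    """Mean accuracy drop across severities (a simple robustness score)."""
    return clean_acc - float(np.mean(corrupted_accs))
```

In the benchmark setting, each TextVQA-C / GQA-C image would be corrupted at severities 1-5 for each of the 19 corruption types, the VLM re-run on every corrupted copy, and the per-corruption accuracy compared against the clean baseline.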
Problem

Research questions and friction points this paper is trying to address.

Assessing VLM robustness to diverse image corruptions
Evaluating corruption impact on text and object understanding
Analyzing frequency-domain bias in transformer-based VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive analysis of VLM robustness to 19 corruptions
Introduce TextVQA-C and GQA-C benchmarks for evaluation
Link transformer vulnerabilities to frequency-domain characteristics
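The frequency-domain link in the last point can be illustrated with a small sketch: measure what fraction of a corruption residual's spectral energy sits above a radial frequency cutoff. Impulse-like corruptions are spectrally broadband, while fog- or brightness-like perturbations concentrate energy at low frequencies, which is where a low-frequency-biased transformer does most of its processing. The function name, cutoff, and toy residuals below are assumptions for illustration, not the paper's actual analysis code.

```python
import numpy as np

def high_freq_energy_ratio(residual, cutoff=0.25):
    """Fraction of the residual's spectral energy above a radial
    frequency cutoff (cutoff as a fraction of the maximum radius)."""
    h, w = residual.shape
    spec = np.abs(np.fft.fftshift(np.fft.fft2(residual))) ** 2
    yy, xx = np.mgrid[:h, :w]
    # Normalised radial distance from the spectrum centre (0 = DC)
    r = np.hypot((yy - h / 2) / (h / 2), (xx - w / 2) / (w / 2)) / np.sqrt(2)
    return spec[r > cutoff].sum() / spec.sum()

rng = np.random.default_rng(0)

# Impulse-noise-like residual: sparse +/-1 spikes, spectrally broadband
impulse = np.zeros((64, 64))
mask = rng.random((64, 64)) < 0.05
impulse[mask] = rng.choice([-1.0, 1.0], size=int(mask.sum()))

# Smooth, fog/brightness-like residual: energy concentrated at low frequencies
smooth = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))

print(high_freq_energy_ratio(impulse) > high_freq_energy_ratio(smooth))  # True
```

Under this reading, a model biased toward low-frequency content sees most of an impulse-noise residual fall outside its preferred band, while a smooth low-frequency perturbation lands squarely inside it, which is one way to rationalise the task-specific vulnerability patterns reported above.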
👥 Authors

Muhammad Usama
Department of Electrical Engineering, DHA Suffa University, Karachi, Pakistan

Syeda Aisha Asim
Department of Electrical Engineering, DHA Suffa University, Karachi, Pakistan

Syed Bilal Ali
Sr. Applications Engineer, MSc Communication Engineering (TU Munich)
Magnetic sensors · Telecommunications

Syed Talal Wasim
University of Bonn
Computer Vision · Video Understanding · Action Recognition · Multi-Modal Learning

Umair bin Mansoor
Department of Electrical Engineering, DHA Suffa University, Karachi, Pakistan