🤖 AI Summary
This study systematically evaluates the robustness of vision-language models (VLMs) under 19 common image corruptions, focusing on their impact on scene text understanding and object reasoning. We extend the ImageNet-C corruption taxonomy, originally designed for unimodal image classification, to multimodal tasks, introducing two dedicated benchmarks: TextVQA-C and GQA-C. Using frequency-domain analysis, we show that transformer-based VLMs exhibit task-specific sensitivity to corruptions rooted in their inherent low-frequency bias: text recognition degrades most severely under snow and motion blur, whereas object reasoning is most vulnerable to frost and impulse noise. This work delivers the first comprehensive cross-corruption robustness assessment for VLMs, uncovering fundamental architectural limitations in handling degraded visual inputs. Beyond benchmark construction, we provide theoretical insights and practical guidelines for designing and optimizing robust VLMs for real-world deployment.
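The benchmarks described above apply ImageNet-C-style corruptions at graded severity levels. A minimal, stdlib-only sketch of that idea follows; the `NOISE_SIGMAS` ladder, the flat-list image format, and the function names are illustrative assumptions, not the paper's actual pipeline (which would follow the ImageNet-C reference implementation):

```python
import random

# Illustrative severity ladder for Gaussian noise, in the five-level
# ImageNet-C style. These sigma values are an assumption for this sketch,
# not taken from the paper.
NOISE_SIGMAS = [0.08, 0.12, 0.18, 0.26, 0.38]  # severities 1..5

def gaussian_noise(pixels, severity, seed=0):
    """Add zero-mean Gaussian noise to an image given as a flat list of
    floats in [0, 1], clipping the result back into [0, 1]."""
    sigma = NOISE_SIGMAS[severity - 1]
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

def mean_abs_dev(a, b):
    """Average per-pixel absolute deviation between two images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# A flat mid-gray 16x16 "image" as a stand-in for a benchmark input.
clean = [0.5] * 256
mild = gaussian_noise(clean, severity=1)
harsh = gaussian_noise(clean, severity=5)

print(mean_abs_dev(clean, mild), mean_abs_dev(clean, harsh))
```

Sweeping each corruption over all five severities, as in ImageNet-C, lets robustness be reported as accuracy degradation per corruption type rather than a single aggregate number.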
📝 Abstract
Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. However, their robustness to common image corruptions remains under-explored. In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark, spanning four categories: noise, blur, weather, and digital distortions. We introduce two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively. Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning shows higher sensitivity to corruptions such as frost and impulse noise. We connect these observations to the frequency-domain characteristics of different corruptions, revealing how transformers' inherent bias toward low-frequency processing explains their differential robustness patterns. Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.
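The frequency-domain argument can be made concrete with a toy measurement: noise-like corruptions inject energy at high spatial frequencies, while blurs remove it, so a model biased toward low-frequency content fails differently under the two. A self-contained sketch follows; the naive DFT, the step-edge test image, and the cutoff radius are illustrative choices, not the paper's analysis:

```python
import cmath
import math
import random

def dft2(img):
    """Naive 2-D DFT, O(n^4); fine for tiny demo images."""
    n = len(img)
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0j
            for x in range(n):
                for y in range(n):
                    s += img[x][y] * cmath.exp(-2j * math.pi * (u * x + v * y) / n)
            out[u][v] = s
    return out

def high_freq_fraction(img, cutoff):
    """Fraction of total spectral energy at radial frequency >= cutoff."""
    n = len(img)
    spec = dft2(img)
    total = hi = 0.0
    for u in range(n):
        for v in range(n):
            e = abs(spec[u][v]) ** 2
            fu = u if u <= n // 2 else u - n  # wrap into [-n/2, n/2]
            fv = v if v <= n // 2 else v - n
            total += e
            if math.hypot(fu, fv) >= cutoff:
                hi += e
    return hi / total

n = 16
# A vertical step edge: a crude stand-in for a text-stroke boundary.
clean = [[1.0 if y < n // 2 else 0.0 for y in range(n)] for x in range(n)]

# "Impulse noise": invert 16 distinct pixels.
rng = random.Random(0)
noisy = [row[:] for row in clean]
for idx in rng.sample(range(n * n), 16):
    noisy[idx // n][idx % n] = 1.0 - noisy[idx // n][idx % n]

def box_blur(img):
    """3x3 box filter with wrap-around borders (matches the DFT's periodicity)."""
    n = len(img)
    return [[sum(img[(x + dx) % n][(y + dy) % n]
                 for dx in (-1, 0, 1) for dy in (-1, 0, 1)) / 9.0
             for y in range(n)]
            for x in range(n)]

blurred = box_blur(clean)

hf_clean = high_freq_fraction(clean, 4)
hf_noisy = high_freq_fraction(noisy, 4)
hf_blur = high_freq_fraction(blurred, 4)
print(f"high-freq energy fraction: clean={hf_clean:.3f} noisy={hf_noisy:.3f} blurred={hf_blur:.3f}")
```

The ordering that emerges (blurred < clean < noisy in high-frequency energy share) is the spectral signature behind the differential robustness claim: a low-frequency-biased encoder is comparatively tolerant of added high-frequency noise on object-level tasks but loses the fine edge structure that scene-text recognition needs once blur strips it away.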