🤖 AI Summary
This study systematically evaluates the robustness of vision-language models (VLMs) under 19 common image corruptions, focusing on their impact on scene text understanding and object reasoning. We extend the ImageNet-C corruption taxonomy, originally designed for unimodal image classification, to multimodal tasks, introducing two dedicated benchmarks: TextVQA-C and GQA-C. Using frequency-domain analysis, we show that transformer-based VLMs exhibit task-specific sensitivity to corruptions rooted in their inherent low-frequency bias: text recognition degrades most severely under snow and motion blur, whereas object reasoning is most vulnerable to frost and impulse noise. This work delivers the first comprehensive cross-corruption robustness assessment for VLMs, uncovering fundamental architectural limitations in handling degraded visual inputs. Beyond benchmark construction, we provide theoretical insights and practical guidelines for designing and optimizing robust VLMs for real-world deployment.
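The benchmarks described above apply ImageNet-C-style corruptions at graded severity levels. A minimal, stdlib-only sketch of that idea follows; the `NOISE_SIGMAS` ladder, the flat-list image format, and the function names are illustrative assumptions, not the paper's actual pipeline (which would follow the ImageNet-C reference implementation):

```python
import random

# Illustrative severity ladder for Gaussian noise, in the five-level
# ImageNet-C style. These sigma values are an assumption for this sketch,
# not taken from the paper.
NOISE_SIGMAS = [0.08, 0.12, 0.18, 0.26, 0.38]  # severities 1..5

def gaussian_noise(pixels, severity, seed=0):
    """Add zero-mean Gaussian noise to an image given as a flat list of
    floats in [0, 1], clipping the result back into [0, 1]."""
    sigma = NOISE_SIGMAS[severity - 1]
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

def mean_abs_dev(a, b):
    """Average per-pixel absolute deviation between two images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# A flat mid-gray 16x16 "image" as a stand-in for a benchmark input.
clean = [0.5] * 256
mild = gaussian_noise(clean, severity=1)
harsh = gaussian_noise(clean, severity=5)

print(mean_abs_dev(clean, mild), mean_abs_dev(clean, harsh))
```

Sweeping each corruption over all five severities, as in ImageNet-C, lets robustness be reported as accuracy degradation per corruption type rather than a single aggregate number.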
📝 Abstract
Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. However, their robustness to common image corruptions remains under-explored. In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark, spanning four categories: noise, blur, weather, and digital distortions. We introduce two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively. Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning shows higher sensitivity to corruptions such as frost and impulse noise. We connect these observations to the frequency-domain characteristics of different corruptions, revealing how transformers' inherent bias toward low-frequency processing explains their differential robustness patterns. Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.
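The frequency-domain argument can be made concrete with a toy measurement: noise-like corruptions inject energy at high spatial frequencies, while blurs remove it, so a model biased toward low-frequency content fails differently under the two. A self-contained sketch follows; the naive DFT, the step-edge test image, and the cutoff radius are illustrative choices, not the paper's analysis:

```python
import cmath
import math
import random

def dft2(img):
    """Naive 2-D DFT, O(n^4); fine for tiny demo images."""
    n = len(img)
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0j
            for x in range(n):
                for y in range(n):
                    s += img[x][y] * cmath.exp(-2j * math.pi * (u * x + v * y) / n)
            out[u][v] = s
    return out

def high_freq_fraction(img, cutoff):
    """Fraction of total spectral energy at radial frequency >= cutoff."""
    n = len(img)
    spec = dft2(img)
    total = hi = 0.0
    for u in range(n):
        for v in range(n):
            e = abs(spec[u][v]) ** 2
            fu = u if u <= n // 2 else u - n  # wrap into [-n/2, n/2]
            fv = v if v <= n // 2 else v - n
            total += e
            if math.hypot(fu, fv) >= cutoff:
                hi += e
    return hi / total

n = 16
# A vertical step edge: a crude stand-in for a text-stroke boundary.
clean = [[1.0 if y < n // 2 else 0.0 for y in range(n)] for x in range(n)]

# "Impulse noise": invert 16 distinct pixels.
rng = random.Random(0)
noisy = [row[:] for row in clean]
for idx in rng.sample(range(n * n), 16):
    noisy[idx // n][idx % n] = 1.0 - noisy[idx // n][idx % n]

def box_blur(img):
    """3x3 box filter with wrap-around borders (matches the DFT's periodicity)."""
    n = len(img)
    return [[sum(img[(x + dx) % n][(y + dy) % n]
                 for dx in (-1, 0, 1) for dy in (-1, 0, 1)) / 9.0
             for y in range(n)]
            for x in range(n)]

blurred = box_blur(clean)

hf_clean = high_freq_fraction(clean, 4)
hf_noisy = high_freq_fraction(noisy, 4)
hf_blur = high_freq_fraction(blurred, 4)
print(f"high-freq energy fraction: clean={hf_clean:.3f} noisy={hf_noisy:.3f} blurred={hf_blur:.3f}")
```

The ordering that emerges (blurred < clean < noisy in high-frequency energy share) is the spectral signature behind the differential robustness claim: a low-frequency-biased encoder is comparatively tolerant of added high-frequency noise on object-level tasks but loses the fine edge structure that scene-text recognition needs once blur strips it away.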