Questioning the Stability of Visual Question Answering

📅 2025-11-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically investigates the stability of vision-language models (VLMs) under minimal, semantics-preserving perturbations. Method: We conduct the first large-scale robustness evaluation across multiple state-of-the-art VLMs (e.g., GPT-4o, Gemini 2.0 Flash) and diverse datasets, applying benign image perturbations (pixel shifts, light geometric transformations, padding-rescaling) and text perturbations (question paraphrasing, multilingual rewrites). Contribution/Results: Experiments reveal that mainstream VLMs are highly sensitive even to imperceptible perturbations, and that stability correlates strongly with answer correctness. Building on this, we propose a paradigm that leverages the stability patterns of small models to predict the correctness of large-model answers, enabling high-accuracy reliability assessment. This work uncovers a fundamental fragility in current VLMs, establishes the first evaluation framework for VQA stability under semantics-invariant perturbations, and introduces interpretable, reusable metrics for trustworthy VLM assessment.
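
The benign image perturbations named above are straightforward to reproduce. Below is a minimal sketch, assuming Pillow is available; the shift distance, rotation angle, and padding width are illustrative values, not the paper's settings.

```python
from PIL import Image, ImageOps

def pixel_shift(img: Image.Image, dx: int = 2, dy: int = 2) -> Image.Image:
    """Translate content by a few pixels. Pillow's AFFINE maps output
    coordinates to input coordinates, so (dx, dy) here moves the visible
    content by (-dx, -dy); vacated pixels are filled black."""
    return img.transform(img.size, Image.Transform.AFFINE, (1, 0, dx, 0, 1, dy))

def light_rotation(img: Image.Image, angle: float = 2.0) -> Image.Image:
    """Rotate by a small angle while keeping the original canvas size."""
    return img.rotate(angle, resample=Image.Resampling.BICUBIC, expand=False)

def pad_rescale(img: Image.Image, pad: int = 16) -> Image.Image:
    """Add a uniform border, then rescale back to the original size
    (assumes an RGB image for the black fill color)."""
    padded = ImageOps.expand(img, border=pad, fill=(0, 0, 0))
    return padded.resize(img.size, Image.Resampling.BICUBIC)
```

Each transform preserves the semantics of the image-question pair, which is exactly the invariance the study probes.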

📝 Abstract
Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.
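
The abstract's notion of sample-level stability can be made concrete as an agreement rate over a sample's perturbed variants. The sketch below is illustrative only; the majority-vote aggregation and string normalization are assumptions, not the paper's exact metric.

```python
from collections import Counter

def normalize(answer: str) -> str:
    """Crude normalization for short VQA answers (illustrative assumption)."""
    return answer.strip().lower()

def stability_score(answers: list[str]) -> float:
    """Fraction of a sample's perturbed variants that agree with the
    modal (most common) answer; 1.0 means the model answered identically
    under every perturbation."""
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# Example: one model's answers on five benign variants of the same sample
print(stability_score(["cat", "cat", "Cat", "dog", "cat"]))  # 0.8
```

Under the paper's finding that stable samples are far more likely to be answered correctly, a high score like 0.8 would flag the answer as probably reliable.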
Problem

Research questions and friction points this paper is trying to address.

How robust are VLMs to minor, meaning-preserving visual and textual perturbations?
How does instability vary across perturbation types, question categories, and models?
Can sample-level stability predict answer correctness?
Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale, systematic evaluation of VLM robustness to benign, semantics-preserving perturbations
Characterization of instability across perturbation types, question categories, and models
Small-model stability patterns as a predictor of large-model answer correctness (see the sketch below)
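
The last item, predicting a large closed-source model's correctness from a small open model's stability, amounts to thresholding the stability score sketched earlier. In this hypothetical sketch, small_model is a placeholder callable and the 0.8 threshold is an assumed value, not the paper's.

```python
def predict_large_model_correct(image, question, small_model, perturbations,
                                threshold: float = 0.8) -> bool:
    """Flag a sample as likely answered correctly by a large model, using
    only a small model's answer stability under benign perturbations.

    small_model(image, question) -> str is a hypothetical interface;
    perturbations is a list of image transforms like those sketched above;
    stability_score is the agreement-rate function from the earlier sketch.
    """
    answers = [small_model(image, question)]  # unperturbed baseline
    answers += [small_model(p(image), question) for p in perturbations]
    return stability_score(answers) >= threshold
```

Samples that pass the threshold are predicted correct for the larger model; per the summary, this kind of stability-based screening yields high-precision reliability estimates.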
🔎 Similar Papers
No similar papers found.