🤖 AI Summary
This study investigates the sensitivity of vision-language models (VLMs) to violations of Grice’s conversational maxims in visual question answering (VQA), probing whether they exhibit human-like conversational robustness.
Method: We introduce a novel diagnostic paradigm that systematically injects semantically redundant or logically disruptive modifiers into VQA v2.0 questions to explicitly violate Gricean cooperative principles—particularly the Maxims of Quantity and Relation.
Contribution/Results: Evaluating three state-of-the-art VLMs—GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Flash—we observe significant and consistent performance degradation under modification (average accuracy drops of 8.2–14.7%), revealing critical weaknesses in pragmatic, non-literal reasoning. To our knowledge, this is the first systematic diagnosis of VLM conversational robustness grounded in Gricean pragmatics. Our modifier-based probing method provides a scalable, interpretable, and theoretically grounded tool for evaluating and improving VLMs’ pragmatic competence.
📝 Abstract
Visual Question Answering (VQA) can be viewed as a (multimodal) conversation between a human and an AI system. Here, we explore the sensitivity of Vision Language Models (VLMs) through the lens of the cooperative principles of conversation proposed by Grice. Even when Grice's maxims of conversation are flouted, humans typically have little difficulty understanding the conversation, although it requires more cognitive effort. We study whether VLMs can handle violations of Grice's maxims in a similarly robust manner. Specifically, we add modifiers to human-crafted questions and analyze the responses of VLMs to these modified questions. We evaluate three state-of-the-art VLMs, namely GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Flash, on questions from the VQA v2.0 dataset. Our initial results indicate that the performance of VLMs consistently diminishes with the addition of modifiers, suggesting that our approach is a promising direction for understanding the limitations of VLMs.
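The modifier-injection probe described above can be sketched in a few lines. Note that the modifier phrases and the prefix-insertion rule below are illustrative assumptions for exposition only; the paper does not specify its exact modifier set or injection procedure.

```python
import random

# Hypothetical modifier phrases (illustrative stand-ins, not the paper's actual set).
QUANTITY_MODIFIERS = [  # semantically redundant: add no new information
    "Looking carefully at everything shown in the picture,",
    "Considering all the visible details in the image,",
]
RELATION_MODIFIERS = [  # logically disruptive: irrelevant to the visual content
    "Regardless of what day of the week it is,",
    "Setting aside any unrelated considerations,",
]

def inject_modifier(question: str, maxim: str, rng: random.Random) -> str:
    """Prefix a VQA question with a phrase that flouts the given Gricean maxim.

    maxim: "quantity" (redundant prefix) or "relation" (irrelevant prefix).
    """
    pool = QUANTITY_MODIFIERS if maxim == "quantity" else RELATION_MODIFIERS
    prefix = rng.choice(pool)
    # Lower-case the first letter of the original question so the
    # combined sentence reads naturally after the prefix.
    return f"{prefix} {question[0].lower() + question[1:]}"

rng = random.Random(0)
original = "What color is the bus?"
modified = inject_modifier(original, "quantity", rng)
print(modified)
```

Each modified question would then be sent to a VLM alongside its image, and accuracy on the modified set compared against the unmodified baseline to measure the degradation reported in the results.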