Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions

📅 2025-07-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual Question Answering (VQA) suffers from ambiguous user queries, and existing models lack the capability to actively seek clarification. Method: This paper proposes an interactive ambiguity resolution paradigm wherein Vision-Language Models (VLMs) proactively generate clarifying questions to elicit user feedback and resolve ambiguity. To support this paradigm, we introduce ClearVQA—the first interactive clarification benchmark for VQA—covering three canonical ambiguity types: referring expressions, attributes, and relational dependencies, along with a human-AI collaborative evaluation protocol. We further design a multi-stage training framework to mitigate VLMs’ inherent bias toward answer generation over question formulation. Contribution/Results: Experiments demonstrate that integrating active clarification significantly improves model accuracy on ambiguous questions, validating the effectiveness of shifting from passive response to proactive inquiry in VQA.

📝 Abstract
In the visual question answering (VQA) context, users often pose ambiguous questions to vision-language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) benchmarks are absent to assess VLMs' capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering rather than asking, preventing them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in the VQA context and encompasses various VQA scenarios.
Problem

Research questions and friction points this paper is trying to address.

Resolving ambiguity in visual question answering
Assessing VLMs' capacity for interactive clarification
Overcoming VLMs' reluctance to seek user clarification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing ClearVQA benchmark for ambiguity resolution
Training VLMs to ask clarifying questions interactively
Addressing three common ambiguity categories in VQA