What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

πŸ“… 2026-01-07
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a critical gap in the evaluation of vision-language models (VLMs): benchmarks predominantly rely on well-structured queries and therefore fail to test how models handle the informal, under-specified questions users ask in real-world scenarios. To this end, the authors introduce HAERAE-Vision, a benchmark of 653 authentic ambiguous visual questions sourced from Korean online communities, each paired with a clarified rewrite, yielding 1,306 total queries. A large-scale evaluation across 39 state-of-the-art VLMs, including GPT-5 and Gemini 2.5 Pro, reveals that query under-specification is a key bottleneck: top models achieve below 50% accuracy on the original queries, yet gain 8–22 percentage points when the queries are rewritten explicitly, with the improvement most pronounced in smaller models. Notably, even augmenting the models with web search fails to compensate for the performance degradation caused by ambiguity.

πŸ“ Abstract
Current vision-language benchmarks predominantly feature well-structured questions with clear, explicit prompts. However, real user queries are often informal and under-specified: users naturally leave much unsaid, relying on images to convey context. We introduce HAERAE-Vision, a benchmark of 653 real-world visual questions from Korean online communities (a 0.76% survival rate from 86K candidates), each paired with an explicit rewrite, yielding 1,306 query variants in total. Evaluating 39 VLMs, we find that even state-of-the-art models (GPT-5, Gemini 2.5 Pro) achieve under 50% accuracy on the original queries. Crucially, query explicitation alone yields 8 to 22 percentage-point improvements, with smaller models benefiting most. We further show that under-specified queries with web search still underperform explicit queries without it, revealing that current retrieval cannot compensate for what users leave unsaid. Our findings demonstrate that a substantial portion of VLM difficulty stems from natural query under-specification rather than model capability, highlighting a critical gap between benchmark evaluation and real-world deployment.
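The paired-query design described above can be sketched as a simple evaluation loop: score a model on each item's original (under-specified) query and on its explicit rewrite, then report the accuracy delta in percentage points. This is a minimal illustration, not the paper's actual harness; the item schema, the `toy_model`, and all names here are hypothetical.

```python
# Sketch of paired-query evaluation: measure how much accuracy a model
# gains when under-specified queries are replaced by explicit rewrites.
# The data format and model interface below are assumptions for illustration.

def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def explicitation_gain(model, items):
    """Accuracy delta (percentage points) from query explicitation.

    `model` is any callable (image, question) -> answer; `items` is a list
    of dicts with 'image', 'original', 'explicit', and 'answer' keys.
    """
    golds = [it["answer"] for it in items]
    orig_preds = [model(it["image"], it["original"]) for it in items]
    expl_preds = [model(it["image"], it["explicit"]) for it in items]
    return 100 * (accuracy(expl_preds, golds) - accuracy(orig_preds, golds))

# Toy model that answers correctly only when the query names the entity,
# mimicking a VLM that cannot recover unstated context from the image.
toy_items = [
    {"image": "img1", "original": "what is this?",
     "explicit": "what plant is shown in this photo?", "answer": "fern"},
    {"image": "img2", "original": "is it ok?",
     "explicit": "is this mushroom safe to eat?", "answer": "no"},
]

def toy_model(image, question):
    lookup = {
        "what plant is shown in this photo?": "fern",
        "is this mushroom safe to eat?": "no",
    }
    return lookup.get(question, "unknown")

print(explicitation_gain(toy_model, toy_items))  # toy model gains the full gap
```

On real benchmarks the gain is of course far smaller than in this toy case; the point of the sketch is only that the metric isolates query wording while holding the image and gold answer fixed.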
Problem

Research questions and friction points this paper is trying to address.

under-specified queries
vision-language models
real-world queries
query explicitation
benchmark gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

under-specified queries
vision-language models
HAERAE-Vision
query explicitation
real-world VQA
πŸ”Ž Similar Papers
No similar papers found.