AI Summary
This work addresses a critical gap in the evaluation of vision-language models (VLMs): existing evaluations rely predominantly on structured queries and so fail to capture how well models handle the informal, underspecified questions that users ask in real-world scenarios. To this end, we introduce HAERAE-Vision, a benchmark comprising 653 authentic ambiguous visual questions sourced from Korean online communities alongside their clarified rewrites, yielding 1,306 total queries. We conduct a large-scale evaluation across 39 VLMs, including the state-of-the-art GPT-5 and Gemini 2.5 Pro, and find that query underspecification is a key bottleneck: top models achieve below 50% accuracy on original queries, yet gain 8 to 22 percentage points through explicit rewriting, an improvement that is more pronounced in smaller models. Notably, even augmenting with web search fails to compensate for the performance degradation caused by ambiguity.
Abstract
Current vision-language benchmarks predominantly feature well-structured questions with clear, explicit prompts. However, real user queries are often informal and underspecified. Users naturally leave much unsaid, relying on images to convey context. We introduce HAERAE-Vision, a benchmark of 653 real-world visual questions from Korean online communities (0.76% survival from 86K candidates), each paired with an explicit rewrite, yielding 1,306 query variants in total. Evaluating 39 VLMs, we find that even state-of-the-art models (GPT-5, Gemini 2.5 Pro) achieve under 50% on the original queries. Crucially, query explicitation alone yields 8 to 22 point improvements, with smaller models benefiting most. We further show that even with web search, underspecified queries underperform explicit queries without search, revealing that current retrieval cannot compensate for what users leave unsaid. Our findings demonstrate that a substantial portion of VLM difficulty stems from natural query underspecification rather than from limits in model capability, highlighting a critical gap between benchmark evaluation and real-world deployment.
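The counts quoted above can be checked with simple arithmetic, and the reported "point improvement" can be read as the difference between accuracy on explicit rewrites and accuracy on the original queries. The sketch below is illustrative only: it assumes each query is scored as simply correct or incorrect, which the abstract does not specify. The function name `rewrite_gain` and the toy data are hypothetical, and 86K is treated as exactly 86,000.

```python
# Counts stated in the abstract.
NUM_QUESTIONS = 653
NUM_CANDIDATES = 86_000  # "86K" in the abstract; the exact figure is not given

# Each original question is paired with one explicit rewrite.
total_queries = NUM_QUESTIONS * 2
# Share of candidate questions that survived filtering, in percent.
survival_rate = 100 * NUM_QUESTIONS / NUM_CANDIDATES


def rewrite_gain(original_correct, explicit_correct):
    """Accuracy gain, in percentage points, from original to explicit queries.

    Both arguments are equal-length lists of booleans, one entry per question
    (assumption: binary correct/incorrect scoring).
    """
    if len(original_correct) != len(explicit_correct):
        raise ValueError("paired lists must have equal length")
    n = len(original_correct)
    acc_original = 100 * sum(original_correct) / n
    acc_explicit = 100 * sum(explicit_correct) / n
    return acc_explicit - acc_original


# Toy data, made up for illustration only (not results from the benchmark).
toy_original = [True, False, False, True]
toy_explicit = [True, True, False, True]
toy_gain = rewrite_gain(toy_original, toy_explicit)

print(total_queries, round(survival_rate, 2), toy_gain)
```

Under these assumptions, the paired design means the gain is computed over the same 653 underlying questions, so the difference isolates the effect of the rewrite instead of a change in question difficulty.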