🤖 AI Summary
Large Vision-Language Models (LVLMs) still produce non-factual responses in fact-seeking visual question answering, and existing benchmarks cannot independently assess the vision and language modalities. To address this, the authors propose VisualSimpleQA, a benchmark that introduces a modality-decoupled evaluation paradigm. It features a high-quality dataset built through human annotation guided by well-defined difficulty criteria, from which a challenging subset, VisualSimpleQA-hard, is derived. This enables fine-grained diagnosis of modality-specific failure points in cross-modal reasoning. Evaluation across 15 state-of-the-art LVLMs reveals substantial factuality gaps: GPT-4o reaches only 60%+ correctness on the main set and drops sharply to 30%+ on the hard subset. These results highlight critical bottlenecks in current LVLMs' factuality and cross-modal grounding capabilities.
📝 Abstract
Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet non-factual responses remain prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily compare model outputs to ground-truth answers, providing limited insight into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in the visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and to facilitate the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve only 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial room for improvement in both the visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.
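To make the idea of decoupled evaluation concrete, here is a minimal sketch (the function names and the exact diagnostic protocol are illustrative assumptions, not the paper's published procedure). The intuition: compare a model's answer on the full multimodal question with its answer on a text-only variant where the visual entity is named directly, so that a failure can be attributed to either the visual (recognition) module or the linguistic (knowledge) module.

```python
# Hypothetical sketch of modality-decoupled failure attribution.
# Assumption: each sample has been graded twice -- once on the original
# image + question, and once on a text-only variant that names the
# entity shown in the image.

def decoupled_diagnosis(multimodal_correct: bool, text_only_correct: bool) -> str:
    """Attribute one sample's outcome to a modality-specific module."""
    if multimodal_correct:
        return "correct"
    if text_only_correct:
        # The model knows the fact when the entity is given in text,
        # so visual recognition is the likely bottleneck.
        return "visual failure"
    # The model fails even with the entity named: the linguistic /
    # knowledge module lacks the fact.
    return "linguistic failure"

def summarize(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """Aggregate per-sample (multimodal_ok, text_only_ok) pairs into rates."""
    counts: dict[str, int] = {}
    for mm_ok, txt_ok in results:
        label = decoupled_diagnosis(mm_ok, txt_ok)
        counts[label] = counts.get(label, 0) + 1
    total = len(results)
    return {label: n / total for label, n in counts.items()}
```

Under this (assumed) protocol, the per-model rates of "visual failure" versus "linguistic failure" indicate which module offers more room for improvement.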