VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (LVLMs) still produce non-factual responses in fact-seeking visual question answering, and existing multimodal benchmarks mainly compare model outputs against ground-truth answers, offering little insight into the performance of modality-specific modules. To address this, the paper introduces VisualSimpleQA, a multimodal fact-seeking benchmark that supports streamlined, decoupled evaluation of the visual and linguistic modalities. The dataset is built through human annotation guided by well-defined difficulty criteria, which also enable the extraction of a challenging subset, VisualSimpleQA-hard. This design allows fine-grained diagnosis of modality-specific failure points. Experiments on 15 state-of-the-art LVLMs show substantial room for improvement: even GPT-4o reaches only 60%+ correctness on VisualSimpleQA and 30%+ on VisualSimpleQA-hard, highlighting bottlenecks in both the visual and linguistic modules of current models.

📝 Abstract
Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.
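
The abstract points to a Hugging Face dataset repository, so the benchmark can presumably be pulled with the `datasets` library. Below is a minimal sketch that assumes only the repository id given above; the split names and column schema are not stated on this page and should be inspected before use.

```python
# Minimal sketch: pull VisualSimpleQA from the Hugging Face Hub.
# Only the repository id "WYLing/VisualSimpleQA" comes from the abstract;
# split names and column names are unknown here and must be inspected.
from datasets import load_dataset

dataset = load_dataset("WYLing/VisualSimpleQA")
print(dataset)            # lists the available splits and their columns
first_split = next(iter(dataset.values()))
print(first_split[0])     # peek at one annotated example
```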
Problem

Research questions and friction points this paper is trying to address.

Non-factual responses remain prevalent when LVLMs answer fact-seeking questions
Existing multimodal benchmarks compare outputs to ground truth but give little insight into the visual and linguistic modules separately
Prior benchmarks lack well-defined difficulty criteria for building genuinely challenging evaluation subsets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled evaluation of visual and linguistic modules (a rough sketch follows this list)
Incorporates difficulty criteria for human annotation
Challenging subset VisualSimpleQA-hard for rigorous testing
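
The page does not spell out how the decoupled evaluation is actually run, so the following is only a plausible sketch: query the model once with the image and once with a text-only variant of the question, then compare outcomes to attribute failures to the visual or the linguistic module. The functions `query_lvlm` and `is_correct`, and the field names `question`, `image`, `text_only_question`, and `answer`, are hypothetical placeholders rather than the paper's protocol.

```python
# Hedged sketch of a modality-decoupled evaluation loop; the concrete
# protocol and field names are assumptions, not taken from the paper.

def query_lvlm(question, image=None):
    """Placeholder for a call to the LVLM under evaluation."""
    raise NotImplementedError

def is_correct(prediction, ground_truth):
    """Placeholder correctness check (exact match; the paper may use another judge)."""
    return prediction.strip().lower() == ground_truth.strip().lower()

def decoupled_eval(samples):
    """Attribute each failure to the visual or the linguistic side."""
    results = []
    for s in samples:
        # Full multimodal setting: image + question.
        multimodal_ok = is_correct(query_lvlm(s["question"], s["image"]), s["answer"])
        # Text-only probe: a rewritten question that no longer needs the image,
        # so errors here point at the linguistic (knowledge) module.
        text_only_ok = is_correct(query_lvlm(s["text_only_question"]), s["answer"])
        results.append({"multimodal": multimodal_ok, "text_only": text_only_ok})
    return results
```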
Authors

Yanling Wang, Zhipu AI (Data Mining, Natural Language Processing)
Yihan Zhao, Renmin University of China
Xiaodong Chen, Renmin University of China
Shasha Guo, Renmin University of China (Natural Language Processing, Large Language Model)
Lixin Liu, Tencent
Haoyang Li, Renmin University of China
Yong Xiao, Zhongguancun Laboratory and Renmin University of China
Jing Zhang, Renmin University of China
Qi Li, Zhongguancun Laboratory and Tsinghua University
Ke Xu, Zhongguancun Laboratory and Tsinghua University