🤖 AI Summary
Large Vision-Language Models (LVLMs) still produce non-factual responses in fact-seeking visual question answering, and existing benchmarks cannot independently assess the vision and language modalities. To address this, the authors propose VisualSimpleQA, a benchmark that introduces a modality-decoupled evaluation paradigm. It features a high-quality dataset built through human annotation guided by well-defined difficulty criteria, from which a challenging subset, VisualSimpleQA-hard, is derived. This enables fine-grained diagnosis of modality-specific failure points in cross-modal reasoning. Evaluation across 15 state-of-the-art LVLMs reveals substantial factuality gaps: GPT-4o reaches only 60%+ correctness on the main set and drops sharply to 30%+ on the hard subset. These results highlight critical bottlenecks in current LVLMs' factuality and cross-modal grounding capabilities.
📝 Abstract
Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet non-factual responses remain prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily compare model outputs to ground-truth answers, providing limited insight into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in the visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and to facilitate the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve only 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial room for improvement in both the visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.
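To make the idea of decoupled evaluation concrete, here is a minimal sketch (the function names and the exact diagnostic protocol are illustrative assumptions, not the paper's published procedure). The intuition: compare a model's answer on the full multimodal question with its answer on a text-only variant where the visual entity is named directly, so that a failure can be attributed to either the visual (recognition) module or the linguistic (knowledge) module.

```python
# Hypothetical sketch of modality-decoupled failure attribution.
# Assumption: each sample has been graded twice -- once on the original
# image + question, and once on a text-only variant that names the
# entity shown in the image.

def decoupled_diagnosis(multimodal_correct: bool, text_only_correct: bool) -> str:
    """Attribute one sample's outcome to a modality-specific module."""
    if multimodal_correct:
        return "correct"
    if text_only_correct:
        # The model knows the fact when the entity is given in text,
        # so visual recognition is the likely bottleneck.
        return "visual failure"
    # The model fails even with the entity named: the linguistic /
    # knowledge module lacks the fact.
    return "linguistic failure"

def summarize(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """Aggregate per-sample (multimodal_ok, text_only_ok) pairs into rates."""
    counts: dict[str, int] = {}
    for mm_ok, txt_ok in results:
        label = decoupled_diagnosis(mm_ok, txt_ok)
        counts[label] = counts.get(label, 0) + 1
    total = len(results)
    return {label: n / total for label, n in counts.items()}
```

Under this (assumed) protocol, the per-model rates of "visual failure" versus "linguistic failure" indicate which module offers more room for improvement.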