LiveVQA: Live Visual Knowledge Seeking

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) show severe limitations on multi-hop visual question answering (VQA) tasks that require up-to-date visual knowledge. Method: The paper introduces "Real-time Visual Knowledge Question Answering" as a benchmark task and constructs the first automatically curated VQA dataset targeting time-sensitive visual reasoning. The dataset spans six news websites and 14 categories of trending events, comprising 3,602 single- and multi-hop questions with strict image-text alignment and temporal-validity constraints. The construction pipeline integrates vision-semantic aligned web crawling, multi-hop question synthesis, and rigorous quality filtering. Contribution/Results: Evaluation of 15 state-of-the-art MLLMs, including GPT-4o and Qwen-2.5-VL, reveals a substantial performance drop on multi-hop visual questions; the best-performing model reaches only 58.2% accuracy. This suggests that visual reasoning capability, rather than textual understanding or tool invocation, is the critical bottleneck in contemporary multimodal reasoning.
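The single- vs multi-hop accuracy gap reported above can be computed with a simple per-hop breakdown. The sketch below is an illustrative assumption about how such a breakdown might be scored (exact string match); it is not the paper's actual evaluation code.

```python
from collections import defaultdict

def per_hop_accuracy(records):
    """Compute accuracy grouped by hop count.

    records: iterable of (hops, prediction, gold) tuples, where hops is 1
    for single-hop questions and 2+ for multi-hop questions. Scoring here
    is case-insensitive exact match, a simplifying assumption.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for hops, pred, gold in records:
        total[hops] += 1
        if pred.strip().lower() == gold.strip().lower():
            correct[hops] += 1
    return {h: correct[h] / total[h] for h in total}
```

Comparing the hop-1 and hop-2 entries of the returned dict gives the kind of single- vs multi-hop gap the benchmark reports.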

📝 Abstract
We introduce LiveVQA, an automatically collected dataset of latest visual knowledge from the Internet with synthesized VQA problems. LiveVQA consists of 3,602 single- and multi-hop visual questions from 6 news websites across 14 news categories, featuring high-quality image-text coherence and authentic information. Our evaluation across 15 MLLMs (e.g., GPT-4o, Gemma-3, and Qwen-2.5-VL family) demonstrates that stronger models perform better overall, with advanced visual reasoning capabilities proving crucial for complex multi-hop questions. Despite excellent performance on textual problems, models with tools like search engines still show significant gaps when addressing visual questions requiring latest visual knowledge, highlighting important areas for future research.
Problem

Research questions and friction points this paper is trying to address.

Creating a dataset for visual question answering with latest internet knowledge
Evaluating MLLMs on multi-hop visual reasoning with current visual data
Identifying gaps in models handling visual questions needing up-to-date knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically collected latest visual knowledge dataset
Synthesized VQA problems from news websites
Evaluated 15 MLLMs on visual reasoning capabilities
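The three-stage construction described above (crawl articles with aligned images, synthesize single- and multi-hop questions, filter for quality) can be sketched as follows. All class names, question templates, and the filtering heuristic are hypothetical stand-ins, not the authors' implementation; real crawling and LLM-based synthesis are mocked with plain data structures.

```python
from dataclasses import dataclass

@dataclass
class Article:
    title: str
    category: str
    image_caption: str  # stands in for the aligned news image
    body: str

@dataclass
class VQAItem:
    question: str
    answer: str
    hops: int  # 1 = single-hop, 2+ = multi-hop

def synthesize_questions(article: Article) -> list:
    """Turn one article into a single-hop and a multi-hop question (toy templates)."""
    return [
        # Single-hop: answerable from the image/caption alone.
        VQAItem(
            question=f"What event is shown in the image? ({article.category})",
            answer=article.title,
            hops=1,
        ),
        # Multi-hop: requires linking the image to facts in the article body.
        VQAItem(
            question="Based on the image, what detail does the article report?",
            answer=article.body,
            hops=2,
        ),
    ]

def quality_filter(items: list) -> list:
    """Toy coherence check: drop items with trivially short answers."""
    return [it for it in items if len(it.answer.split()) >= 2]

def build_dataset(articles: list) -> list:
    dataset = []
    for art in articles:
        dataset.extend(quality_filter(synthesize_questions(art)))
    return dataset
```

In the real pipeline, synthesis and filtering would be driven by models checking image-text coherence and temporal validity rather than by the word-count heuristic shown here.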
Authors

Mingyang Fu
Huazhong University of Science and Technology

Yuyang Peng
Huazhong University of Science and Technology

Benlin Liu
University of Washington
Computer Vision, Machine Learning, Visual Intelligence

Yao Wan
Huazhong University of Science and Technology
NLP, Programming Languages, Software Engineering, Large Language Models

Dongping Chen
Huazhong University of Science and Technology, University of Washington