From Pixels to Facts (Pix2Fact): Benchmarking Multi-Hop Reasoning for Fine-Grained Visual Fact Checking

📅 2026-01-31
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Current vision-language models perform poorly on tasks that require fine-grained visual understanding coupled with multi-hop knowledge reasoning, and existing benchmarks lack a unified framework for evaluating these joint capabilities. To address this gap, this work proposes Pix2Fact, a benchmark that integrates high-resolution (4K+) images, expert-designed multi-hop fact-checking questions, and explicit dependencies on external knowledge. Pix2Fact is the first to jointly assess fine-grained visual grounding and knowledge-intensive multi-hop reasoning within a single evaluation paradigm. Comprehensive evaluation of nine state-of-the-art models, including Gemini-3-Pro and GPT-5, reveals that even the best-performing model achieves only 24.0% average accuracy, far below human performance of 56%, highlighting a significant deficit in current models' ability to emulate human-like visual understanding.

📝 Abstract
Despite progress on general tasks, vision-language models (VLMs) struggle with challenges that demand both detailed visual grounding and deliberate knowledge-based reasoning, a synergy not captured by existing benchmarks, which evaluate these skills separately. To close this gap, we introduce Pix2Fact, a new visual question-answering benchmark designed to evaluate expert-level perception and knowledge-intensive multi-hop reasoning. Pix2Fact contains 1,000 high-resolution (4K+) images spanning 8 daily-life scenarios, with questions and answers meticulously crafted by annotators holding PhDs from top global universities, working in partnership with a professional data annotation firm. Each question requires detailed visual grounding, multi-hop reasoning, and the integration of external knowledge to answer. Our evaluation of 9 state-of-the-art VLMs, including proprietary models such as Gemini-3-Pro and GPT-5, reveals the substantial challenge posed by Pix2Fact: the most advanced model achieves only 24.0% average accuracy, in stark contrast to human performance of 56%. This gap underscores the limitations of current models in replicating human-level visual comprehension. We believe Pix2Fact will serve as a critical benchmark for driving the development of next-generation multimodal agents that combine fine-grained perception with robust, knowledge-based reasoning.
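The evaluation protocol implied by the abstract is a standard per-question accuracy average over the benchmark. The sketch below is a minimal, hypothetical harness for running a model on a Pix2Fact-style dataset; the item schema (`image_path`, `question`, `answer`), the `query_model` callable, and the exact-match scorer are all assumptions for illustration, not the authors' released code or scoring rule.

```python
import json

def load_benchmark(path):
    """Load benchmark items from a JSON file. Each item is assumed
    (hypothetically) to carry 'image_path', 'question', and 'answer'."""
    with open(path) as f:
        return json.load(f)

def exact_match(prediction, reference):
    """Trivial normalized exact-match scorer; the paper's actual
    scoring rule may differ (e.g., judge-based matching)."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(model, items, query_model, match=exact_match):
    """Average accuracy of `model` over benchmark `items`.

    query_model(model, image_path, question) -> str is a placeholder
    for whatever VLM inference call the evaluator actually uses.
    """
    correct = 0
    for item in items:
        prediction = query_model(model, item["image_path"], item["question"])
        if match(prediction, item["answer"]):
            correct += 1
    return correct / len(items)
```

Under this kind of harness, the paper's headline numbers correspond to the returned average accuracy: 24.0% for the best of the 9 evaluated VLMs versus 56% for human annotators.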
Problem

Research questions and friction points this paper is trying to address.

visual fact checking
multi-hop reasoning
visual grounding
knowledge-intensive reasoning
multimodal benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-hop reasoning
fine-grained visual grounding
visual fact checking
vision-language models
knowledge-intensive VQA
👥 Authors
Yifan Jiang
GADE Union (Global AI Data Experts Union); Shanghai Jiao Tong University, China
Cong Zhang
GADE Union (Global AI Data Experts Union); Nanyang Technological University, Singapore
Bofei Zhang
BIGAI
Yifan Yang
Nanyang Technological University, Singapore
Bingzhang Wang
Ph.D. in Intelligent Transportation, University of Washington
AI in Transportation
Yew-Soon Ong
President Chair Professor of Computer Science, A*STAR AI Chief Scientist, FIEEE
Artificial Intelligence, Statistical ML, Evolutionary Optimization, Bayesian Optimization