🤖 AI Summary
Current vision-language models exhibit limited performance on tasks that require fine-grained visual understanding coupled with multi-hop knowledge reasoning, and existing benchmarks lack a unified framework for evaluating these joint capabilities. To address this gap, this work proposes Pix2Fact, a benchmark that integrates high-resolution (4K+) images, expert-designed multi-hop fact-checking questions, and explicit dependencies on external knowledge. Pix2Fact is the first to jointly assess fine-grained visual grounding and knowledge-intensive multi-hop reasoning within a single evaluation paradigm. A comprehensive evaluation of nine state-of-the-art models, including Gemini-3-Pro and GPT-5, reveals that even the best-performing model achieves only 24.0% accuracy, substantially below human performance of 56%, highlighting a significant deficit in current models' ability to emulate human-like visual understanding.
📝 Abstract
Despite progress on general tasks, vision-language models (VLMs) struggle with challenges that demand both detailed visual grounding and deliberate knowledge-based reasoning, a synergy not captured by existing benchmarks, which evaluate these skills separately. To close this gap, we introduce Pix2Fact, a new visual question-answering benchmark designed to evaluate expert-level perception and knowledge-intensive multi-hop reasoning. Pix2Fact contains 1,000 high-resolution (4K+) images spanning 8 daily-life scenarios, with questions and answers meticulously crafted by annotators holding PhDs from top global universities, working in partnership with a professional data annotation firm. Each question requires detailed visual grounding, multi-hop reasoning, and the integration of external knowledge to answer. Our evaluation of 9 state-of-the-art VLMs, including proprietary models such as Gemini-3-Pro and GPT-5, reveals the substantial challenge posed by Pix2Fact: the most advanced model achieves only 24.0% average accuracy, in stark contrast to human performance of 56%. This significant gap underscores the limitations of current models in replicating human-level visual comprehension. We believe Pix2Fact will serve as a critical benchmark for driving the development of next-generation multimodal agents that combine fine-grained perception with robust, knowledge-based reasoning.
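The abstract reports average accuracy as the headline metric but does not specify an evaluation protocol. Purely as an illustrative sketch, the snippet below shows how predictions on a Pix2Fact-style benchmark could be scored with normalized exact match; the file path and the `answer` and `prediction` field names are hypothetical assumptions, not the authors' actual data format.

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line (hypothetical Pix2Fact item format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def exact_match_accuracy(items: list[dict]) -> float:
    """Score model outputs against gold answers by normalized exact match.

    Each item is assumed to carry a gold 'answer' and a model
    'prediction' field; the real benchmark files may differ.
    """
    if not items:
        return 0.0
    correct = sum(
        item["prediction"].strip().lower() == item["answer"].strip().lower()
        for item in items
    )
    return correct / len(items)

if __name__ == "__main__":
    items = load_jsonl("pix2fact_predictions.jsonl")  # hypothetical path
    print(f"Accuracy: {exact_match_accuracy(items):.1%}")
```

A real harness would additionally run model inference over the 4K+ images and likely need richer answer normalization than lowercasing and whitespace stripping, since free-form VQA answers rarely match gold strings verbatim.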