🤖 AI Summary
Existing visual reasoning benchmarks cannot cleanly separate whether the performance of multimodal large language models (MLLMs) stems from genuine visual understanding or from reliance on linguistic priors. To address this limitation, this work proposes VisReason, a benchmark of vision-centric reasoning tasks in which perception and reasoning are tightly coupled within everyday scenarios. VisReason comprises 1,505 curated questions across 10 categories, annotated along fine-grained dimensions of perceptual, structural, and conceptual reasoning. Comparison against human baselines reveals substantial gaps in current MLLMs’ capabilities on these tasks, and existing test-time reasoning strategies yield only limited gains. The benchmark thus provides a diagnostic tool for vision-centric reasoning and a concrete target for future model development.
📝 Abstract
Recent multimodal large language models (MLLMs) achieve strong performance on visual reasoning benchmarks, yet it remains unclear to what extent such performance reflects reasoning directly grounded in visual evidence. We introduce VisReason, a benchmark for vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. VisReason contains 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning. Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing substantial gaps between humans and current MLLMs and revealing limited benefits from test-time reasoning strategies. VisReason offers a focused diagnostic for evaluating vision-centric reasoning beyond language.