Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning

📅 2026-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual reasoning benchmarks struggle to disentangle whether the performance of multimodal large language models (MLLMs) stems from genuine visual understanding or from reliance on linguistic priors. To address this limitation, this work proposes VisReason, a benchmark of vision-centric reasoning tasks set in everyday scenarios, in which perception and reasoning are tightly coupled. VisReason comprises 1,505 curated questions across 10 categories, annotated along fine-grained dimensions of perceptual, structural, and conceptual reasoning. Comparison against human baselines reveals substantial gaps in current MLLMs’ capabilities on such tasks and shows that existing test-time reasoning strategies offer limited benefit. The benchmark thus provides a diagnostic tool for vision-centric reasoning and a direction for future model development.
📝 Abstract
Recent multimodal large language models (MLLMs) achieve strong performance on visual reasoning benchmarks, yet it remains unclear to what extent such performance reflects reasoning directly grounded in visual evidence. We introduce VisReason, a benchmark for vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. VisReason contains 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning. Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing substantial gaps between humans and current MLLMs and revealing limited benefits from test-time reasoning strategies. VisReason offers a focused diagnostic for evaluating vision-centric reasoning beyond language.
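As a rough illustration of how results on a benchmark organized this way might be broken down, the sketch below computes accuracy per reasoning dimension. The record fields (`dimension`, `answer`, `prediction`) and the toy records are assumptions made for illustration; they are not the VisReason data format, its evaluation protocol, or results from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of correct predictions} for records grouped by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        # Exact-match scoring; a real benchmark may use a different matching rule.
        correct[group] += int(record["prediction"] == record["answer"])
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy records, NOT data from the paper.
records = [
    {"dimension": "perceptual", "answer": "A", "prediction": "A"},
    {"dimension": "perceptual", "answer": "B", "prediction": "C"},
    {"dimension": "structural", "answer": "D", "prediction": "D"},
    {"dimension": "conceptual", "answer": "A", "prediction": "B"},
]

scores = accuracy_by_group(records, "dimension")
print(scores)
```

Grouping by a different key, for example a hypothetical `category` field, would give the finer per-category breakdown; running the same function on human answers and on model answers would expose the gap between them per group.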
Problem

Research questions and friction points this paper is trying to address.

vision-centric reasoning
multimodal large language models
visual reasoning benchmark
perception-inference coupling
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-centric reasoning
multimodal large language models
VisReason benchmark
visual evidence grounding
reasoning evaluation
🔎 Similar Papers
Longteng Guo
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Yifan Wang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Pengkang Huo
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Tailai Chen
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Yuze Wu
Zhejiang University
Control & Planning, Robot Learning, Embodied Intelligence
Jing Liu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Xinxin Zhu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences