🤖 AI Summary
Existing benchmarks inadequately evaluate the logical reasoning capabilities of multimodal large language models (MLLMs), primarily because they lack a fine-grained typology of logical reasoning and entangle reasoning with perceptual skill and factual knowledge.
Method: We introduce MME-Reasoning, a benchmark designed purely for logical reasoning in MLLMs, covering induction, deduction, and abduction. Its questions follow perception-decoupled and knowledge-decoupled design principles, enabling granular capability diagnosis. The methodology combines data-driven multimodal problem construction, explicit logical-type annotation, controllable difficulty scaling, and multi-stage response analysis. We further analyze the "thinking mode" and rule-based reinforcement learning, two approaches commonly believed to enhance reasoning.
Results: Experiments reveal that state-of-the-art MLLMs achieve below 40% overall accuracy on MME-Reasoning, with performance gaps exceeding 25% across the three reasoning types. Moreover, prevailing reasoning-augmentation techniques yield limited and inconsistent improvements.
📝 Abstract
Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite significant advances in multimodal reasoning, existing benchmarks fail to comprehensively evaluate the reasoning abilities of MLLMs, owing to the lack of explicit categorization of logical reasoning types and an unclear understanding of what constitutes reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, covering all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question evaluates reasoning ability rather than perceptual skill or knowledge breadth, and we extend the evaluation protocols to cover diverse question formats. Our evaluation reveals substantial limitations of state-of-the-art MLLMs under holistic assessment of logical reasoning capabilities: even the most advanced MLLMs show limited performance on comprehensive logical reasoning, with notable imbalances across reasoning types. In addition, we conduct an in-depth analysis of approaches such as the "thinking mode" and rule-based RL, which are commonly believed to enhance reasoning abilities. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios, providing comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.