🤖 AI Summary
Existing multimodal large language model (MLLM) evaluation benchmarks overemphasize textual reasoning while lacking systematic assessment of vision-dominant cognitive capabilities. Method: We propose MME-CC, the first fine-grained evaluation framework dedicated to visual cognition, covering three core visual reasoning domains—spatial, geometric, and knowledge-based reasoning—and comprising 11 carefully designed tasks. We conduct comprehensive experiments across 16 state-of-the-art MLLMs. Contribution/Results: Our analysis uncovers consistent deficiencies across models in directional perception, cross-view consistency, and counterfactual instruction following—previously uncharacterized weaknesses. We further identify a prevalent “extract–reason–verify” three-stage chain-of-thought pattern in visual reasoning. Results show closed-source models (e.g., Gemini-2.5-Pro, 42.66/100) outperform open-source counterparts overall, yet all exhibit severe limitations in spatial and geometric reasoning (≤30% accuracy). MME-CC establishes a novel, empirically grounded benchmark for advancing multimodal cognitive evaluation and model development.
📝 Abstract
As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information—spatial, geometric, and knowledge-based reasoning—and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (≤30% accuracy). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract → reason → verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.