🤖 AI Summary
Existing multimodal large language model (MLLM) evaluation benchmarks overemphasize textual reasoning while lacking systematic assessment of vision-dominant cognitive capabilities. Method: We propose MME-CC, the first fine-grained evaluation framework dedicated to visual cognition, covering three core visual reasoning domains—spatial, geometric, and knowledge-based reasoning—and comprising 11 carefully designed tasks. We conduct comprehensive experiments across 16 state-of-the-art MLLMs. Contribution/Results: Our analysis uncovers consistent deficiencies across models in directional perception, cross-view consistency, and counterfactual instruction following—previously uncharacterized weaknesses. We further identify a prevalent “extract–reason–verify” three-stage chain-of-thought pattern in visual reasoning. Results show closed-source models (e.g., Gemini-2.5-Pro, 42.66/100) outperform open-source counterparts overall, yet all exhibit severe limitations in spatial and geometric reasoning (≤30% accuracy). MME-CC establishes a novel, empirically grounded benchmark for advancing multimodal cognitive evaluation and model development.
📝 Abstract
As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information—spatial, geometric, and knowledge-based reasoning—and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (≤30% accuracy). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract → reason → verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.