MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) evaluation benchmarks overemphasize textual reasoning while lacking systematic assessment of vision-dominant cognitive capabilities. Method: We propose MME-CC, the first fine-grained evaluation framework dedicated to visual cognition, covering three core visual reasoning domains—spatial, geometric, and knowledge-based reasoning—and comprising 11 carefully designed tasks. We conduct comprehensive experiments across 16 state-of-the-art MLLMs. Contribution/Results: Our analysis uncovers consistent deficiencies across models in directional perception, cross-view consistency, and counterfactual instruction following—previously uncharacterized weaknesses. We further identify a prevalent “extract–reason–verify” three-stage chain-of-thought pattern in visual reasoning. Results show closed-source models (e.g., Gemini-2.5-Pro, 42.66/100) outperform open-source counterparts overall, yet all exhibit severe limitations in spatial and geometric reasoning (≤30% accuracy). MME-CC establishes a novel, empirically grounded benchmark for advancing multimodal cognitive evaluation and model development.
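The summary's observation of a recurring "extract–reason–verify" chain-of-thought pattern can be made concrete with a small heuristic. The sketch below is illustrative only: the keyword cues and the tag_cot_stages helper are hypothetical, not a method published with MME-CC.

```python
import re

# Hypothetical stage cues: MME-CC does not publish this heuristic.
# These regexes are illustrative keyword markers one might use to
# segment a model's chain-of-thought into the three stages the
# paper reports (extract -> reason -> verify).
STAGE_CUES = {
    "extract": re.compile(r"\b(the image shows|i can see|in the picture|there (is|are))\b", re.IGNORECASE),
    "reason": re.compile(r"\b(therefore|because|this means|it follows|so)\b", re.IGNORECASE),
    "verify": re.compile(r"\b(double-check|verify|confirm|re-examine)\b", re.IGNORECASE),
}

def tag_cot_stages(chain_of_thought: str) -> list[tuple[str, str]]:
    """Label each sentence with the first stage whose cue it matches ('other' if none)."""
    sentences = re.split(r"(?<=[.!?])\s+", chain_of_thought.strip())
    tagged = []
    for sentence in sentences:
        stage = next(
            (name for name, cue in STAGE_CUES.items() if cue.search(sentence)),
            "other",
        )
        tagged.append((stage, sentence))
    return tagged
```

Running this over a model transcript yields a per-sentence stage sequence, which could then be checked for the extract-first, verify-last ordering the paper describes.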

📝 Abstract
As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (≤30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract → reason → verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.
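The abstract quotes per-model scores on a 0-100 scale (e.g., 42.66 for Gemini-2.5-Pro) alongside per-category weaknesses. As a minimal sketch of that bookkeeping, the Python below buckets per-task correctness into the three category accuracies plus an overall score; the task names and the unweighted averaging are assumptions, since the paper's exact aggregation protocol is not spelled out here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical task-to-category mapping, for illustration only; the
# paper's 11 actual task names are not reproduced here.
CATEGORY_OF_TASK = {
    "maze_navigation": "spatial",
    "shape_composition": "geometric",
    "chart_reading": "knowledge-based",
}

def category_scores(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Aggregate (task_name, is_correct) pairs into 0-100 accuracies per category."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for task, is_correct in results:
        buckets[CATEGORY_OF_TASK[task]].append(1.0 if is_correct else 0.0)
    scores = {cat: 100.0 * mean(vals) for cat, vals in buckets.items()}
    # Overall score: unweighted mean over all items (an assumption).
    scores["overall"] = 100.0 * mean(v for vals in buckets.values() for v in vals)
    return scores
```

For example, category_scores([("maze_navigation", True), ("chart_reading", False)]) returns {"spatial": 100.0, "knowledge-based": 0.0, "overall": 50.0}.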
Problem

Research questions and friction points this paper is trying to address.

Assessing multimodal models' vision-centric cognitive capacity through systematic evaluation
Addressing limitations in existing benchmarks for spatial and geometric reasoning tasks
Identifying common error patterns in multimodal models' visual reasoning processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces vision-grounded benchmark for cognitive capacity
Organizes reasoning tasks into three visual categories
Conducts extensive experiments on 16 multimodal models
👥 Authors
Kaiyuan Zhang (ByteDance Seed, Nanjing University)
Chenghao Yang (University of Chicago; Human-AI Alignment, NLP, ML, Communication & Intelligence)
Zhoufutu Wen (ByteDance Seed; LLM Evaluation)
Sihang Yuan (ByteDance Seed, Nanjing University)
Qiuyue Wang (School of Information, Renmin University of China; information extraction, knowledge graph, knowledge reasoning)
Chaoyi Huang (ByteDance Seed, Nanjing University)
Guosheng Zhu (ByteDance Seed, Nanjing University)
He Wang (ByteDance Seed, Nanjing University)
Huawenyu Lu (ByteDance Seed, Nanjing University)
Jianing Wen (ByteDance Seed, Nanjing University)
Jianpeng Jiao (ByteDance Seed, Nanjing University)
Lishu Luo (ByteDance Seed, Nanjing University)
Longxiang Liu (ByteDance Seed, Nanjing University)
Sijin Wu (ByteDance Seed, Nanjing University)
Xiaolei Zhu (ByteDance Seed, Nanjing University)
Xuanliang Zhang (Harbin Institute of Technology; Natural Language Processing, Semantic Parsing, Table Reasoning)
Ge Zhang (ByteDance Seed, Nanjing University)
Yi Lin (ByteDance Seed, Nanjing University)
Guang Shi (ByteDance Seed, Nanjing University)
Chaoyou Fu (Nanjing University; Multimodal LLM, LLM, Biometrics)
Wenhao Huang (ByteDance Seed, Nanjing University)