🤖 AI Summary
This work addresses the insufficient evaluation of the perception, comprehension, and reasoning capabilities of multimodal large language models (MLLMs) in human-centric visual scenarios. To this end, we propose the first systematic benchmark framework spanning three hierarchical capabilities (perception, comprehension, and reasoning), comprising nine dimensions and more than 6,000 manually verified multiple-choice questions. It also introduces a novel class of complex video-reasoning tasks that require actively extracting visual evidence, accompanied by human-annotated chain-of-thought rationales and precise visual evidence localization. Extensive evaluation across 30+ state-of-the-art MLLMs reveals critical deficiencies in spatial relation modeling, temporal dynamics understanding, and theory-of-mind reasoning. Notably, merely scaling the visual context or adding test-time thinking yields only marginal improvements. This benchmark enables fine-grained capability diagnosis and provides actionable insights for future MLLM architecture design.
📝 Abstract
The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite that probes MLLMs' capabilities in human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple-choice questions assessing a broad range of tasks across 9 dimensions, including essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging, manually curated video reasoning test that requires integrating multiple pieces of visual evidence, proactively extracting context beyond the question's cues, and applying human-like expertise. Each question includes a human-annotated Chain-of-Thought (CoT) rationale with key visual evidence to support further research. Extensive evaluations of over 30 state-of-the-art models reveal significant challenges in human-centric visual understanding, particularly in tasks involving detailed spatial perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R shows that models struggle to proactively extract essential visual evidence from diverse human scenes and instead rely, often faultily, on query-guided retrieval. Even advanced techniques such as scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.
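For intuition, here is a minimal Python sketch of how a Human-R item and a simple accuracy metric might be represented. The schema is our own assumption for illustration: field names such as `cot_rationale` and `evidence_spans` are hypothetical and need not match the released data format.

```python
from dataclasses import dataclass, field

@dataclass
class HumanRItem:
    """One hypothetical video-reasoning item: question, options, gold answer,
    plus the human-annotated CoT rationale and evidence localization."""
    video_path: str
    question: str
    options: list[str]      # e.g. ["A. ...", "B. ...", ...]
    answer: str             # gold option letter, e.g. "B"
    cot_rationale: str      # human-written chain-of-thought rationale
    evidence_spans: list[tuple[float, float]] = field(default_factory=list)
    # (start_sec, end_sec) intervals localizing the key visual evidence

def accuracy(items: list[HumanRItem], predict) -> float:
    """Score a model callable `predict(item) -> option letter` by exact match."""
    if not items:
        return 0.0
    return sum(predict(item) == item.answer for item in items) / len(items)

if __name__ == "__main__":
    # Trivial baseline that always answers "A" on a single demo item.
    demo = [HumanRItem(
        video_path="clip_000.mp4",
        question="Why does the person pause before speaking?",
        options=["A. ...", "B. ..."],
        answer="B",
        cot_rationale="The pause follows eye contact with the listener ...",
        evidence_spans=[(3.2, 5.8)],
    )]
    print(accuracy(demo, lambda item: "A"))  # -> 0.0
```

Keeping the rationale and evidence spans alongside each question is what enables the fine-grained diagnosis the paper describes: a scorer can check not only the final answer but also whether a model attended to the annotated evidence intervals.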