🤖 AI Summary
This work addresses the insufficient evaluation of the perception, comprehension, and reasoning capabilities of multimodal large language models (MLLMs) in human-centric visual scenarios. To this end, we propose the first systematic benchmark framework spanning three hierarchical capabilities (perception, comprehension, and reasoning), comprising nine dimensions and more than 6,000 manually verified multiple-choice questions. It also introduces a novel class of complex video-reasoning tasks that require actively extracting visual evidence, accompanied by human-annotated chain-of-thought rationales and precise visual evidence localization. Extensive evaluation across 30+ state-of-the-art MLLMs reveals critical deficiencies in spatial relation modeling, temporal dynamics understanding, and theory-of-mind reasoning. Notably, merely scaling the visual context or adding test-time thinking yields only marginal improvements. This benchmark enables fine-grained capability diagnosis and provides actionable insights for future MLLM architecture design.
📝 Abstract
The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite that probes MLLMs' capabilities in human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple-choice questions assessing a broad range of tasks across 9 dimensions, including essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging, manually curated video reasoning test that requires integrating multiple pieces of visual evidence, proactively extracting context beyond the question's cues, and applying human-like expertise. Each question includes a human-annotated Chain-of-Thought (CoT) rationale with key visual evidence to support further research. Extensive evaluations of over 30 state-of-the-art models reveal significant challenges in human-centric visual understanding, particularly in tasks involving detailed spatial perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R shows that models struggle to proactively extract essential visual evidence from diverse human scenes and instead rely, often faultily, on query-guided retrieval. Even advanced techniques such as scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.
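For intuition, here is a minimal Python sketch of how a Human-R item and a simple accuracy metric might be represented. The schema is our own assumption for illustration: field names such as `cot_rationale` and `evidence_spans` are hypothetical and need not match the released data format.

```python
from dataclasses import dataclass, field

@dataclass
class HumanRItem:
    """One hypothetical video-reasoning item: question, options, gold answer,
    plus the human-annotated CoT rationale and evidence localization."""
    video_path: str
    question: str
    options: list[str]      # e.g. ["A. ...", "B. ...", ...]
    answer: str             # gold option letter, e.g. "B"
    cot_rationale: str      # human-written chain-of-thought rationale
    evidence_spans: list[tuple[float, float]] = field(default_factory=list)
    # (start_sec, end_sec) intervals localizing the key visual evidence

def accuracy(items: list[HumanRItem], predict) -> float:
    """Score a model callable `predict(item) -> option letter` by exact match."""
    if not items:
        return 0.0
    return sum(predict(item) == item.answer for item in items) / len(items)

if __name__ == "__main__":
    # Trivial baseline that always answers "A" on a single demo item.
    demo = [HumanRItem(
        video_path="clip_000.mp4",
        question="Why does the person pause before speaking?",
        options=["A. ...", "B. ..."],
        answer="B",
        cot_rationale="The pause follows eye contact with the listener ...",
        evidence_spans=[(3.2, 5.8)],
    )]
    print(accuracy(demo, lambda item: "A"))  # -> 0.0
```

Keeping the rationale and evidence spans alongside each question is what enables the fine-grained diagnosis the paper describes: a scorer can check not only the final answer but also whether a model attended to the annotated evidence intervals.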