🤖 AI Summary
Current multimodal large language models (MLLMs) lack systematic evaluation for human-centric video understanding, primarily due to the absence of comprehensive benchmarks: existing ones emphasize generation quality and action recognition while neglecting higher-order cognitive capabilities such as attribute perception, emotion recognition, social relationship inference, and intention reasoning, and they are further constrained by single-question formats and coarse-grained metrics. Method: We propose HV-MMBench, a holistic benchmark dedicated to human-centric video understanding, covering 15 tasks across 50 diverse scenarios, supporting four question formats (multiple-choice, fill-in-the-blank, true/false, and open-ended QA), video durations from 10-second clips to 30-minute videos, and multi-level cognitive dimensions from basic attribute perception to advanced reasoning. It is constructed through a pipeline of frame sampling, annotation refinement, question generation, and fine-grained metric design; a sketch of the first step follows below. Contribution/Results: Empirical evaluation exposes critical weaknesses of state-of-the-art MLLMs in social reasoning and temporal modeling, establishing HV-MMBench as a reproducible, extensible evaluation paradigm for future research.
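As a concrete illustration of the frame-sampling step, here is a minimal sketch using OpenCV. The paper does not specify its sampling strategy; uniform sampling, the function name, and the default of 16 frames are assumptions made purely for illustration.

```python
import cv2

def sample_frames(video_path: str, num_frames: int = 16) -> list:
    """Uniformly sample `num_frames` frames from a video.

    Illustrative only: HV-MMBench's actual sampling strategy is not
    described here, so uniform spacing is an assumption.
    """
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the full duration.
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```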
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address these limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 15 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-the-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30 minutes) durations, supporting systematic analysis of models' temporal reasoning abilities across diverse contextual lengths.
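To make feature (2) concrete, below is a minimal sketch of how a multi-format benchmark item might be represented and scored. The item schema, field names, and per-format scoring rules (exact match for closed formats, token-overlap F1 as a stand-in for open-ended QA) are illustrative assumptions, not HV-MMBench's actual protocol.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """Hypothetical item schema; field names are assumptions."""
    video_id: str
    question_type: str  # "multiple_choice" | "fill_in_blank" | "true_false" | "open_ended"
    question: str
    answer: str

def score(item: BenchmarkItem, prediction: str) -> float:
    """Dispatch scoring on question type (illustrative rules, not the paper's)."""
    pred = prediction.strip().lower()
    gold = item.answer.strip().lower()
    if item.question_type in ("multiple_choice", "true_false"):
        # Closed-form answers: exact string match.
        return float(pred == gold)
    if item.question_type == "fill_in_blank":
        # Accept the gold phrase anywhere in the prediction.
        return float(gold in pred)
    # Open-ended QA: bag-of-words F1 as a placeholder for the paper's
    # fine-grained metrics, which are not specified in this abstract.
    pred_tokens, gold_tokens = pred.split(), gold.split()
    common = sum(min(pred_tokens.count(t), gold_tokens.count(t))
                 for t in set(gold_tokens))
    if not common:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

In practice, a real harness would replace the open-ended branch with whatever fine-grained metric the benchmark defines; the dispatch structure is the point here, since a single accuracy number cannot cover all four formats.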