HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding

📅 2025-07-07
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) lack systematic evaluation for human-centric video understanding, largely because comprehensive benchmarks are absent: existing ones emphasize generative quality and action recognition while neglecting higher-order cognitive capabilities such as attribute perception, emotion recognition, social relationship inference, and intention reasoning, and they are further constrained by single-question formats and coarse-grained metrics. Method: We propose HV-MMBench, the first holistic benchmark dedicated to human-centric video understanding, covering 15 tasks across 50 diverse scenarios and supporting multiple question types (multiple-choice, fill-in-the-blank, true/false, open-ended QA), video durations from roughly ten seconds to thirty minutes, and multi-level cognitive dimensions. It is constructed via frame sampling, annotation refinement, question generation, and fine-grained metric design. Contribution/Results: Empirical evaluation exposes critical weaknesses of state-of-the-art MLLMs in social reasoning and temporal modeling, establishing HV-MMBench as a reproducible, extensible evaluation paradigm for future research.
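The summary names four construction stages (frame sampling, annotation refinement, question generation, fine-grained metric design) but does not publish an implementation. Below is a loose sketch of how such a pipeline could be organized; every name, field, and stub body is a hypothetical stand-in, not the paper's code, and the fourth stage (metric design) is sketched separately after the abstract:

```python
# Hypothetical sketch of the construction stages named above; all names are
# illustrative stand-ins and the stage bodies are stubs.
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    video_path: str
    task: str           # one of the 15 tasks, e.g. "emotion_recognition"
    scenario: str       # one of the 50 visual scenarios
    duration_s: float   # 10-second clips up to ~30-minute videos
    question: str
    answer: str
    question_type: str  # multiple_choice | fill_in_blank | true_false | open_ended


def sample_frames(video_path: str, fps: float = 1.0) -> list:
    """Stage 1: frame sampling (stub; a real pipeline would decode the video)."""
    return [f"{video_path}@{t / fps:.1f}s" for t in range(10)]


def refine_annotation(raw_label: str) -> str:
    """Stage 2: annotation refinement (stub for human/model-in-the-loop cleanup)."""
    return raw_label.strip().lower()


def generate_question(label: str, task: str) -> tuple:
    """Stage 3: question generation from a refined label (template-based stub)."""
    return f"What is the person's {task.replace('_', ' ')}?", label, "open_ended"


def build_item(video_path: str, task: str, scenario: str,
               duration_s: float, raw_label: str) -> BenchmarkItem:
    sample_frames(video_path)            # frames would feed the annotation step
    label = refine_annotation(raw_label)
    question, answer, qtype = generate_question(label, task)
    return BenchmarkItem(video_path, task, scenario, duration_s,
                         question, answer, qtype)
```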

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address these limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 15 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-the-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30 minutes) durations, supporting systematic analysis of models' temporal reasoning abilities across diverse contextual lengths.
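The abstract lists four question formats and "diverse evaluation metrics" without specifying the metrics themselves. A minimal sketch of what format-aware scoring could look like follows; the field names, the containment rule for fill-in-the-blank, and the token-level F1 for open-ended answers are assumptions for illustration, not the paper's actual metrics:

```python
# Hypothetical format-aware scorer; field names and metric choices are assumed.
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation for lenient string matching."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def score(item: dict, prediction: str) -> float:
    """Return a score in [0, 1] for one prediction against one benchmark item."""
    gold, pred = normalize(item["answer"]), normalize(prediction)
    qtype = item["question_type"]
    if qtype in ("multiple_choice", "true_false"):
        return float(pred == gold)                  # exact match on the answer token
    if qtype == "fill_in_blank":
        return float(gold != "" and gold in pred)   # lenient containment check
    # Open-ended QA: token-level F1 as a simple stand-in metric.
    gold_toks, pred_toks = gold.split(), pred.split()
    if not gold_toks or not pred_toks:
        return float(gold_toks == pred_toks)
    common = sum(min(gold_toks.count(t), pred_toks.count(t)) for t in set(pred_toks))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_toks), common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, `score({"question_type": "multiple_choice", "answer": "B"}, "b.")` evaluates to 1.0 after normalization.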
Problem

Research questions and friction points this paper is trying to address.

Assessing MLLMs' human-centric video comprehension in the absence of dedicated benchmarks
Evaluating the perceptual and cognitive abilities required in human-centered scenarios
Overcoming single-question paradigms and coarse metrics via diverse tasks, question formats, and metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse evaluation dimensions with 15 tasks
Varied question formats and evaluation metrics
Multi-domain and temporal video coverage (10 seconds to 30 minutes); see the aggregation sketch below
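Since the benchmark spans clip lengths from roughly 10 seconds to 30 minutes, one natural analysis is to bucket per-item scores by duration and report per-task averages within each bucket. A minimal aggregation sketch, where the bucket edges and result-dictionary fields are assumptions rather than the paper's protocol:

```python
# Aggregate per-item scores by task and duration bucket (hypothetical layout).
from collections import defaultdict


def duration_bucket(seconds: float) -> str:
    """Assumed buckets spanning the benchmark's 10 s to 30 min range."""
    if seconds <= 30:
        return "short (<=30 s)"
    if seconds <= 300:
        return "medium (<=5 min)"
    return "long (<=30 min)"


def aggregate(results: list) -> dict:
    """results: a list of dicts like {'task': ..., 'duration_s': ..., 'score': 0..1}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in results:
        key = (r["task"], duration_bucket(r["duration_s"]))
        sums[key] += r["score"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

A per-bucket breakdown like this is what would surface the temporal-modeling weaknesses the summary reports.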