HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video understanding benchmarks struggle to evaluate how well large multimodal models align semantics hierarchically and across modalities in complex narratives. To address this gap, this work proposes HAVEN, the first comprehensive benchmark that explicitly aligns video and text across three granularities (frames, shots, and full videos) through hierarchical human annotations and a cross-modal alignment framework. The benchmark supports multi-task evaluation, including summarization, temporal reasoning, multimodal grounding, and saliency ranking, moving beyond conventional question-answering formats. Benchmarking reveals a persistent gap between current models' textual fluency and their grounded cross-modal understanding. The dataset, benchmark suite, and evaluation protocols are publicly released as a standardized testbed for research in interpretable, hierarchical video understanding.

📝 Abstract

While Multimodal Large Language Models (MLLMs) exhibit strong performance on standard video tasks, their ability to faithfully summarize and reason over complex narratives remains poorly evaluated. Existing summarization benchmarks fragment supervision across isolated granularities, such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment. To address this critical gap, we introduce HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities. Built upon this unified annotation paradigm, we propose a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding, and saliency ranking. Extensive benchmarking of state-of-the-art MLLMs exposes a persistent gap between surface-level textual fluency and grounded multimodal understanding. Ultimately, HAVEN advances the evaluation of multimodal systems beyond traditional QA formats, offering a rigorous, standardized testbed to drive future research in interpretable, hierarchical video understanding. We publicly release the dataset, benchmark suite, and evaluation protocols.
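To make the three-level structure concrete, the following is a minimal sketch of how frame, shot, and video annotations with aligned text could be nested and checked for consistency. It is an illustration only: the class names, field names, and the toy example are assumptions introduced here, not the released HAVEN schema or data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical schema: all names below are illustrative assumptions,
# not the actual HAVEN annotation format.
@dataclass
class FrameAnnotation:
    timestamp: float  # seconds from the start of the video
    caption: str      # text aligned to this frame


@dataclass
class ShotAnnotation:
    start: float      # shot start time in seconds
    end: float        # shot end time in seconds
    summary: str      # text aligned to this shot
    frames: List[FrameAnnotation] = field(default_factory=list)


@dataclass
class VideoAnnotation:
    video_id: str
    duration: float   # total length in seconds
    summary: str      # text aligned to the whole video
    shots: List[ShotAnnotation] = field(default_factory=list)


def is_hierarchically_consistent(video: VideoAnnotation) -> bool:
    """Return True if shots are ordered, non-overlapping, and inside the
    video, and every frame timestamp lies inside its parent shot."""
    prev_end = 0.0
    for shot in video.shots:
        if not (prev_end <= shot.start < shot.end <= video.duration):
            return False
        if any(not (shot.start <= f.timestamp <= shot.end) for f in shot.frames):
            return False
        prev_end = shot.end
    return True


def shot_at(video: VideoAnnotation, t: float) -> Optional[ShotAnnotation]:
    """Return the shot whose half-open interval [start, end) contains t."""
    for shot in video.shots:
        if shot.start <= t < shot.end:
            return shot
    return None


# Toy example (made-up content, for illustration only).
example = VideoAnnotation(
    video_id="demo",
    duration=10.0,
    summary="A person enters a kitchen and starts cooking.",
    shots=[
        ShotAnnotation(0.0, 4.0, "A person enters a kitchen.",
                       [FrameAnnotation(1.5, "Door opens.")]),
        ShotAnnotation(4.0, 10.0, "They start cooking.",
                       [FrameAnnotation(6.0, "Pan on the stove.")]),
    ],
)
```

Under these assumptions, a grounding query such as `shot_at(example, 5.0)` maps a timestamp back to its shot-level text, which is the kind of cross-granularity lookup an explicitly aligned annotation makes possible.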

Problem

Research questions and friction points this paper is trying to address.

multimodal large language models

video understanding

hierarchical alignment

summarization benchmark

cross-modal alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical alignment

multimodal benchmark

video understanding