🤖 AI Summary
This work addresses the challenges of information fragmentation and semantic incoherence in long-form video understanding caused by excessively long contextual spans. To this end, the authors propose HAVEN, a unified hierarchical framework that integrates audio-visual entity consistency modeling with an agent-driven, multi-granularity retrieval mechanism. HAVEN employs a four-level structured index—spanning global summaries, scenes, segments, and entities—combined with audio-visual multimodal fusion and entity-level representation alignment to enable dynamic cross-hierarchical retrieval and reasoning. This design preserves temporal coherence and entity consistency throughout the video. Evaluated on LVBench, the method achieves 84.1% overall accuracy and 80.1% on reasoning tasks, significantly outperforming existing approaches.
📝 Abstract
Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions, which rely on naive chunking strategies with retrieval-augmented generation, typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audio-visual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves strong temporal coherence, entity consistency, and retrieval efficiency, establishing a new state of the art with an overall accuracy of 84.1% on LVBench. Notably, it achieves outstanding performance in the challenging reasoning category, reaching 80.1%. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.
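To make the four-level hierarchy and agentic search concrete, here is a minimal, illustrative Python sketch. This is not the authors' implementation: the node structure, the toy keyword-overlap scorer, and the greedy descent are all assumptions standing in for HAVEN's learned representations and agent-driven retrieval; they only show how a query can be routed from the global summary down to scene, segment, and entity levels.

```python
from dataclasses import dataclass, field

# Illustrative sketch (not the authors' code): each index node holds a
# textual description and links to finer-grained child nodes.
@dataclass
class IndexNode:
    level: str          # "summary", "scene", "segment", or "entity"
    text: str           # description used for relevance matching
    children: list = field(default_factory=list)

def keyword_score(node: IndexNode, query: str) -> int:
    """Toy relevance score: number of query words found in the node text.
    A real system would use learned multimodal embeddings instead."""
    return sum(w in node.text.lower() for w in query.lower().split())

def agentic_search(root: IndexNode, query: str, max_hops: int = 3) -> IndexNode:
    """Greedy descent through the hierarchy: at each hop, move to the most
    relevant child. A stand-in for agent-driven multi-granularity retrieval."""
    node = root
    for _ in range(max_hops):
        if not node.children:
            break
        node = max(node.children, key=lambda c: keyword_score(c, query))
    return node

# Tiny hypothetical hierarchy: summary -> scenes -> segments -> entities.
root = IndexNode("summary", "a documentary about arctic wildlife", [
    IndexNode("scene", "polar bears hunting on sea ice", [
        IndexNode("segment", "a bear stalks a seal near a breathing hole", [
            IndexNode("entity", "adult polar bear, tracked across shots"),
        ]),
    ]),
    IndexNode("scene", "narrator discusses climate trends"),
])

hit = agentic_search(root, "which animal hunts the seal")
print(hit.level, "->", hit.text)
```

In this sketch the query descends three hops and lands on the entity node, illustrating how coarse levels narrow the search before fine-grained entity tracking takes over; stopping at fewer hops returns a coarser answer (e.g., a whole scene).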