VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

📅 2024-05-29

🏛️ arXiv.org

📈 Citations: 34

✨ Influential: 2

🤖 AI Summary

Long-form video understanding is hampered by highly redundant video data and abundant query-irrelevant content. To address this, we propose VideoTree, a training-free framework that builds a query-adaptive, hierarchical, tree-structured video representation for efficient reasoning with large language models (LLMs). Its key components are: (1) an iterative, query-driven keyframe selection process; (2) a multi-granularity tree representation that exploits the inherent hierarchical structure of long videos to extract query-relevant details in a coarse-to-fine manner; and (3) aggregation of the hierarchical query-relevant information, which is then fed to an LLM to answer the query. On EgoSchema and NExT-QA, VideoTree achieves 61.1% and 75.6% accuracy, respectively, outperforming existing training-free approaches with less inference time. On the long split of Video-MME (average video length 44 minutes), it performs better than GPT-4V and many other MLLMs that were extensively trained on video data.

📝 Abstract

Long-form video understanding is complicated by the high redundancy of video data and the abundance of query-irrelevant information. To tackle these challenges, we propose VideoTree, a training-free framework which builds a query-adaptive and hierarchical video representation for LLM reasoning over long-form videos. First, VideoTree extracts query-relevant information from the input video through an iterative process, progressively refining the selection of keyframes based on their relevance to the query. Furthermore, VideoTree leverages the inherent hierarchical structure of long video data, which is often overlooked by existing LLM-based methods. Specifically, we incorporate multi-granularity information into a tree-based representation, allowing VideoTree to extract query-relevant details from long videos in a coarse-to-fine manner. This enables the model to effectively handle a wide range of video queries with varying levels of detail. Finally, VideoTree aggregates the hierarchical query-relevant information within the tree structure and feeds it into an LLM reasoning model to answer the query. Our experiments show that our method improves both reasoning accuracy and efficiency. Specifically, VideoTree outperforms existing training-free approaches on EgoSchema and NExT-QA with less inference time, achieving 61.1% and 75.6% accuracy on the test set without additional video-specific training. Moreover, on the long split of Video-MME (average 44 minutes), VideoTree achieves better performance than GPT-4V and many other MLLMs that were extensively trained on video data.
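The iterative keyframe selection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how frames are grouped or how relevance is scored, so the sketch assumes frames are clustered by feature similarity with a tiny k-means and that relevance is the cosine similarity between a keyframe feature and a query embedding. The names `select_keyframes`, `min_relevant`, `threshold`, and `max_k` are hypothetical, and the 2-D features are toy data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(features, k, iters=10):
    """Tiny deterministic k-means; returns non-empty clusters of frame indices."""
    step = len(features) / k
    centroids = [features[int(i * step)] for i in range(k)]  # evenly spaced init
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, f in enumerate(features):
            best = max(range(k), key=lambda c: cosine(f, centroids[c]))
            clusters[best].append(idx)
        # Empty clusters keep their previous centroid.
        centroids = [mean([features[i] for i in c]) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return [c for c in clusters if c]

def select_keyframes(features, query, min_relevant=2, threshold=0.8, max_k=8):
    """Refine the clustering until enough query-relevant keyframes are found.

    Assumption: relevance is cosine similarity to the query embedding.
    """
    k = 2
    while True:
        keyframes = []
        for cluster in kmeans(features, k):
            center = mean([features[i] for i in cluster])
            # The frame closest to the cluster center represents the cluster.
            keyframes.append(max(cluster, key=lambda i: cosine(features[i], center)))
        relevant = [i for i in keyframes if cosine(features[i], query) >= threshold]
        if len(relevant) >= min_relevant or k >= max_k or k >= len(features):
            return sorted(relevant)  # temporal order
        k *= 2  # too few relevant keyframes: look at the video at finer granularity

# Toy data: frames pointing along the x-axis match the query, the rest do not.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
            [0.0, 1.0], [0.1, 0.9], [1.0, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]
result = select_keyframes(features, query)
```

The loop widens the selection only when the current granularity yields too few relevant keyframes, which is one plausible reading of "progressively refining the selection of keyframes based on their relevance to the query".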

Problem

Research questions and friction points this paper is trying to address.

Addresses high redundancy in long-form video data.

Extracts query-relevant information hierarchically from videos.

Improves reasoning accuracy and efficiency for video queries.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Query-adaptive hierarchical video representation

Iterative keyframe selection for relevance

Tree-based multi-granularity information extraction
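As a rough illustration of the tree-based, coarse-to-fine idea listed above, the sketch below expands only the segments judged relevant to the query and keeps irrelevant segments as single coarse leaves. It is a simplification under stated assumptions, not the paper's algorithm: segments are split uniformly in time, the middle frame stands in for a segment, and `relevance` is a caller-supplied scoring function. All names (`Node`, `build_video_tree`, `collect_keyframes`, `top_segments`, `branching`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    frames: List[int]                                    # frame indices covered
    children: List["Node"] = field(default_factory=list)

    @property
    def keyframe(self) -> int:
        """Middle frame, used here as the segment's representative (assumption)."""
        return self.frames[len(self.frames) // 2]

def split(frames: List[int], parts: int) -> List[List[int]]:
    """Split a frame list into contiguous chunks of roughly equal size."""
    size = max(1, -(-len(frames) // parts))  # ceiling division
    return [frames[i:i + size] for i in range(0, len(frames), size)]

def build_tree(frames, relevance: Callable[[int], float], depth: int,
               branching: int = 2, threshold: float = 0.5) -> Node:
    """Expand a segment into finer children only if its keyframe is relevant."""
    node = Node(frames)
    if depth > 0 and len(frames) > 1 and relevance(node.keyframe) >= threshold:
        node.children = [build_tree(part, relevance, depth - 1, branching, threshold)
                         for part in split(frames, branching)]
    return node

def build_video_tree(num_frames: int, relevance, top_segments: int = 4,
                     depth: int = 2) -> Node:
    """Root covers the whole video; its children are coarse top-level segments."""
    frames = list(range(num_frames))
    root = Node(frames)
    root.children = [build_tree(part, relevance, depth)
                     for part in split(frames, top_segments)]
    return root

def collect_keyframes(node: Node) -> List[int]:
    """Gather leaf keyframes left to right, i.e. in temporal order."""
    if not node.children:
        return [node.keyframe]
    out: List[int] = []
    for child in node.children:
        out.extend(collect_keyframes(child))
    return out

# Toy relevance: only frames 8..11 matter for the (imaginary) query.
tree = build_video_tree(16, lambda i: 1.0 if 8 <= i < 12 else 0.0)
keyframes = collect_keyframes(tree)
```

In this toy run the relevant quarter of the video ends up represented by every one of its frames, while each irrelevant quarter contributes a single keyframe. In a full pipeline the collected keyframes would then be described (for example, captioned) and passed to the LLM in temporal order.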
