🤖 AI Summary
Long-video understanding faces high temporal complexity and low information density; existing approaches, such as dense frame sampling or LLM-augmented pipelines, entail prohibitive computational overhead. To address this, we propose a vision-language model (VLM)-based autonomous video agent with a novel curiosity-driven intrinsic reward mechanism: the VLM self-generates rewards that guide a tree search to dynamically localize salient video segments. Our method establishes an unsupervised paradigm for adaptive video navigation that requires no external feedback. It integrates hierarchical segment-level reasoning with lightweight intrinsic reward modeling, eliminating both dense frame sampling and external tool invocation. Evaluated on multiple long-video understanding benchmarks, our approach achieves significant performance gains while accelerating inference by 3.2× and reducing computational cost by 76%.
📝 Abstract
Long video understanding poses unique challenges due to the temporal complexity and low information density of long videos. Recent works address this task by sampling numerous frames or by incorporating auxiliary tools through LLMs, both of which incur high computational costs. In this work, we introduce a curiosity-driven video agent with self-exploration capability, dubbed VCA. Built upon VLMs, VCA autonomously navigates video segments and efficiently builds a comprehensive understanding of complex video sequences. Instead of directly sampling frames, VCA employs a tree-search structure to explore video segments and collect frames. Rather than relying on external feedback or rewards, VCA leverages the VLM's self-generated intrinsic rewards to guide its exploration, enabling it to capture the information most crucial for reasoning. Experimental results on multiple long-video benchmarks demonstrate our approach's superior effectiveness and efficiency.
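The exploration loop described above — a VLM scoring video segments and a tree search repeatedly expanding the most promising one — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the `score_segment` callback (standing in for the VLM's self-generated intrinsic reward), the frame budget, the binary splitting, and the midpoint frame sampling are all hypothetical choices.

```python
import heapq

def curiosity_tree_search(num_frames, score_segment, budget=8, branching=2):
    """Hedged sketch of a VCA-style curiosity-driven tree search.

    score_segment(start, end) is a hypothetical stand-in for the VLM's
    self-generated intrinsic reward for the segment [start, end).
    The search repeatedly expands the highest-reward segment, sampling
    one representative frame per visited segment until the budget is met.
    """
    collected = []
    # heapq is a min-heap, so store negated rewards to pop the max first.
    heap = [(-score_segment(0, num_frames), 0, num_frames)]
    while heap and len(collected) < budget:
        neg_r, start, end = heapq.heappop(heap)
        collected.append((start + end) // 2)  # sample the segment's midpoint frame
        if end - start <= 1:
            continue  # segment is a single frame; nothing left to split
        step = max(1, (end - start) // branching)
        for s in range(start, end, step):     # split into child segments
            e = min(end, s + step)
            if e > s and (s, e) != (start, end):
                heapq.heappush(heap, (-score_segment(s, e), s, e))
    return sorted(set(collected))

# Toy intrinsic reward: pretend the VLM is most "curious" near frame 70.
frames = curiosity_tree_search(100, lambda s, e: -abs((s + e) / 2 - 70), budget=6)
print(frames)  # frames concentrate around 70: [50, 62, 68, 69, 71, 75]
```

Because the reward steers expansion, the sampled frames cluster around the region the scorer deems informative instead of being spread uniformly, which is the efficiency argument the abstract makes against dense sampling.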