🤖 AI Summary
Long video understanding faces significant challenges, including high spatiotemporal complexity and difficulty in modeling long-range contextual dependencies; existing LLM-based approaches exhibit limited performance on information-dense videos spanning several hours. To address this, we propose the first agentic video search paradigm—replacing rigid, hand-crafted pipelines with an autonomous, LLM-driven framework that dynamically plans queries, invokes specialized tools, and iteratively refines reasoning over a multi-granularity video database. Our core contributions include: (1) a hierarchical video segmentation and indexing scheme enabling efficient spatiotemporal retrieval; (2) a retrieval-augmented toolkit tailored for video-specific operations (e.g., temporal localization, semantic summarization); and (3) an LLM-guided closed-loop mechanism for state-aware reasoning and adaptive parameter generation. Evaluated on LVBench and other long-video benchmarks, our method achieves new state-of-the-art performance, substantially outperforming prior approaches. Ablation studies confirm both the individual efficacy and synergistic benefits of each component.
📝 Abstract
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis and long-context handling, they still exhibit limitations when processing information-dense, hour-long videos. To overcome these limitations, we propose the Deep Video Discovery (DVD) agent, which leverages an agentic search strategy over segmented video clips. Unlike previous video agents that manually design a rigid workflow, our approach emphasizes the autonomous nature of agents. Given a set of search-centric tools over a multi-granular video database, the DVD agent leverages the advanced reasoning capability of LLMs to plan based on its current observation state, strategically select tools, formulate appropriate parameters for actions, and iteratively refine its internal reasoning in light of the gathered information. Comprehensive evaluations on multiple long-video understanding benchmarks demonstrate the advantage of the overall system design. Our DVD agent achieves state-of-the-art performance, surpassing prior works by a large margin on the challenging LVBench dataset. We also provide comprehensive ablation studies and in-depth tool analyses, yielding insights to further advance intelligent agents tailored to long-form video understanding tasks. The code will be released later.
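To make the agentic loop described above concrete, here is a minimal sketch of the plan → tool call → observe → refine cycle. All names here (the toy database, `search_clips`, `get_summary`, and the stubbed planner standing in for the LLM) are illustrative assumptions, not the paper's actual tools or prompts:

```python
from dataclasses import dataclass, field

# Toy multi-granularity video database: a global summary plus clip-level captions.
VIDEO_DB = {
    "global_summary": "A day-long city festival with parades and concerts.",
    "clips": {
        0: "morning parade on main street",
        1: "afternoon food market scenes",
        2: "evening concert on the riverfront",
    },
}

def search_clips(query: str) -> list[int]:
    """Hypothetical clip-level retrieval tool: naive keyword match."""
    words = query.lower().split()
    return [cid for cid, cap in VIDEO_DB["clips"].items()
            if any(w in cap for w in words)]

def get_summary() -> str:
    """Hypothetical global-summary tool (coarsest granularity)."""
    return VIDEO_DB["global_summary"]

@dataclass
class AgentState:
    question: str
    observations: list = field(default_factory=list)

def stub_planner(state: AgentState):
    """Stand-in for the LLM planner: inspect the state and pick the next tool.
    A real agent would reason over observations to choose tools and parameters."""
    if not state.observations:
        return ("get_summary", None)          # start coarse
    if len(state.observations) == 1:
        return ("search_clips", state.question)  # then search fine-grained clips
    return ("answer", None)                   # enough evidence gathered

def run_agent(question: str, max_steps: int = 5) -> list:
    state = AgentState(question)
    for _ in range(max_steps):
        tool, arg = stub_planner(state)
        if tool == "answer":
            break
        result = get_summary() if tool == "get_summary" else search_clips(arg)
        state.observations.append((tool, result))  # feed observation back in
    return state.observations

obs = run_agent("concert riverfront")
```

In this toy run the agent first reads the global summary, then retrieves the matching clip (clip 2) before deciding it can answer; the real system would let the LLM generate tool choices and parameters at each step instead of the fixed stub.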