🤖 AI Summary
Existing vision-language models (VLMs) struggle with complex, iterative reasoning in video understanding, while agent-based approaches rely on proprietary models or extensive reinforcement learning (RL) training.
Method: We propose a human-inspired, three-stage reasoning framework, Retrieve-Perceive-Review, that combines global exploration with local analysis for evidence retrieval and iterative refinement. Our approach constructs an entity-graph-based structured video knowledge base and integrates reasoning-capable large language models (LLMs), lightweight computer vision (CV) models, and VLMs via multi-granularity tool coordination and open-model fusion, without any RL training.
Contribution/Results: Evaluated on long-video benchmarks including LVBench and VideoMME-Long, our method achieves state-of-the-art (SOTA) or near-SOTA performance. It significantly enhances reasoning interpretability and system openness, enabling transparent, modular, and scalable video understanding.
📝 Abstract
Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos in a single pass, with limited support for evidence revisiting and iterative refinement. Recently emerging agent-based methods enable long-horizon reasoning, but they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that mirrors human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent's interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight CV models and VLMs, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.
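The three-phase Retrieve-Perceive-Review loop described above can be sketched as a simple control flow. This is an illustrative stub under assumed names (`search`, `perceive`, `review`, `answer` are hypothetical, not AVI's actual API), with the knowledge base reduced to a dictionary of entity names mapped to clip identifiers:

```python
# Hedged sketch of a Retrieve-Perceive-Review loop; all function names and
# data structures are illustrative assumptions, not the paper's implementation.

def search(kb, query):
    # Retrieve: global exploration over the entity-graph knowledge base,
    # stubbed here as substring matching over entity names.
    return [seg for name, seg in kb.items() if query in name]

def perceive(segments):
    # Perceive: focused local analysis of retrieved clips (in AVI this would
    # invoke lightweight CV models and a VLM); here a pass-through stub.
    return segments

def review(evidence, question):
    # Review: a reasoning LLM would judge evidence sufficiency and refine the
    # query; this trivial stand-in stops once any evidence is found.
    return len(evidence) > 0

def answer(question, kb, max_rounds=3):
    evidence = []
    for _ in range(max_rounds):          # iterative refinement budget
        evidence += perceive(search(kb, question))
        if review(evidence, question):   # stop when evidence suffices
            break
    return evidence

kb = {"dog": "clip_02", "red car": "clip_07"}
print(answer("car", kb))  # → ['clip_07']
```

The point of the loop structure is that retrieval can be re-triggered after review, which is what distinguishes this design from single-pass VLM inference.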