🤖 AI Summary
Existing vision-language models (VLMs) struggle with complex, iterative reasoning in video understanding, while agent-based approaches rely on proprietary models or extensive reinforcement learning (RL) training.
Method: We propose a human-inspired, three-stage reasoning framework, Retrieve-Perceive-Review, that combines global exploration with local analysis for evidence retrieval and iterative refinement. Our approach constructs an entity-graph-based structured video knowledge base and integrates reasoning-capable large language models (LLMs), lightweight computer vision (CV) models, and VLMs via multi-granularity tool coordination and open-model fusion, without any RL training.
Contribution/Results: Evaluated on long-video benchmarks including LVBench and VideoMME-Long, our method achieves state-of-the-art (SOTA) or near-SOTA performance. It significantly enhances reasoning interpretability and system openness, enabling transparent, modular, and scalable video understanding.
📝 Abstract
Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos in a single pass, with limited support for evidence revisiting and iterative refinement. Recently emerging agent-based methods enable long-horizon reasoning, but they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that mirrors human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent's interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight CV models and VLMs, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.
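The three-phase Retrieve-Perceive-Review loop described above can be sketched as a simple control flow. This is an illustrative stub under assumed names (`search`, `perceive`, `review`, `answer` are hypothetical, not AVI's actual API), with the knowledge base reduced to a dictionary of entity names mapped to clip identifiers:

```python
# Hedged sketch of a Retrieve-Perceive-Review loop; all function names and
# data structures are illustrative assumptions, not the paper's implementation.

def search(kb, query):
    # Retrieve: global exploration over the entity-graph knowledge base,
    # stubbed here as substring matching over entity names.
    return [seg for name, seg in kb.items() if query in name]

def perceive(segments):
    # Perceive: focused local analysis of retrieved clips (in AVI this would
    # invoke lightweight CV models and a VLM); here a pass-through stub.
    return segments

def review(evidence, question):
    # Review: a reasoning LLM would judge evidence sufficiency and refine the
    # query; this trivial stand-in stops once any evidence is found.
    return len(evidence) > 0

def answer(question, kb, max_rounds=3):
    evidence = []
    for _ in range(max_rounds):          # iterative refinement budget
        evidence += perceive(search(kb, question))
        if review(evidence, question):   # stop when evidence suffices
            break
    return evidence

kb = {"dog": "clip_02", "red car": "clip_07"}
print(answer("car", kb))  # → ['clip_07']
```

The point of the loop structure is that retrieval can be re-triggered after review, which is what distinguishes this design from single-pass VLM inference.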