CAViAR: Critic-Augmented Video Agentic Reasoning

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding models perform well on short-video perception tasks but degrade sharply in reasoning capability on long videos and complex queries. To address this, we propose a large language model (LLM)-driven dynamic video reasoning agent framework. It employs LLM-guided stepwise reasoning to selectively invoke specialized video submodules on demand, and integrates a learnable critic network that continuously evaluates and corrects the reasoning trajectory, enabling adaptive responses to varying task complexity and video length. Departing from conventional fixed-pipeline architectures, the framework supports dynamically planned reasoning whose depth adapts to each query. Evaluated on benchmarks including LVBench, Neptune, and ActivityNet-RTL, the method achieves substantial accuracy improvements on complex video reasoning tasks, particularly long-form video understanding, establishing a new paradigm for deep, adaptive video comprehension.

📝 Abstract
Video understanding has seen significant progress in recent years, with models' performance on perception from short clips continuing to rise. Yet multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet-RTL, show that performance wanes on tasks requiring complex reasoning over videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieves strong performance on the previously mentioned datasets.
Problem

Research questions and friction points this paper is trying to address.

Enhancing complex reasoning in long videos
Leveraging perception for advanced video understanding
Improving agentic decision-making with critic feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent uses video modules as tools
Critic distinguishes successful and unsuccessful sequences
Dynamic procedure based on module results
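The agentic loop described above (an LLM planner that calls video modules step by step, while a critic scores the trajectory so far) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation; the `plan_step`, `call_module`, and `critic_score` callables and all thresholds are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Step:
    module: str   # which video module the agent invoked
    args: dict    # arguments passed to the module
    result: str   # the module's output, fed back to the planner

def agent_answer(query, video, plan_step, call_module, critic_score,
                 max_steps=8, min_score=0.5):
    """Hypothetical critic-augmented agent loop (not the paper's code).

    plan_step(query, trajectory)      -> (module_name, args), where
                                         module_name == "answer" terminates
    call_module(video, module, args)  -> result string
    critic_score(query, trajectory)   -> float in [0, 1]
    """
    trajectory = []
    for _ in range(max_steps):
        module, args = plan_step(query, trajectory)
        if module == "answer":
            return args["text"], trajectory
        result = call_module(video, module, args)
        trajectory.append(Step(module, args, result))
        # The critic rates the trajectory so far; on a low score we
        # discard the last step so the planner can try a different
        # module or different arguments on the next iteration.
        if critic_score(query, trajectory) < min_score:
            trajectory.pop()
    return None, trajectory
```

The key contrast with fixed-pipeline approaches (Visual Programming, ViperGPT, MoReVQA) is that `plan_step` sees the accumulated trajectory, so each module result can change what the agent does next, and the critic provides a reject-and-retry signal instead of letting a bad step silently propagate.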