🤖 AI Summary
Existing video understanding methods predominantly rely on single-pass inference, lacking dynamic feedback and self-correction, while reward-modeling and reinforcement learning (RL) approaches suffer from high annotation costs, delayed reward signals, and low inference efficiency. This paper proposes a reward-driven multi-agent video understanding framework whose key novelties are the first real-time frame-level reward model and a tri-perspective (conservative/neutral/aggressive) reflection mechanism, enabling dynamic keyframe selection and iterative answer refinement over multiple rounds. The framework establishes a closed loop integrating inference, optimization, and the automatic construction of high-quality data for supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). Lightweight, modular, and supporting multi-tool collaboration, it achieves significant gains across 12 benchmarks: up to +6.9% in video understanding, +2.1% in video reasoning, and +9.8% in vision-language-action (VLA) model alignment, demonstrating strong generalization and task adaptability.
📝 Abstract
Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool-agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real-time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent-V, a novel agentic video understanding framework that integrates efficient frame selection with real-time reward generation during inference. These reward signals not only guide iterative answer refinement through a multi-perspective reflection mechanism, adjusting predictions from conservative, neutral, and aggressive viewpoints, but also enable automatic filtering of high-quality data for supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). ReAgent-V is lightweight, modular, and extensible, supporting flexible tool integration tailored to diverse tasks. Extensive experiments on 12 datasets across three core applications (video understanding, video reasoning enhancement, and vision-language-action model alignment) demonstrate significant gains in generalization and reasoning, with improvements of up to 6.9%, 2.1%, and 9.8%, respectively, highlighting the effectiveness and versatility of the proposed framework.
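To make the reward-guided reflection concrete, here is a minimal sketch of such an iterative refinement loop. This is an illustrative assumption, not the paper's actual implementation: `reward_fn`, `revise_fn`, the stopping threshold, and the toy demo at the bottom are all hypothetical stand-ins for the framework's real-time reward model and perspective-specific revision agents.

```python
# Minimal sketch of a ReAgent-V-style multi-perspective reflection loop.
# Hypothetical API: reward_fn scores an answer in [0, 1]; revise_fn proposes
# a revision under a given perspective ("conservative"/"neutral"/"aggressive").

def reflect(initial_answer, reward_fn, revise_fn, max_rounds=3, threshold=0.8):
    """Iteratively refine an answer using reward feedback and three
    reflection perspectives; return the final answer and score history."""
    answer = initial_answer
    history = [(answer, reward_fn(answer))]
    for _ in range(max_rounds):
        score = history[-1][1]
        if score >= threshold:          # reward already high enough: stop
            break
        # Each perspective proposes a revision with a different edit budget.
        candidates = [
            revise_fn(answer, style="conservative"),  # minimal change
            revise_fn(answer, style="neutral"),       # moderate change
            revise_fn(answer, style="aggressive"),    # full rewrite
        ]
        # Keep whichever candidate the reward model scores highest.
        scored = [(c, reward_fn(c)) for c in candidates]
        best, best_score = max(scored, key=lambda x: x[1])
        if best_score <= score:         # no improvement: keep current answer
            break
        answer = best
        history.append((answer, best_score))
    return answer, history


# Toy demo: a reward that prefers answers mentioning the key object.
def toy_reward(ans):
    return 1.0 if "red car" in ans else 0.2

def toy_revise(ans, style):
    # Only the aggressive perspective rewrites in this toy example.
    if style == "aggressive":
        return "A red car turns left."
    return ans

final, hist = reflect("A vehicle turns left.", toy_reward, toy_revise)
```

High-reward trajectories collected this way (answer plus score history) are also the natural raw material for the SFT/DPO/GRPO data filtering the abstract describes: keep trajectories whose final reward clears a quality bar, and use improved/original answer pairs as preference data.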