AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle to effectively retrieve, align, and aggregate scattered critical evidence across multiple videos, often overlooking rare yet pivotal information. This work proposes a multi-agent active reasoning framework that formulates the task as an active evidence acquisition process, wherein a central agent coordinates specialized visual and audio agents to iteratively perform goal-directed evidence extraction. By incorporating LLM-generated semantic scripts and a lightweight text-based simulator, the framework enables efficient training via reinforcement learning while circumventing the computational overhead of online multimodal inference. Evaluated on a comprehensive cross-video reasoning benchmark, the method substantially outperforms single-pass reasoning baselines and achieves performance on par with state-of-the-art closed-source systems in complex alignment and localization tasks.

📝 Abstract

Cross-Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLLMs) often struggle with CVR, as simple single-pass strategies encode multiple videos into a shared compressed context, potentially obscuring rare but critical evidence. In this paper, we propose AgentCVR, a multi-agent framework that treats CVR as an active evidence-acquisition task. AgentCVR employs a Master Agent to iteratively coordinate specialized Visual and Audio Agents for targeted evidence extraction. To ensure efficient training, we introduce Script-Simulated RL, which optimizes the agent's policy with LLM-generated semantic scripts and a lightweight text-based simulator, bypassing costly multimodal inference during online exploration. Experimental results on a comprehensive CVR benchmark show that AgentCVR outperforms single-pass baselines and achieves comparable performance to state-of-the-art closed-source systems, particularly in complex cross-video alignment and localization. To ensure reproducibility, our code is available at https://github.com/wang-jh24/AgentCVR.
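The active evidence-acquisition loop and the text-based simulator described above can be illustrated with a minimal sketch. It is not the authors' implementation: the script contents, the `ScriptSimulator` class, the `toy_policy` function, and the reward shape (a success bonus minus a per-query cost) are all hypothetical, chosen only to show how a policy could be rolled out against text scripts without running any multimodal model.

```python
from dataclasses import dataclass

# Hypothetical semantic scripts: text lines standing in for what a visual or
# audio agent would report for each video. All content here is invented.
SCRIPTS = {
    "video_a": {
        "visual": ["a chef chops onions", "a red pan is placed on the stove"],
        "audio": ["the narrator names the dish"],
    },
    "video_b": {
        "visual": ["a red pan appears next to a blue kettle"],
        "audio": ["background music plays",
                  "someone mentions cooking for ten minutes"],
    },
}


@dataclass
class ScriptSimulator:
    """Answers agent queries from text scripts instead of real video/audio."""
    scripts: dict
    calls: int = 0

    def query(self, video, modality, keyword):
        # Count every query so the episode reward can penalize extra calls.
        self.calls += 1
        lines = self.scripts.get(video, {}).get(modality, [])
        return [line for line in lines if keyword.lower() in line.lower()]


def toy_policy(evidence):
    """Stand-in for the master agent: picks the next query from evidence so far."""
    if not evidence:
        return ("video_a", "visual", "pan")
    if len(evidence) == 1:
        return ("video_b", "audio", "minutes")
    return None  # stop: enough evidence gathered


def run_episode(sim, policy, answer_keyword, max_steps=4, step_cost=0.1):
    """Roll out one episode and return (evidence, reward).

    The reward is an assumed shape: 1.0 if the gathered evidence contains
    the answer keyword, minus a small cost per simulator query.
    """
    evidence = []
    for _ in range(max_steps):
        action = policy(evidence)
        if action is None:
            break
        video, modality, keyword = action
        evidence.extend(sim.query(video, modality, keyword))
    found = any(answer_keyword in line for line in evidence)
    reward = (1.0 if found else 0.0) - step_cost * sim.calls
    return evidence, reward


if __name__ == "__main__":
    sim = ScriptSimulator(SCRIPTS)
    evidence, reward = run_episode(sim, toy_policy, "ten minutes")
    print(evidence, reward)
```

Because every query is answered by string matching over scripts, an episode costs almost nothing to run, which is the property that makes online exploration cheap in this kind of setup. In a real training loop the hand-written `toy_policy` would be replaced by a learned policy updated from the episode rewards.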

Problem

Research questions and friction points this paper is trying to address.

Cross-Video Reasoning

Multimodal Large Language Models

Evidence Aggregation

Video Alignment

Active Reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Framework

Cross-Video Reasoning

Script-Simulated Reinforcement Learning