LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

📅 2025-03-13
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) face significant bottlenecks in temporal modeling and deep understanding of long videos, and single-agent paradigms struggle with complex temporal reasoning. This paper introduces LVAgent, the first multi-MLLM agent framework supporting multi-round dynamic collaboration for long-video understanding. It employs a four-stage mechanism: task-driven agent selection, temporal-aware video-chunk retrieval, rationale-exchanging collaborative reasoning, and dynamic reflective optimization, enabling progressive comprehension of long videos. Key contributions include: (1) the first multi-MLLM, multi-round collaborative paradigm; (2) a task-aware, self-adaptive agent-team formation and iterative optimization mechanism; and (3) relief of the fundamental limitations of single-agent temporal modeling. Evaluated on four mainstream long-video benchmarks, LVAgent achieves an accuracy of 80%, improving on the state of the art by up to 14.3% on LongVideoBench; it is the first agent-based method to surpass both closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL).

📝 Abstract
Existing Multimodal Large Language Models (MLLMs) encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream agent-based methods use external tools (e.g., search engines, memory banks, OCR, retrieval models) to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. To better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our methodology consists of four key steps: 1. Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2. Perception: We design an effective retrieval scheme for long videos, improving the coverage of critical temporal segments while maintaining computational efficiency. 3. Action: Agents answer long-video-related questions and exchange rationales. 4. Reflection: We evaluate the performance of each agent in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers through this multi-round dynamic collaboration. LVAgent is the first agent-system method that outperforms all closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL) on long video understanding tasks. LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, on the LongVideoBench dataset, LVAgent improves accuracy by up to 14.3% compared with the SOTA.
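
The four-step loop above composes naturally into a single iterative pipeline. The following minimal Python sketch shows one way Selection, Perception, Action, and Reflection could fit together; the Agent interface, keyword-overlap retrieval, majority-vote answer, and score-based pruning are all illustrative assumptions inferred from the abstract, not the authors' implementation.

```python
# Minimal sketch of a Selection / Perception / Action / Reflection loop,
# inferred from the abstract. All interfaces here are assumptions.
from collections import Counter
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    # model(question, chunks) -> (answer, rationale)
    model: Callable[[str, list[str]], tuple[str, str]]
    score: float = 0.0  # running performance across discussion rounds

def select_team(library: list[Agent], k: int = 3) -> list[Agent]:
    """Selection: form a team from the k best-scoring agents in the library."""
    return sorted(library, key=lambda a: a.score, reverse=True)[:k]

def retrieve_chunks(chunks: list[str], question: str, k: int = 4) -> list[str]:
    """Perception: rank video-chunk captions by keyword overlap with the question."""
    terms = set(question.lower().split())
    return sorted(chunks, key=lambda c: len(terms & set(c.lower().split())),
                  reverse=True)[:k]

def lvagent(question: str, chunks: list[str], library: list[Agent],
            rounds: int = 3) -> str:
    team = select_team(library)                                # 1. Selection
    answer = ""
    for _ in range(rounds):
        evidence = retrieve_chunks(chunks, question)           # 2. Perception
        outputs = [(a, *a.model(question, evidence)) for a in team]  # 3. Action
        answer, votes = Counter(ans for _, ans, _ in outputs).most_common(1)[0]
        for agent, ans, _ in outputs:                          # 4. Reflection
            agent.score += 1.0 if ans == answer else -1.0
        team = [a for a in team if a.score >= 0] or team       # prune weak agents
        if votes == len(outputs):                              # full consensus
            break
    return answer
```

Exchanging rationales, not just answers, in the Action step is what lets agents revise their reasoning in later rounds; the sketch compresses this into the per-round outputs for brevity.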
Problem

Research questions and friction points this paper is trying to address.

Challenges in modeling temporal context in long videos.
Limited performance of a single MLLM in long video understanding.
Need for dynamic collaboration among MLLM agents for better accuracy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-round dynamic collaboration of MLLM agents
Effective retrieval scheme for long videos (see the sketch after this list)
Dynamic agent team optimization based on performance
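
A hedged sketch of the retrieval idea, covering critical temporal segments without scoring every frame: it assumes chunks are ranked by a precomputed question-chunk similarity and selected per temporal window. The windowing policy and function below are illustrations, not the paper's actual scheme.

```python
import math

def temporal_aware_retrieve(sims: list[float], k: int = 8,
                            windows: int = 4) -> list[int]:
    """Pick k chunk indices, forcing coverage across the whole timeline.

    sims[i] is an assumed precomputed question-chunk similarity for chunk i,
    in temporal order. Taking the best chunks per temporal window keeps every
    part of a long video represented instead of clustering all picks in one
    highly similar region.
    """
    n = len(sims)
    size = math.ceil(n / windows)          # chunks per temporal window
    per_window = max(1, k // windows)      # picks allotted to each window
    picked: list[int] = []
    for w in range(windows):
        window = range(w * size, min((w + 1) * size, n))
        picked += sorted(window, key=lambda i: sims[i], reverse=True)[:per_window]
    return sorted(picked)[:k]              # return picks in temporal order
```

Selecting the best chunks per window guarantees that evidence from the early, middle, and late parts of a long video all survive retrieval, rather than letting one visually repetitive segment dominate the top-k.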
👥 Authors

Boyu Chen
The University of Sydney
Neural Architecture Search, Transformer

Zhengrong Yue
Shanghai Jiao Tong University, PhD
Unified Multimodal Modeling, Video Understanding, Video Generation

Siran Chen
University of Chinese Academy of Sciences
Semiconductor, AI Model

Zikang Wang
Institute of Automation, Chinese Academy of Sciences

Yang Liu
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China; Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory

Peng Li
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory

Yali Wang
Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory