Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding

📅 2026-03-17
🤖 AI Summary
Existing multimodal large language model (MLLM) agents often struggle with long-video understanding tasks, which combine high information density with long sequences, because of coarse task decomposition, shallow collaboration mechanisms, and loss of critical information. Inspired by human cognitive processes, this work proposes Symphony, a multi-agent system that integrates fine-grained task decomposition, a reflection-enhanced deep collaboration mechanism, and vision-language model (VLM)-driven assessment of video segment relevance to model complex intentions and long-range temporal dependencies. The approach achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, including a 5.0% improvement over the previous best method on LVBench.

📝 Abstract
Despite the rapid development and widespread application of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the temporal context through embedding-based retrieval may discard information that is key to answering complex questions. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, which improves reasoning capability. Additionally, Symphony provides a VLM-based grounding approach that analyzes LVU tasks and assesses the relevance of video segments, which significantly enhances the ability to localize evidence for complex questions with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.
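To make the grounding idea in the abstract concrete, here is a minimal sketch of relevance-based segment selection with a simple retry loop. It is not the authors' implementation (see the linked repository for that). Every name in it (`Segment`, `keyword_scorer`, `ground`, `ground_with_retries`) is hypothetical, the keyword-overlap scorer is only a toy stand-in for a VLM relevance judgment, and the threshold-halving retry is a crude placeholder for the reflection step, whose actual design the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    start_s: float  # segment start time in seconds
    end_s: float    # segment end time in seconds
    caption: str    # stand-in for the visual content a VLM would inspect


# Hypothetical scorer type: (question, segment) -> relevance in [0, 1].
Scorer = Callable[[str, Segment], float]


def _words(text: str) -> set:
    """Lowercase the text and strip basic punctuation from each token."""
    tokens = {w.strip("?.,!").lower() for w in text.split()}
    tokens.discard("")
    return tokens


def keyword_scorer(question: str, segment: Segment) -> float:
    """Toy stand-in for a VLM judgment: share of question words in the caption."""
    q_words = _words(question)
    if not q_words:
        return 0.0
    return len(q_words & _words(segment.caption)) / len(q_words)


def ground(question: str, segments: List[Segment], scorer: Scorer,
           threshold: float) -> List[Segment]:
    """Keep segments scoring at or above the threshold, in temporal order."""
    kept = [s for s in segments if scorer(question, s) >= threshold]
    return sorted(kept, key=lambda s: s.start_s)


def ground_with_retries(question: str, segments: List[Segment], scorer: Scorer,
                        threshold: float = 0.5, max_rounds: int = 3) -> List[Segment]:
    """Assumed placeholder for reflection: if no evidence is found, halve the
    threshold and try again, up to max_rounds attempts."""
    for _ in range(max_rounds):
        evidence = ground(question, segments, scorer, threshold)
        if evidence:
            return evidence
        threshold /= 2
    return []
```

In a real system the scorer would call a VLM on sampled frames rather than compare caption words; because the scorer is passed in as a plain callable, the selection logic stays unchanged when that substitution is made.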
Problem

Research questions and friction points this paper is trying to address.

long-video understanding
multi-agent system
temporal reasoning
information density
complex task decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent system
long-video understanding
cognitive-inspired reasoning
reflection-based collaboration
VLM grounding
Authors

Haiyang Yan
Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences
Hongyun Zhou
Harbin Institute of Technology Master
PEFT, machine translation, LLM
Peng Xu
Bytedance
Large Recommendation System
Xiaoxue Feng
Kuaishou Technology
Mengyi Liu
PhD, Institute of Computing Technology, Chinese Academy of Sciences
computer vision and pattern recognition