Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

📅 2026-05-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle with long-form video understanding because their perceptual context budgets are bounded, and conventional rule-based preprocessing tends to lose information while depending on textual intermediates. To address this, the work proposes MACF, an end-to-end multi-agent collaboration framework that partitions a video into segments, each processed in parallel by an agent operating under a local perception budget. Each agent encodes its partial observations into compact, task-sufficient tokens in a shared embedding space, and a central coordinator reasons over these tokens through a latent communication protocol, which decouples the per-agent perception budget from overall video complexity. Trained with a progressive curriculum, MACF outperforms state-of-the-art MLLMs and multi-agent systems on multiple video understanding benchmarks under identical budget constraints.
📝 Abstract
Multi-modal large language models (MLLMs) advance vision-language understanding but face inherent limitations in long-video tasks due to bounded perception context budgets. Existing agentic methods mitigate this via rule-based preprocessing, yet often suffer from information loss, high cost, and reliance on textual intermediates. We propose MACF, an end-to-end Multi-Agent Collaboration Framework that decouples per-agent perception budgets from global video complexity, enabling scalable video understanding while preserving visual fidelity. MACF partitions videos into segments for locally budgeted agents and enables holistic reasoning via an agent-native latent communication protocol. Each agent encodes partial observations into compact, task-sufficient tokens in a shared embedding space, allowing efficient and information-preserving collaboration through a central coordinator. We introduce a curriculum training strategy that progressively enforces semantic alignment, evidence summarization, and cross-agent coordination. Extensive experiments on diverse video understanding benchmarks show that MACF consistently outperforms state-of-the-art MLLMs and multi-agent systems under identical budget constraints, demonstrating the effectiveness of latent collaboration for scalable video understanding.
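
The data flow described in the abstract (partition, per-agent encoding into compact tokens, central coordination) can be illustrated with a toy sketch. This is not MACF's implementation: the function names `partition`, `encode_segment`, and `coordinate`, the parameters `budget` and `k`, and the use of chunk averaging as the "encoder" are all assumptions made for illustration, standing in for learned models. The sketch only shows how the coordinator's input grows with the number of agents times `k`, while each agent never sees more than `budget` frames.

```python
from typing import List, Sequence

Vector = List[float]


def partition(frames: Sequence[Vector], budget: int) -> List[Sequence[Vector]]:
    """Split frames into contiguous segments of at most `budget` frames each."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [frames[i:i + budget] for i in range(0, len(frames), budget)]


def encode_segment(segment: Sequence[Vector], k: int) -> List[Vector]:
    """Hypothetical agent: compress one segment into at most `k` tokens.

    Each token is the mean of a contiguous chunk of frames. A real agent
    would be a learned encoder; averaging is only a placeholder.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    n = len(segment)
    num_tokens = min(k, n)
    tokens: List[Vector] = []
    for t in range(num_tokens):
        # Integer arithmetic yields non-empty, near-equal chunks.
        start = t * n // num_tokens
        end = (t + 1) * n // num_tokens
        chunk = segment[start:end]
        dim = len(chunk[0])
        tokens.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return tokens


def coordinate(per_agent_tokens: Sequence[List[Vector]]) -> List[Vector]:
    """Hypothetical coordinator input: agent tokens concatenated in temporal order."""
    return [tok for tokens in per_agent_tokens for tok in tokens]


if __name__ == "__main__":
    # Ten 2-dimensional "frame features"; values are arbitrary.
    frames = [[float(i), 1.0] for i in range(10)]
    segments = partition(frames, budget=4)
    latent = coordinate([encode_segment(s, k=2) for s in segments])
    print(len(segments), "agents ->", len(latent), "coordinator tokens")
```

In this toy setting, doubling the video length doubles the number of agents and coordinator tokens but leaves each agent's input capped at `budget` frames, which is the decoupling the abstract describes.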
Problem

Research questions and friction points this paper is trying to address.

video understanding
perception context budget
multi-agent collaboration
long-video tasks
information loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Collaboration
Latent Communication
Video Understanding
Perception Budget Decoupling
Compact Token Encoding