SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing video question answering methods, which predominantly rely on localized frame-level reasoning and struggle to capture coherent narrative structures. To overcome this, we propose SVAgent, a novel storyline-guided multi-agent framework in which multiple agents collaboratively construct dynamic narrative representations. A meta-agent aligns predictions across visual and textual modalities to enforce cross-modal consistency, enabling holistic reasoning that mirrors human comprehension grounded in storylines. By integrating historical error analysis with a meta-decision mechanism, our approach significantly improves both performance and interpretability on standard video QA benchmarks, offering a more cognitively plausible model of narrative-based inference.
📝 Abstract
Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.
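The abstract describes an iterative loop among four agents: a refinement suggestion agent picks frames based on past failures, a storyline agent extends a narrative representation, two modality-specific decision agents answer independently, and a meta-agent accepts an answer only when the modalities agree. The following is a minimal, hypothetical sketch of that control flow; the paper publishes no code, so every function here is a toy stand-in (in practice each agent would wrap an LLM or VLM call), and all names are illustrative assumptions.

```python
# Toy sketch of the SVAgent-style control loop described in the abstract.
# All agents are stand-ins; real agents would query vision/language models.

def refinement_agent(failure_log, num_frames, k=4):
    # Suggest frame indices to inspect next, avoiding frames
    # already associated with failed reasoning rounds.
    avoid = {f for rnd in failure_log for f in rnd}
    return [i for i in range(num_frames) if i not in avoid][:k]

def storyline_agent(storyline, frames):
    # Progressively extend the narrative with evidence from new frames.
    return storyline + [f"event@frame{f}" for f in frames]

def visual_agent(storyline, question):
    # Toy visual-modality decision conditioned on the storyline.
    return "A" if len(storyline) >= 4 else "B"

def text_agent(storyline, question):
    # Toy textual-modality decision conditioned on the storyline.
    return "A" if storyline else "B"

def meta_agent(ans_visual, ans_text):
    # Enforce cross-modal consistency: accept only on agreement.
    return ans_visual if ans_visual == ans_text else None

def svagent_answer(question, num_frames=16, max_rounds=3):
    storyline, failures = [], []
    for _ in range(max_rounds):
        frames = refinement_agent(failures, num_frames)
        storyline = storyline_agent(storyline, frames)
        ans_v = visual_agent(storyline, question)
        ans_t = text_agent(storyline, question)
        final = meta_agent(ans_v, ans_t)
        if final is not None:
            return final
        failures.append(frames)  # log the round for the refinement agent
    return ans_v  # fall back to the visual prediction
```

The key design point this sketch illustrates is that disagreement between modalities does not end the loop; it is logged as a failure and fed back to the refinement agent, which then proposes different frames for the next round.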
Problem

Research questions and friction points this paper is trying to address.

Video Question Answering
Storyline Reasoning
Long Video Understanding
Cross-Modal Reasoning
Temporal Dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Storyline-Guided Reasoning
Cross-Modal Multi-Agent Collaboration
Video Question Answering
Narrative Representation
Meta-Agent Alignment