VideoMultiAgents: A Multi-Agent Framework for Video Question Answering

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video question answering (VQA) methods predominantly rely on single-model frame-level captioning, limiting fine-grained modeling of visual, temporal, and linguistic interactions. To address this, we propose a multi-agent collaborative architecture comprising specialized agents for vision, scene graphs, and text, enabling modular, task-specific reasoning. We further introduce a query-guided dynamic captioning mechanism that selectively generates captions focused on question-relevant objects, actions, and temporal evolution. Our method integrates scene graph parsing, vision-language joint encoding, and query-driven temporal modeling. Evaluated on Intent-QA (79.0%), an EgoSchema subset (75.4%), and NExT-QA (79.6%), our approach achieves state-of-the-art performance, improving accuracy by up to 6.2% over prior work, and significantly enhances temporal understanding and multimodal collaborative reasoning in VQA.

📝 Abstract
Video Question Answering (VQA) inherently relies on multimodal reasoning, integrating visual, temporal, and linguistic cues to achieve a deeper understanding of video content. However, many existing methods rely on feeding frame-level captions into a single model, making it difficult to adequately capture temporal and interactive contexts. To address this limitation, we introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. It enhances video understanding by leveraging complementary multimodal reasoning from independently operating agents. Our approach is further supplemented with question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions directly relevant to a given query, thus improving answer accuracy. Experimental results demonstrate that our method achieves state-of-the-art performance on Intent-QA (79.0%, +6.2% over previous SOTA), EgoSchema subset (75.4%, +3.4%), and NExT-QA (79.6%, +0.4%).
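The abstract's architecture maps naturally onto a small orchestration loop: independent modality agents each answer the question from their own evidence, and an organizer reconciles their answers. The sketch below is a hypothetical Python outline, not the authors' implementation; it assumes a generic `llm` callable (prompt in, completion out), and the names `ModalityAgent`, `multi_agent_vqa`, and `organizer` are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical LLM interface: any callable mapping a prompt string to a
# completion string (e.g. a thin wrapper around an API client).
LLM = Callable[[str], str]

@dataclass
class ModalityAgent:
    name: str      # e.g. "vision", "scene_graph", "text"
    context: str   # modality-specific evidence (captions, scene graph, ...)
    llm: LLM

    def answer(self, question: str, choices: List[str]) -> str:
        # Each agent sees only its own modality's evidence.
        prompt = (
            f"You are the {self.name} agent for video question answering.\n"
            f"Evidence:\n{self.context}\n\n"
            f"Question: {question}\nChoices: {choices}\n"
            "Reply with the single best choice and a one-line justification."
        )
        return self.llm(prompt)

def multi_agent_vqa(agents: List[ModalityAgent], question: str,
                    choices: List[str], organizer: LLM) -> str:
    # Agents reason independently, in the spirit of the paper's
    # complementary multimodal reasoning.
    votes: Dict[str, str] = {a.name: a.answer(question, choices)
                             for a in agents}
    # An organizer agent reconciles the (possibly conflicting) answers.
    report = "\n".join(f"- {name}: {ans}" for name, ans in votes.items())
    return organizer(
        f"Question: {question}\nChoices: {choices}\n"
        f"Per-agent answers:\n{report}\n"
        "Select the final answer, weighing the agents' evidence."
    )
```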
Problem

Research questions and friction points this paper is trying to address.

Enhancing video understanding through multimodal reasoning agents
Improving temporal and interactive context capture in VQA
Boosting answer accuracy via question-guided caption generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework for multimodal video reasoning
Specialized agents for vision, scene, and text
Question-guided caption generation to improve answer accuracy (see the sketch after this list)
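One way to realize question-guided caption generation is as prompt construction for a vision-language captioner: instead of captioning frames generically, the captioner is told which question the captions must serve. The helper below is a minimal sketch under that assumption; the function name and prompt wording are hypothetical, not taken from the paper.

```python
def question_guided_caption_prompt(question: str, clip_description: str) -> str:
    """Build a captioning prompt conditioned on the query (hypothetical helper).

    The captioner is steered to foreground only the objects, actions, and
    temporal transitions that matter for answering the question.
    """
    return (
        "You are captioning a video clip for question answering.\n"
        f"Question to answer later: '{question}'\n"
        f"Clip context: {clip_description}\n"
        "Write a caption focusing on the objects, actions, and temporal "
        "changes relevant to this question; omit unrelated details."
    )

# Example usage with a hypothetical question:
print(question_guided_caption_prompt(
    "Why does the man open the drawer?",
    "kitchen scene, frames sampled at 1 fps",
))
```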
Noriyuki Kugo
Panasonic Connect Co., Ltd.
Xiang Li
Stanford University
Zixin Li
Stanford University
Ashish Gupta
Panasonic Connect Co., Ltd.
Arpandeep Khatua
Stanford University
Nidhish Jain
Stanford University
Chaitanya Patel
Stanford University
Yuta Kyuragi
Panasonic R&D Company of America
Masamoto Tanabiki
Panasonic Connect Co., Ltd.
Kazuki Kozuka
Panasonic Holdings Corporation, Kyoto University
Ehsan Adeli
Stanford University