MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

📅 2024-04-09
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 16
Influential: 4
🤖 AI Summary
Existing single-stage planning methods for video question answering (videoQA) suffer from poor robustness, weak visual grounding, and limited interpretability. To address these issues, this paper proposes a training-free, multi-stage modular reasoning framework that decomposes the task into three sequential phases: event-structure parsing, visual content grounding, and final answer inference—each implemented via few-shot prompting of large language models (LLMs) or multimodal LMs, without fine-tuning. By aligning reasoning steps with cognitive hierarchies, our approach explicitly couples high-level planning with low-level visual evidence—an innovation not achieved by prior methods. Evaluated on NExT-QA, iVQA, EgoSchema, and ActivityNet-QA, it achieves state-of-the-art performance. Moreover, it generalizes successfully to grounded videoQA and paragraph-level video captioning, demonstrating substantial improvements in generalization, robustness, and interpretability.

📝 Abstract
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
Problem

Research questions and friction points this paper is trying to address.

Decomposing videoQA into multi-stage modular reasoning
Addressing brittleness in single-stage planning methods
Improving interpretability via training-free few-shot prompting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage modular reasoning framework
Training-free few-shot prompting technique
External memory supporting interpretable intermediate outputs
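The three-stage pipeline described above (event parsing, grounding, reasoning, coordinated through an external memory) can be sketched in code. This is a minimal illustration only: the actual prompts, model interfaces, and memory schema used by MoReVQA are not specified here, so `call_model` and all stage logic below are hypothetical placeholders.

```python
def call_model(prompt: str) -> str:
    """Stub standing in for few-shot prompting of a large (multimodal) model.

    Assumption: a real system would query an LLM/VLM API here; we return a
    canned string so the pipeline structure can be run end to end.
    """
    return f"[model output for: {prompt[:40]}]"


def event_parser(question: str, memory: dict) -> None:
    """Stage 1: parse the question into an event structure (training-free)."""
    memory["events"] = call_model(f"Parse the events in: {question}")


def grounding(video_id: str, memory: dict) -> None:
    """Stage 2: ground the parsed events in the video's visual content."""
    memory["evidence"] = call_model(
        f"Ground events {memory['events']} in video {video_id}"
    )


def reasoning(question: str, memory: dict) -> str:
    """Stage 3: produce the final answer from the accumulated evidence."""
    return call_model(
        f"Answer '{question}' using evidence {memory['evidence']}"
    )


def morevqa_pipeline(question: str, video_id: str) -> str:
    memory: dict = {}  # external memory shared across all stages
    event_parser(question, memory)
    grounding(video_id, memory)
    return reasoning(question, memory)
```

Each stage writes its interpretable intermediate output into the shared memory, which is what lets the final reasoning stage condition on explicit, inspectable evidence rather than on a single opaque plan.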
🔎 Similar Papers
2024-08-08 · International Journal of Computer Vision · Citations: 13
2024-02-20 · International Conference on Machine Learning · Citations: 30