ReasVQA: Advancing VideoQA with Imperfect Reasoning Process

📅 2025-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video question answering (VideoQA) faces key bottlenecks: modeling complex spatiotemporal dynamics, reliance on high-quality manual annotations, and dependence on idealized, error-free reasoning paths. Method: This paper proposes a three-stage reasoning-enhanced paradigm: (1) generating initial reasoning chains using multimodal large language models (MLLMs); (2) selecting informative yet imperfect intermediate reasoning steps via confidence- and consistency-based filtering; and (3) end-to-end optimization of a video–language joint encoder trained with multi-task learning, jointly predicting answers and modeling reasoning paths. Contribution/Results: It is the first work to systematically leverage imperfect MLLM-generated reasoning as weak supervision for VideoQA, eliminating the need for precise annotations or perfect reasoning traces. The method achieves new state-of-the-art results on NExT-QA (+2.9%), STAR (+7.3%), and IntentQA (+5.9%), demonstrating both the effectiveness and generalizability of reasoning-aware supervision in VideoQA.
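The consistency-based filtering in stage (2) can be sketched as below. This is a minimal illustration, not the paper's implementation: the function name, the (reasoning, answer) pair format, and the agreement threshold are all hypothetical assumptions.

```python
from collections import Counter

def filter_reasoning_chains(chains, gold_answer, min_agreement=0.5):
    """Consistency-based filter over MLLM-generated reasoning chains (sketch).

    `chains` is a list of (reasoning_text, predicted_answer) pairs sampled
    from an MLLM for one question. Self-agreement across samples serves as a
    rough confidence proxy; chains ending in the correct answer are kept.
    All names and the threshold are illustrative, not from the paper.
    """
    if not chains:
        return []
    # Majority vote over predicted answers approximates model confidence.
    votes = Counter(answer for _, answer in chains)
    _, top_count = votes.most_common(1)[0]
    agreement = top_count / len(chains)
    # Discard the whole sample set if self-agreement is too low.
    if agreement < min_agreement:
        return []
    # Keep imperfect-but-informative chains that reach the gold answer.
    return [text for text, answer in chains if answer == gold_answer]
```

The retained chains would then serve as weak supervision targets for the reasoning-modeling head in the multi-task stage (3).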

📝 Abstract
Video Question Answering (VideoQA) is a challenging task that requires understanding complex visual and temporal relationships within videos to answer questions accurately. In this work, we introduce ReasVQA (Reasoning-enhanced Video Question Answering), a novel approach that leverages reasoning processes generated by Multimodal Large Language Models (MLLMs) to improve the performance of VideoQA models. Our approach consists of three phases: reasoning generation, reasoning refinement, and learning from reasoning. First, we generate detailed reasoning processes using additional MLLMs; second, we refine them via a filtering step to ensure data quality. Finally, we use the reasoning data, which may be imperfect, to guide the VideoQA model via multi-task learning in how to interpret and answer questions based on a given video. We evaluate ReasVQA on three popular benchmarks, and our results establish new state-of-the-art performance with significant improvements of +2.9 on NExT-QA, +7.3 on STAR, and +5.9 on IntentQA. Our findings demonstrate the supervisory benefits of integrating reasoning processes into VideoQA. Further studies validate each component of our method, including with different backbones and MLLMs, and again highlight the advantages of this simple but effective method. We offer a new perspective on enhancing VideoQA performance by utilizing advanced reasoning techniques, setting a new benchmark in this research field.
Problem

Research questions and friction points this paper is trying to address.

VideoQA
Complex Visual Understanding
Temporal Relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Video Question Answering
Enhanced Reasoning Process
🔎 Similar Papers
2024-08-08 · International Journal of Computer Vision · Citations: 13
2024-04-09 · Computer Vision and Pattern Recognition · Citations: 16