MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the problem of language priors dominating video question answering (VQA) models at the expense of visual evidence, this paper proposes a multi-agent collaborative reasoning framework. It coordinates a temporally aligned grounding agent and a QA agent along multiple reasoning paths, and introduces a reflection agent that critically evaluates and fuses the cross-path outputs, thereby tightly coupling answer prediction with visual grounding. Built on 2B- and 7B-parameter multimodal backbones, the framework combines video grounding, multi-path reasoning, and reflective aggregation modules. On NExT-GQA and DeVE-QA, it achieves 30.3% and 47.4% Acc@GQA, respectively, outperforming all existing 7B-scale models and establishing new state-of-the-art results, while markedly improving grounding fidelity and the interpretability of its predictions.
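
The multi-path design can be pictured as three agent pipelines whose candidate outputs a reflection agent then judges. Below is a minimal Python sketch of that control flow; the agent callables, the exact path orders, and the confidence handling are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of MUPA-style multi-path agentic reasoning.
# All names here (grounder, qa, grounded_qa, PathResult) are
# hypothetical: the paper specifies only that grounding and QA agents
# interact in different chronological orders across three paths.
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[float, float]  # grounded evidence window (start_sec, end_sec)

@dataclass
class PathResult:
    answer: str
    interval: Interval
    confidence: float

def ground_then_answer(video, question, grounder, qa) -> PathResult:
    # Path 1: localize the evidence first, then answer from the clip.
    interval, g_conf = grounder(video, question)
    answer, a_conf = qa(video, question, interval)
    return PathResult(answer, interval, g_conf * a_conf)

def answer_then_ground(video, question, grounder, qa) -> PathResult:
    # Path 2: answer from the full video, then ground that answer.
    answer, a_conf = qa(video, question, None)
    interval, g_conf = grounder(video, f"{question} [answer: {answer}]")
    return PathResult(answer, interval, a_conf * g_conf)

def answer_and_ground(video, question, grounded_qa) -> PathResult:
    # Path 3: a single grounded-QA pass emitting answer and window jointly.
    answer, interval, conf = grounded_qa(video, question)
    return PathResult(answer, interval, conf)

def run_paths(video, question, grounder, qa, grounded_qa) -> List[PathResult]:
    return [
        ground_then_answer(video, question, grounder, qa),
        answer_then_ground(video, question, grounder, qa),
        answer_and_ground(video, question, grounded_qa),
    ]
```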

📝 Abstract
Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection, and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths that interleave grounding and QA agents in different chronological orders, along with a dedicated reflection agent that judges and aggregates the multi-path results to produce consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA establishes new state-of-the-art results, with Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, demonstrating MUPA's effectiveness towards trustworthy video-language understanding. Our code is available at https://github.com/longmalongma/MUPA.
Problem

Research questions and friction points this paper is trying to address.

Improves grounding fidelity in video question answering
Reduces reliance on linguistic priors and spurious correlations
Unifies video grounding, QA, reflection, and aggregation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-path agentic reasoning for video QA
Unified video grounding and QA agents
Reflection agent judges and aggregates multi-path results (see the sketch after this list)
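
As a rough illustration of the reflection step, the sketch below scores each path's (answer, interval) candidate with a judge and fuses the intervals of agreeing paths. The `judge` callable and the union-style fusion are assumptions; the paper states only that the reflection agent judges and aggregates the multi-path results.

```python
# Illustrative reflect-and-aggregate step, continuing the sketch above.
# `judge` is a hypothetical callable that scores how well a candidate's
# answer is supported by its grounded interval; the union-style interval
# fusion is likewise an assumption.
from typing import Callable, List, Tuple

def reflect_and_aggregate(
    results: List["PathResult"],
    judge: Callable[["PathResult"], float],
) -> Tuple[str, Tuple[float, float]]:
    scored = [(judge(r), r) for r in results]
    _, best = max(scored, key=lambda sr: sr[0])
    # Fuse the windows of every path that agrees with the judged-best answer.
    agreeing = [r for _, r in scored if r.answer == best.answer]
    start = min(r.interval[0] for r in agreeing)
    end = max(r.interval[1] for r in agreeing)
    return best.answer, (start, end)
```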
Jisheng Dang
Lanzhou University, National University of Singapore
Huilin Song
Sun Yat-sen University
Junbin Xiao
National University of Singapore
Video and Language, Embodied Interaction, Trustworthy Multimodality
Bimei Wang
Jinan University
Han Peng
Nanyang Technological University
Haoxuan Li
Peking University
Xun Yang
University of Science and Technology of China
Meng Wang
Hefei University of Technology
Tat-Seng Chua
National University of Singapore