Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing VideoQA models rely on shallow supervision signals from isolated question-answer pairs, limiting their ability to model the narrative logic and causal structure of video events. To address this, we propose a question-driven narrativized supervision paradigm: leveraging Question-Based Paraphrasing (QBP) and Question-Based Captioning (QBC), we reconstruct discrete QA pairs into coherent narrative paragraphs grounded in fine-grained visual evidence. Models are then trained end-to-end on the resulting narratives within a unified next-token prediction framework. This approach elevates video understanding supervision from a "collection of facts" to a "structured narrative" for the first time, substantially enhancing models' capacity to capture deep event semantics. Our method achieves new state-of-the-art results on STAR and NExT-QA: a 3B-parameter model improves accuracy on STAR by 4.9 points to 72.5%, while a 7B model attains 80.8% on NExT-QA. It also demonstrates improved cross-dataset generalization and faster training convergence.

📝 Abstract
The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This "bag-of-facts" approach fails to capture the underlying narrative and causal structure of events, limiting models to a shallow understanding of video content. To move beyond this paradigm, we introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Question-Based Paraphrasing (QBP), which synthesizes the diverse inquiries (what, how, why) from a video's existing set of question-answer pairs into a holistic narrative paragraph that reconstructs the video's event structure; and Question-Based Captioning (QBC), which generates fine-grained visual rationales, grounding the answer to each question in specific, relevant evidence. Leveraging powerful generative models, we use this synthetic data to train VideoQA models under a unified next-token prediction objective. Extensive experiments on STAR and NExT-QA validate our approach, demonstrating significant accuracy gains and establishing new state-of-the-art results, such as improving a 3B model to 72.5% on STAR (+4.9%) and a 7B model to 80.8% on NExT-QA. Beyond accuracy, our analysis reveals that both QBP and QBC substantially enhance cross-dataset generalization, with QBP additionally accelerating model convergence by over 2.5x. These results demonstrate that shifting data synthesis from isolated facts to narrative coherence and grounded rationales yields a more accurate, efficient, and generalizable training paradigm.
Problem

Research questions and friction points this paper is trying to address.

VideoQA models lack narrative understanding due to isolated factual supervision
Current approaches fail to capture causal event structure in videos
Models need better grounding in visual evidence and holistic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes narrative paragraphs from question-answer pairs
Generates fine-grained visual rationales for grounding answers
Trains models using synthetic data for next-token prediction
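To make the data-synthesis idea concrete, here is a minimal sketch of how a video's isolated QA pairs might be reassembled into QBP- and QBC-style training strings for next-token prediction. This is an illustrative assumption based on the abstract's description, not the authors' code; in the paper a powerful generative model writes the narrative and rationales, whereas here the text is simply concatenated to show the shape of the resulting supervision.

```python
# Illustrative sketch only: names like build_qbp_sample/build_qbc_sample
# and the string templates are assumptions, not the authors' pipeline.

def build_qbp_sample(qa_pairs):
    """Question-Based Paraphrasing (QBP): fold a video's QA pairs into
    one narrative paragraph used as a next-token-prediction target."""
    facts = [f"{q.rstrip('?')}: {a}" for q, a in qa_pairs]
    # The paper uses a generative model to produce a coherent narrative;
    # joining the facts here just shows the form of the training text.
    return "Narrative: " + ". ".join(facts) + "."

def build_qbc_sample(question, answer, evidence):
    """Question-Based Captioning (QBC): ground one answer in a
    fine-grained visual rationale (`evidence` is a stand-in caption)."""
    return (f"Question: {question}\nEvidence: {evidence}\n"
            f"Answer: {answer}")

qa = [("What does the person pick up?", "a cup"),
      ("Why does she open the cabinet?", "to find a plate")]
print(build_qbp_sample(qa))
print(build_qbc_sample(*qa[0],
                       evidence="a hand reaches toward a white cup"))
```

Both sample types are plain text, so a single next-token prediction objective can consume QA pairs, QBP narratives, and QBC rationales in one unified training mix.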
Jianxin Liang
Wangxuan Institute of Computer Technology, Peking University
Tan Yue
Wangxuan Institute of Computer Technology, Peking University
Yuxuan Wang
Wangxuan Institute of Computer Technology, Peking University
Yueqian Wang
Peking University
Multimodal Pre-trained Models
Zhihan Yin
Wangxuan Institute of Computer Technology, Peking University
Huishuai Zhang
Peking University
Deep Learning · Optimization · Information Theory
Dongyan Zhao
Peking University
Natural Language Processing · Semantic Data Management · QA · Dialogue System