Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing VideoQA models rely on shallow supervision signals from isolated question-answer pairs, limiting their ability to model the narrative logic and causal structure of video events. To address this, we propose a question-driven narrativized supervision paradigm: leveraging Question-Based Paraphrasing (QBP) and Question-Based Captioning (QBC), we reconstruct discrete QA pairs into coherent narrative paragraphs grounded in fine-grained visual evidence. Models are then trained end-to-end on the resulting narratives within a unified next-token prediction framework. This approach elevates video understanding supervision from a "collection of facts" to a "structured narrative" for the first time, substantially enhancing models' capacity to capture deep event semantics. Our method achieves new state-of-the-art results on STAR and NExT-QA: a 3B-parameter model improves accuracy on STAR by 4.9 points to 72.5%, while a 7B model attains 80.8% on NExT-QA. It also demonstrates improved cross-dataset generalization and faster training convergence.

📝 Abstract
The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This "bag-of-facts" approach fails to capture the underlying narrative and causal structure of events, limiting models to a shallow understanding of video content. To move beyond this paradigm, we introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Question-Based Paraphrasing (QBP), which synthesizes the diverse inquiries (what, how, why) from a video's existing set of question-answer pairs into a holistic narrative paragraph that reconstructs the video's event structure; and Question-Based Captioning (QBC), which generates fine-grained visual rationales, grounding the answer to each question in specific, relevant evidence. Leveraging powerful generative models, we use this synthetic data to train VideoQA models under a unified next-token prediction objective. Extensive experiments on STAR and NExT-QA validate our approach, demonstrating significant accuracy gains and establishing new state-of-the-art results, such as improving a 3B model to 72.5% on STAR (+4.9%) and a 7B model to 80.8% on NExT-QA. Beyond accuracy, our analysis reveals that both QBP and QBC substantially enhance cross-dataset generalization, with QBP additionally accelerating model convergence by over 2.5x. These results demonstrate that shifting data synthesis from isolated facts to narrative coherence and grounded rationales yields a more accurate, efficient, and generalizable training paradigm.
Problem

Research questions and friction points this paper is trying to address.

VideoQA models lack narrative understanding due to isolated factual supervision
Current approaches fail to capture causal event structure in videos
Models need better grounding in visual evidence and holistic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes narrative paragraphs from question-answer pairs
Generates fine-grained visual rationales for grounding answers
Trains models using synthetic data for next-token prediction
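To make the data-synthesis idea concrete, here is a minimal sketch of how a video's isolated QA pairs might be reassembled into QBP- and QBC-style training strings for next-token prediction. This is an illustrative assumption based on the abstract's description, not the authors' code; in the paper a powerful generative model writes the narrative and rationales, whereas here the text is simply concatenated to show the shape of the resulting supervision.

```python
# Illustrative sketch only: names like build_qbp_sample/build_qbc_sample
# and the string templates are assumptions, not the authors' pipeline.

def build_qbp_sample(qa_pairs):
    """Question-Based Paraphrasing (QBP): fold a video's QA pairs into
    one narrative paragraph used as a next-token-prediction target."""
    facts = [f"{q.rstrip('?')}: {a}" for q, a in qa_pairs]
    # The paper uses a generative model to produce a coherent narrative;
    # joining the facts here just shows the form of the training text.
    return "Narrative: " + ". ".join(facts) + "."

def build_qbc_sample(question, answer, evidence):
    """Question-Based Captioning (QBC): ground one answer in a
    fine-grained visual rationale (`evidence` is a stand-in caption)."""
    return (f"Question: {question}\nEvidence: {evidence}\n"
            f"Answer: {answer}")

qa = [("What does the person pick up?", "a cup"),
      ("Why does she open the cabinet?", "to find a plate")]
print(build_qbp_sample(qa))
print(build_qbc_sample(*qa[0],
                       evidence="a hand reaches toward a white cup"))
```

Both sample types are plain text, so a single next-token prediction objective can consume QA pairs, QBP narratives, and QBC rationales in one unified training mix.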
Jianxin Liang
Wangxuan Institute of Computer Technology, Peking University
Tan Yue
Wangxuan Institute of Computer Technology, Peking University
Yuxuan Wang
Wangxuan Institute of Computer Technology, Peking University
Yueqian Wang
Peking University
Multimodal Pre-trained Models
Zhihan Yin
Wangxuan Institute of Computer Technology, Peking University
Huishuai Zhang
Peking University
Deep Learning · Optimization · Information Theory
Dongyan Zhao
Peking University
Natural Language Processing · Semantic Data Management · QA · Dialogue System