🤖 AI Summary
Existing VideoQA models rely on shallow supervision signals from isolated question-answer pairs, limiting their ability to model the narrative logic and causal structure of video events. To address this, we propose a question-driven narrativized supervision paradigm: leveraging Question-Based Paraphrasing (QBP) and Question-Based Captioning (QBC), we reconstruct discrete QA pairs into coherent narrative paragraphs grounded in fine-grained visual evidence. The resulting narratives are used to train models end-to-end within a unified next-token prediction framework. This approach elevates video understanding supervision from a “collection of facts” to a “structured narrative” for the first time, substantially enhancing models’ capacity to capture deep event semantics. Our method achieves new state-of-the-art results on STAR and NExT-QA: a 3B-parameter model improves accuracy on STAR by 4.9 points to 72.5%, while a 7B model attains 80.8% on NExT-QA. It also demonstrates improved cross-dataset generalization and faster training convergence.
📝 Abstract
The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This "bag-of-facts" approach fails to capture the underlying narrative and causal structure of events, limiting models to a shallow understanding of video content. To move beyond this paradigm, we introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Question-Based Paraphrasing (QBP), which synthesizes the diverse inquiries (what, how, why) from a video's existing set of question-answer pairs into a holistic narrative paragraph that reconstructs the video's event structure; and Question-Based Captioning (QBC), which generates fine-grained visual rationales, grounding the answer to each question in specific, relevant evidence. Leveraging powerful generative models, we use this synthetic data to train VideoQA models under a unified next-token prediction objective. Extensive experiments on STAR and NExT-QA validate our approach, demonstrating significant accuracy gains and establishing new state-of-the-art results, such as improving a 3B model to 72.5% on STAR (+4.9 points) and a 7B model to 80.8% on NExT-QA. Beyond accuracy, our analysis reveals that both QBP and QBC substantially enhance cross-dataset generalization, with QBP additionally accelerating model convergence by over 2.5x. These results demonstrate that shifting data synthesis from isolated facts to narrative coherence and grounded rationales yields a more accurate, efficient, and generalizable training paradigm.
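To make the data pipeline concrete, here is a minimal, hypothetical sketch of the QBP-style sample construction described above: a video's QA pairs are folded into one prompt requesting a coherent narrative paragraph, and the training labels are masked so the unified next-token loss is applied only to the narrative tokens (the `-100` ignore index is a common convention in such training loops). All function names, the prompt wording, and the toy token IDs are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of Question-Based Paraphrasing (QBP) data construction.
# Not the paper's code: prompt text, function names, and token IDs are made up.

IGNORE_INDEX = -100  # label value commonly skipped by next-token losses


def build_qbp_prompt(qa_pairs):
    """Fold a video's discrete QA pairs into one narrative-generation prompt."""
    lines = [f"Q: {q} A: {a}" for q, a in qa_pairs]
    return (
        "Rewrite the following question-answer pairs about the video "
        "into one coherent narrative paragraph:\n"
        + "\n".join(lines)
        + "\nNarrative:"
    )


def make_training_example(prompt_ids, narrative_ids):
    """Unified next-token objective: supervise only the narrative span.

    Prompt positions get IGNORE_INDEX so the loss ignores them; the model
    is trained to generate the narrative paragraph conditioned on the prompt.
    """
    input_ids = list(prompt_ids) + list(narrative_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(narrative_ids)
    return input_ids, labels


# Toy usage with made-up QA pairs and token IDs.
qa = [
    ("What did the person open?", "A cabinet"),
    ("Why did they open it?", "To take out a cup"),
]
prompt = build_qbp_prompt(qa)
input_ids, labels = make_training_example([101, 102, 103], [7, 8, 9])
```

In a real setup the prompt would be tokenized by the model's tokenizer and the narrative paragraph would come from a generative teacher model; the masking pattern is what implements "one objective, narrative supervision" end to end.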