📝 Abstract
In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets exhibit gaps in temporal and spatial granularity, which in turn limit the capabilities of current VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground-truth scene graphs and temporal interval annotations, MOMA-QA is well suited for developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.