📝 Abstract
In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets exhibit gaps in temporal and spatial granularity, which in turn limit the capabilities of current VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground-truth scene graphs and temporal interval annotations, MOMA-QA is well suited for developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.