Object-centric Video Question Answering with Visual Grounding and Referring

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Video Large Language Models (VideoLLMs) are limited to text-only outputs and high-level semantic understanding, which hinders object-level, multi-turn interactive video reasoning. This work introduces a VideoLLM that supports visual prompts as input and grounded (segmentation) outputs, enabling object-centric, fine-grained spatiotemporal reasoning. The method combines visual grounding, object localization, spatiotemporal modeling, and instruction tuning to achieve cross-frame object perception and joint multimodal reasoning. Key contributions: (1) a Spatial-Temporal Overlay Module (STOM) that propagates a visual prompt given at an arbitrary timestamp to the remaining frames; (2) VideoInfer, an object-centric video question-answering benchmark; and (3) a comprehensive evaluation showing consistent improvements over state-of-the-art methods across six tasks on twelve benchmarks, with notable gains in video question answering and referring expression segmentation.
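
As a rough illustration of the prompt-propagation idea behind STOM, the sketch below (hypothetical names; not the authors' implementation) propagates a binary visual prompt given at a single timestamp to every other frame via a pluggable tracker, then alpha-blends the highlight onto each frame before the video is passed to the model.

```python
import numpy as np

def propagate_prompt(frames, prompt_mask, t0, track_fn):
    """Propagate a binary prompt mask from frame t0 to all other frames."""
    masks = [None] * len(frames)
    masks[t0] = prompt_mask
    for t in range(t0 + 1, len(frames)):      # forward in time
        masks[t] = track_fn(frames[t - 1], frames[t], masks[t - 1])
    for t in range(t0 - 1, -1, -1):           # backward in time
        masks[t] = track_fn(frames[t + 1], frames[t], masks[t + 1])
    return masks

def overlay(frame, mask, color=(255, 0, 0), alpha=0.5):
    """Alpha-blend a highlight colour onto the masked region of one frame."""
    out = frame.astype(np.float32)
    out[mask] = (1 - alpha) * out[mask] + alpha * np.asarray(color, dtype=np.float32)
    return out.astype(np.uint8)

# Toy run with an identity "tracker" that reuses the previous mask; a real system
# would plug in an off-the-shelf video tracker or propagation model here.
frames = [np.zeros((4, 4, 3), dtype=np.uint8) for _ in range(3)]
prompt = np.zeros((4, 4), dtype=bool)
prompt[1:3, 1:3] = True
masks = propagate_prompt(frames, prompt, t0=1, track_fn=lambda prev, cur, m: m)
overlaid = [overlay(f, m) for f, m in zip(frames, masks)]
```

Keeping the tracker as a pluggable function reflects that the overlay step itself is agnostic to how cross-frame correspondence is obtained.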

📝 Abstract
Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting the flexibility for object-centric, multi-round interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM model, capable of performing both object referring for input and grounding for output in video reasoning tasks, i.e., allowing users to interact with videos using both textual and visual prompts; (ii) we propose STOM (Spatial-Temporal Overlay Module), a novel approach that propagates arbitrary visual prompts input at any single timestamp to the remaining frames within a video; (iii) we present VideoInfer, a manually curated object-centric video instruction dataset featuring question-answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring object segmentation. The results on 12 benchmarks of 6 tasks show that our proposed model consistently outperforms baselines in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. Project page: https://qirui-chen.github.io/RGA3-release/.
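
To make the interaction described in the abstract more concrete, here is a hypothetical interface sketch (names, types, and signatures are assumptions, not the released code): the user supplies video frames, a question, and an optional visual prompt at a single timestamp; the model returns a textual answer together with per-frame segmentation masks grounding the referenced object.

```python
# Hypothetical interface for an object-centric VideoLLM that accepts a visual
# prompt (referring) and returns grounded outputs; not the project's actual API.
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class VisualPrompt:
    timestamp: int            # index of the frame the user annotated
    mask: np.ndarray          # binary (H, W) mask marking the object of interest

@dataclass
class GroundedAnswer:
    text: str                 # natural-language answer
    masks: List[np.ndarray]   # one (H, W) segmentation mask per frame

def answer_question(frames: List[np.ndarray],
                    question: str,
                    prompt: Optional[VisualPrompt] = None) -> GroundedAnswer:
    """Placeholder: a real model would propagate the prompt across frames
    (e.g. via STOM), run the VideoLLM, and decode text plus segmentation."""
    h, w = frames[0].shape[:2]
    empty = [np.zeros((h, w), dtype=bool) for _ in frames]
    return GroundedAnswer(text="(model output)", masks=empty)

# Usage: ask about the object marked at frame 12 of a 32-frame clip.
frames = [np.zeros((8, 8, 3), dtype=np.uint8) for _ in range(32)]
circle = np.zeros((8, 8), dtype=bool)
circle[2:6, 2:6] = True
ans = answer_question(frames, "What is this object used for?",
                      VisualPrompt(timestamp=12, mask=circle))
```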
Problem

Research questions and friction points this paper is trying to address.

Enabling object-centric video interactions with multimodal prompts
Propagating visual prompts across video frames efficiently
Improving video QA and segmentation via robust object understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

VideoLLM with object referring and grounding
STOM for spatial-temporal prompt propagation
VideoInfer dataset for object-centric QA
Authors

Haochen Wang
University of Amsterdam
Qirui Chen
Shanghai Jiao Tong University
Cilin Yan
Xiaohongshu Inc.
Jiayin Cai
Xiaohongshu Inc.
Xiaolong Jiang
Xiaohongshu Inc.
Yao Hu
Zhejiang University (Machine Learning)
Weidi Xie
Shanghai Jiao Tong University | VGG, University of Oxford (Computer Vision, AI for Healthcare, AI for Science)
Stratis Gavves
University of Amsterdam