SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

📅 2025-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) face two key bottlenecks in video spatio-temporal grounding: inaccurate frame-level spatio-temporal feature extraction and misalignment between visual tokens and spatial coordinates. To address these, the authors propose SpaceVLLM, an MLLM that introduces a set of interleaved Spatio-Temporal Aware Queries and a Query-Guided Space Decoder for fine-grained, frame-level alignment between queries and spatial coordinates. They further construct Uni-STG, a large-scale Unified Spatio-Temporal Grounding dataset (480K instances spanning three tasks), and train the model jointly across temporal and spatial localization. Evaluated on 11 diverse benchmarks covering temporal localization, spatial detection, spatio-temporal grounding, and general video understanding, SpaceVLLM achieves state-of-the-art performance on all of them, significantly advancing MLLMs' joint spatio-temporal reasoning capability.

📝 Abstract
Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. Firstly, it is difficult to extract accurate spatio-temporal information of each frame in the video. Secondly, the substantial number of visual tokens makes it challenging to precisely map visual tokens of each frame to their corresponding spatial coordinates. To address these issues, we introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability. Specifically, we adopt a set of interleaved Spatio-Temporal Aware Queries to capture temporal perception and dynamic spatial information. Moreover, we propose a Query-Guided Space Decoder to establish a corresponding connection between the queries and spatial coordinates. Additionally, due to the lack of spatio-temporal datasets, we construct the Unified Spatio-Temporal Grounding (Uni-STG) dataset, comprising 480K instances across three tasks. This dataset fully exploits the potential of MLLMs to simultaneously facilitate localization in both temporal and spatial dimensions. Extensive experiments demonstrate that SpaceVLLM achieves state-of-the-art performance across 11 benchmarks covering temporal, spatial, spatio-temporal and video understanding tasks, highlighting the effectiveness of our approach. Our code, datasets and model will be released.
Problem

Research questions and friction points this paper is trying to address.

Enhance spatio-temporal video grounding in MLLMs.
Address challenges in extracting accurate spatio-temporal information.
Develop a dataset to support spatio-temporal localization tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatio-Temporal Aware Queries for video grounding
Query-Guided Space Decoder for spatial mapping
Unified Spatio-Temporal Grounding dataset creation
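The two architectural ideas above can be illustrated with a minimal numerical sketch: one learnable query per frame attends over that frame's visual tokens (standing in for the interleaved Spatio-Temporal Aware Queries), and a lightweight head maps each pooled query to a normalized bounding box (standing in for the Query-Guided Space Decoder). All shapes, names, and the single-head attention here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: T frames, N_vis visual tokens per frame, feature dim D
T, N_vis, D = 8, 16, 32
rng = np.random.default_rng(0)

frame_tokens = rng.normal(size=(T, N_vis, D))  # per-frame visual tokens
st_queries = rng.normal(size=(T, D))           # one spatio-temporal query per frame

# Query-guided cross-attention: each frame's query attends only to
# that frame's tokens, pooling dynamic spatial information
scores = np.einsum('td,tnd->tn', st_queries, frame_tokens) / np.sqrt(D)
attn = softmax(scores)                          # (T, N_vis), rows sum to 1
pooled = np.einsum('tn,tnd->td', attn, frame_tokens)  # (T, D)

# Toy "space decoder" head: a linear map plus sigmoid producing a
# normalized box (cx, cy, w, h) in [0, 1] for every frame
W_box = rng.normal(size=(D, 4)) * 0.1
boxes = 1.0 / (1.0 + np.exp(-(pooled @ W_box)))  # (T, 4)

print(boxes.shape)  # one box per frame
```

The point of the sketch is the per-frame correspondence: because each query is tied to exactly one frame, its output coordinates are unambiguously grounded in time, which is the alignment problem the Query-Guided Space Decoder is designed to solve.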
Jiankang Wang
University of Science and Technology of China
Zhihan Zhang
PhD student, University of Notre Dame
Natural Language Processing
Zhihang Liu
University of Science and Technology of China
Yang Li
Renmin University of China
Jiannan Ge
University of Science and Technology of China
Zero-shot Learning · Open-vocabulary Segmentation · Multi-modal Learning
Hongtao Xie
University of Science and Technology of China
Yongdong Zhang
University of Science and Technology of China