Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) struggle to accurately model the spatial configurations and dynamic motion relationships among multiple objects in video spatio-temporal reasoning, in particular neglecting physical constraints, which limits their applicability in high-precision domains such as embodied intelligence and VR. To address this, the paper proposes a graph-guided spatio-temporal reasoning framework: (1) a relational graph explicitly encodes spatial, temporal, and physical interactions among objects; (2) a graph-based Group Relative Policy Optimization (GRPO) method, trained with Reinforcement Learning with Verifiable Rewards (RLVR), enables topology-aware reasoning during inference; and (3) STV-205k, a large-scale video understanding dataset specifically designed for dynamic multi-object physical relations (205k question-answer pairs). The approach outperforms the base model by 13% on STI-Bench and significantly improves spatio-temporal reasoning accuracy.
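The core of GRPO, as used here with verifiable rewards, is a group-relative advantage: several responses are sampled per question, each is scored by a checkable reward, and each response's advantage is its reward standardized against the group. The sketch below is a generic illustration of that computation, not the paper's released code; the exact-match `verifiable_reward` function is a hypothetical stand-in for the paper's reward design.

```python
import statistics


def verifiable_reward(response: str, ground_truth: str) -> float:
    # Hypothetical verifiable reward: 1.0 for an exact answer match, else 0.0.
    return 1.0 if response.strip() == ground_truth.strip() else 0.0


def grpo_advantages(responses, ground_truth):
    """Group-relative advantages for one prompt, GRPO-style.

    Each of the G sampled responses is scored with a verifiable reward;
    the advantage of response i is its reward minus the group mean,
    divided by the group standard deviation (epsilon avoids div-by-zero).
    """
    rewards = [verifiable_reward(r, ground_truth) for r in responses]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-6) for r in rewards]


# Example: four sampled answers to one spatio-temporal question.
group = ["3 meters", "5 meters", "3 meters", "3 meters"]
adv = grpo_advantages(group, "3 meters")
```

Responses that answer correctly receive positive advantage and are reinforced; incorrect ones receive negative advantage, with no learned reward model required.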

📝 Abstract
Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated strong semantic understanding capabilities, but struggles to perform precise spatio-temporal understanding. Existing spatio-temporal methods primarily focus on the video itself, while overlooking the physical information within the video, such as multi-object layouts and motion. Such limitations restrict the use of MLLMs in downstream applications that demand high precision, including embodied intelligence and VR. To address this issue, we present Video-STR, a novel graph-based reinforcement method for precise Video Spatio-Temporal Reasoning. Building upon the capacity of Reinforcement Learning with Verifiable Reward (RLVR) to improve model abilities, we introduce a reasoning mechanism using graph-based Group Relative Policy Optimization (GRPO) method to guide the model in inferring the underlying spatio-temporal topology of scenarios during the thinking process. To resolve the lack of spatio-temporal training data, we construct the STV-205k dataset with 205k question-answering pairs, covering dynamic multi-object scenes in both indoor and outdoor environments, to support the model training. Experiments show that Video-STR achieves state-of-the-art results on various benchmarks, outperforming the base model by 13% on STI-Bench, and demonstrating the effectiveness of our approach and dataset. Code, model, and data will be released.
Problem

Research questions and friction points this paper is trying to address.

Enhancing MLLMs' precise video spatio-temporal reasoning capabilities
Addressing limitations in understanding multi-object layouts and motion
Improving performance for embodied intelligence and VR applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based reinforcement method for video reasoning
Group Relative Policy Optimization for spatio-temporal topology
Construction of the STV-205k dataset to address the lack of spatio-temporal training data
Authors

Wentao Wang (ByteDance)
Heqing Zou (NTU)
Tianze Luo (ByteDance)
Guiyang Xie (ByteDance)
Rui Huang (NUS)
Yutian Zhao (NUS)
Zhuochen Wang (ByteDance)
Hansheng Zhang (ByteDance)
Chengwei Qin (HKUST(GZ), NTU)
Yan Wang (THU)
Lin Zhao (NUS)
Huaijian Zhang (ByteDance)