🤖 AI Summary
Current vision-language models (VLMs) exhibit limited capability in spatiotemporal reasoning—particularly in kinematic analysis of object motion (e.g., distance, velocity, direction). To address this, we propose ST-VLM: a novel framework for spatiotemporal visual reasoning. First, we introduce STKit, the first large-scale video dataset with fine-grained 3D motion annotations, and STKit-Bench, a dedicated benchmark for evaluating spatiotemporal reasoning. Second, we design a 4D scene reconstruction–based pipeline to automatically generate high-fidelity pseudo-labels for motion kinematics, alleviating the scarcity of 3D-annotated videos. Third, we propose kinematic instruction tuning—a targeted fine-tuning strategy that enhances multi-step dynamic relational reasoning. Experiments demonstrate that ST-VLM achieves significant gains on STKit-Bench over state-of-the-art VLMs and attains new SOTA performance on cross-domain benchmarks including ActivityNet and TVQA+, validating its generalizability and practical utility.
📝 Abstract
Spatio-temporal reasoning is essential for understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements like the traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline to generate pseudo-labels using 4D reconstruction in real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.
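To make the kinematic quantities concrete: given a metric-scale 3D trajectory (e.g., from the 4D-reconstruction pseudo-labeling pipeline), traveled distance, mean speed, and net movement direction follow from simple geometry. The sketch below is illustrative only, not the paper's implementation; the function name, the frame-rate parameter, and the assumption that y is the up axis are ours.

```python
import math

def kinematic_labels(positions, fps=10.0):
    """Illustrative pseudo-label computation (not from the paper).

    positions: list of (x, y, z) object centroids in meters, one per frame.
    Returns traveled distance (m), mean speed (m/s), and net movement
    direction as a unit vector in the ground (x-z) plane, assuming y is up.
    """
    # Traveled distance: sum of per-frame displacement magnitudes.
    traveled = 0.0
    for p0, p1 in zip(positions, positions[1:]):
        traveled += math.dist(p0, p1)

    # Mean speed over the clip duration.
    duration = (len(positions) - 1) / fps
    speed = traveled / duration if duration > 0 else 0.0

    # Net movement direction from first to last frame, projected to x-z.
    dx = positions[-1][0] - positions[0][0]
    dz = positions[-1][2] - positions[0][2]
    norm = math.hypot(dx, dz)
    direction = (dx / norm, dz / norm) if norm > 0 else (0.0, 0.0)
    return traveled, speed, direction

# Straight-line motion along +x, 1 m per frame at 10 fps -> 10 m over 1 s.
traj = [(float(t), 0.0, 0.0) for t in range(11)]
dist, spd, dirn = kinematic_labels(traj, fps=10.0)
```

Quantities like inter-object distance comparisons or relative movement direction would compose the same primitives across two trajectories, which is why metric (real-world) scale in the 4D reconstruction matters: without it, distances and speeds are only defined up to an unknown scale factor.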