🤖 AI Summary
Current vision-language models (VLMs) exhibit limited capability in spatiotemporal reasoning—particularly in kinematic analysis of object motion (e.g., distance, velocity, direction). To address this, we propose ST-VLM: a novel framework for spatiotemporal visual reasoning. First, we introduce STKit, the first large-scale video dataset with fine-grained 3D motion annotations, and STKit-Bench, a dedicated benchmark for evaluating spatiotemporal reasoning. Second, we design a 4D scene reconstruction–based pipeline to automatically generate high-fidelity pseudo-labels for motion kinematics, alleviating the scarcity of 3D-annotated videos. Third, we propose kinematic instruction tuning—a targeted fine-tuning strategy that enhances multi-step dynamic relational reasoning. Experiments demonstrate that ST-VLM achieves significant gains on STKit-Bench over state-of-the-art VLMs and attains new SOTA performance on cross-domain benchmarks including ActivityNet and TVQA+, validating its generalizability and practical utility.
📝 Abstract
Spatio-temporal reasoning is essential for understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements like the traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline to generate pseudo-labels using 4D reconstruction in real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.
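To make the kinematic quantities concrete: given a metric-scale 3D trajectory (e.g., from the 4D-reconstruction pseudo-labeling pipeline), traveled distance, mean speed, and net movement direction follow from simple geometry. The sketch below is illustrative only, not the paper's implementation; the function name, the frame-rate parameter, and the assumption that y is the up axis are ours.

```python
import math

def kinematic_labels(positions, fps=10.0):
    """Illustrative pseudo-label computation (not from the paper).

    positions: list of (x, y, z) object centroids in meters, one per frame.
    Returns traveled distance (m), mean speed (m/s), and net movement
    direction as a unit vector in the ground (x-z) plane, assuming y is up.
    """
    # Traveled distance: sum of per-frame displacement magnitudes.
    traveled = 0.0
    for p0, p1 in zip(positions, positions[1:]):
        traveled += math.dist(p0, p1)

    # Mean speed over the clip duration.
    duration = (len(positions) - 1) / fps
    speed = traveled / duration if duration > 0 else 0.0

    # Net movement direction from first to last frame, projected to x-z.
    dx = positions[-1][0] - positions[0][0]
    dz = positions[-1][2] - positions[0][2]
    norm = math.hypot(dx, dz)
    direction = (dx / norm, dz / norm) if norm > 0 else (0.0, 0.0)
    return traveled, speed, direction

# Straight-line motion along +x, 1 m per frame at 10 fps -> 10 m over 1 s.
traj = [(float(t), 0.0, 0.0) for t in range(11)]
dist, spd, dirn = kinematic_labels(traj, fps=10.0)
```

Quantities like inter-object distance comparisons or relative movement direction would compose the same primitives across two trajectories, which is why metric (real-world) scale in the 4D reconstruction matters: without it, distances and speeds are only defined up to an unknown scale factor.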