🤖 AI Summary
This work addresses a fundamental limitation of current vision-language models (VLMs): their poor spatio-temporal consistency in discerning fine-grained motion attributes, which is critical for applications such as navigation, robotics, and autonomous driving. To this end, the authors propose ReMoT, a unified training paradigm that introduces a rule-based method to automatically construct ReMoT-16K, a large-scale dataset of 16.5K motion contrastive triplets. They further apply Group Relative Policy Optimization (GRPO), which substantially outperforms conventional supervised fine-tuning on this task. Additionally, they establish the first fine-grained motion contrastive evaluation benchmark. Experimental results demonstrate that ReMoT achieves state-of-the-art performance on this new benchmark as well as on multiple standard VLM tasks, with up to a 25.1% improvement in spatio-temporal reasoning capability.
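To make the dataset construction concrete, here is a minimal sketch of how a rule-based pipeline could turn a video meta-annotation into one motion contrastive triplet by flipping a single motion attribute. The schema (`MotionAnnotation`, `OPPOSITE_DIRECTION`, `build_contrastive_triplet`) and the field names are hypothetical illustrations, not the paper's actual format:

```python
from dataclasses import dataclass

# Hypothetical meta-annotation schema; the paper's actual fields may differ.
@dataclass
class MotionAnnotation:
    video_id: str
    subject: str    # e.g. "red car"
    direction: str  # e.g. "left", "right", "up", "down"

# Rule table mapping each direction to its opposite (assumed for illustration).
OPPOSITE_DIRECTION = {"left": "right", "right": "left", "up": "down", "down": "up"}

def build_contrastive_triplet(ann: MotionAnnotation) -> dict:
    """Rule-based construction of one (question, positive, hard-negative) triplet.

    The negative flips exactly one fine-grained motion attribute (here, the
    direction), so a model must discriminate subtle motion differences rather
    than rely on static appearance cues.
    """
    question = f"In video {ann.video_id}, which way does the {ann.subject} move?"
    positive = f"The {ann.subject} moves {ann.direction}."
    negative = f"The {ann.subject} moves {OPPOSITE_DIRECTION[ann.direction]}."
    return {"question": question, "positive": positive, "negative": negative}

if __name__ == "__main__":
    ann = MotionAnnotation(video_id="vid_0001", subject="red car", direction="left")
    print(build_contrastive_triplet(ann))
```

Because such rules operate directly on existing meta-annotations, they scale to tens of thousands of triplets without the per-example cost of manual labeling or model-based generation.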
📄 Abstract
We present ReMoT, a unified training paradigm that systematically addresses the fundamental shortcomings of VLMs in spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) a rule-based automatic framework that generates ReMoT-16K, a large-scale motion-contrast dataset of 16.5K triplets derived from video meta-annotations, surpassing costly manual or model-based generation; and (2) Group Relative Policy Optimization, which we empirically show yields the best performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark of fine-grained motion contrast triplets to measure a VLM's ability to discriminate subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and on multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
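For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation that distinguishes it from PPO-style training: instead of a learned value baseline, each sampled response's reward is standardized against the other responses in its group. This is the generic GRPO formulation with an assumed binary rule-based reward, not the authors' exact implementation:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Standardize rewards within a group of responses sampled for one prompt.

    GRPO uses the group mean and standard deviation as the baseline, so no
    separate value network is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Example: four answers sampled for one motion-contrast question, each
# rewarded 1.0 if it names the correct motion attribute, else 0.0
# (the reward rule is an assumption for illustration).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

Responses with above-average reward in their group receive positive advantages and are reinforced; the contrastive triplets make the reward rule easy to check automatically, which is one reason this setup can be more data-efficient than supervised fine-tuning on fixed target answers.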