🤖 AI Summary
This work addresses a fundamental limitation of current vision-language models (VLMs): their poor spatio-temporal consistency in discerning fine-grained motion attributes, which is critical for applications such as navigation, robotics, and autonomous driving. To this end, the authors propose ReMoT, a unified training paradigm that introduces a rule-based method to automatically construct ReMoT-16K, a large-scale dataset of 16.5K motion contrastive triplets. They further apply Group Relative Policy Optimization (GRPO), which substantially outperforms conventional supervised fine-tuning on this task. Additionally, they establish the first fine-grained motion contrastive evaluation benchmark. Experimental results demonstrate that ReMoT achieves state-of-the-art performance on this new benchmark as well as on multiple standard VLM tasks, with up to a 25.1% improvement in spatio-temporal reasoning capability.
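To make the dataset construction concrete, here is a minimal sketch of how a rule-based pipeline could turn a video meta-annotation into one motion contrastive triplet by flipping a single motion attribute. The schema (`MotionAnnotation`, `OPPOSITE_DIRECTION`, `build_contrastive_triplet`) and the field names are hypothetical illustrations, not the paper's actual format:

```python
from dataclasses import dataclass

# Hypothetical meta-annotation schema; the paper's actual fields may differ.
@dataclass
class MotionAnnotation:
    video_id: str
    subject: str    # e.g. "red car"
    direction: str  # e.g. "left", "right", "up", "down"

# Rule table mapping each direction to its opposite (assumed for illustration).
OPPOSITE_DIRECTION = {"left": "right", "right": "left", "up": "down", "down": "up"}

def build_contrastive_triplet(ann: MotionAnnotation) -> dict:
    """Rule-based construction of one (question, positive, hard-negative) triplet.

    The negative flips exactly one fine-grained motion attribute (here, the
    direction), so a model must discriminate subtle motion differences rather
    than rely on static appearance cues.
    """
    question = f"In video {ann.video_id}, which way does the {ann.subject} move?"
    positive = f"The {ann.subject} moves {ann.direction}."
    negative = f"The {ann.subject} moves {OPPOSITE_DIRECTION[ann.direction]}."
    return {"question": question, "positive": positive, "negative": negative}

if __name__ == "__main__":
    ann = MotionAnnotation(video_id="vid_0001", subject="red car", direction="left")
    print(build_contrastive_triplet(ann))
```

Because such rules operate directly on existing meta-annotations, they scale to tens of thousands of triplets without the per-example cost of manual labeling or model-based generation.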
📄 Abstract
We present ReMoT, a unified training paradigm that systematically addresses the fundamental shortcomings of VLMs in spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) a rule-based automatic framework that generates ReMoT-16K, a large-scale motion-contrast dataset of 16.5K triplets derived from video meta-annotations, surpassing costly manual or model-based generation; and (2) Group Relative Policy Optimization, which we empirically show yields the best performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark of fine-grained motion contrast triplets to measure a VLM's ability to discriminate subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and on multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
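For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation that distinguishes it from PPO-style training: instead of a learned value baseline, each sampled response's reward is standardized against the other responses in its group. This is the generic GRPO formulation with an assumed binary rule-based reward, not the authors' exact implementation:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Standardize rewards within a group of responses sampled for one prompt.

    GRPO uses the group mean and standard deviation as the baseline, so no
    separate value network is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Example: four answers sampled for one motion-contrast question, each
# rewarded 1.0 if it names the correct motion attribute, else 0.0
# (the reward rule is an assumption for illustration).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

Responses with above-average reward in their group receive positive advantages and are reinforced; the contrastive triplets make the reward rule easy to check automatically, which is one reason this setup can be more data-efficient than supervised fine-tuning on fixed target answers.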