ReMoT: Reinforcement Learning with Motion Contrast Triplets

πŸ“… 2026-02-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a fundamental limitation of current vision-language models (VLMs): their poor spatio-temporal consistency when discerning fine-grained motion attributes, which is critical for applications such as navigation, robotics, and autonomous driving. To this end, the authors propose ReMoT, a unified training paradigm that introduces a rule-based method to automatically construct ReMoT-16K, a large-scale dataset of 16.5K motion contrastive triplets. They further apply Group Relative Policy Optimization (GRPO), which substantially outperforms conventional supervised fine-tuning for this task. Additionally, they establish the first fine-grained motion contrastive evaluation benchmark. Experimental results demonstrate that ReMoT achieves state-of-the-art performance on this new benchmark as well as multiple standard VLM tasks, with up to a 25.1% improvement in spatio-temporal reasoning capability.
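The summary does not spell out how the rule-based pipeline turns video meta-annotations into contrast triplets. The following is a minimal sketch of one plausible data structure and construction rule; the field names (video_id, attribute, positive/negative captions) and the direction-flipping rule are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a rule-based motion-contrast triplet built from video
# meta-annotations. Field names and the flipping rule are assumptions for
# illustration only; they are not taken from the paper.
from dataclasses import dataclass


@dataclass
class MotionContrastTriplet:
    video_id: str   # source clip the triplet is anchored to
    attribute: str  # fine-grained motion attribute, e.g. "direction"
    anchor: str     # question about the motion in the clip
    positive: str   # caption consistent with the annotated motion
    negative: str   # caption with the attribute flipped (the contrast)


def build_direction_triplet(video_id: str, subject: str, direction: str) -> MotionContrastTriplet:
    """Rule-based construction: flip the annotated direction to obtain the negative caption."""
    opposites = {"left": "right", "right": "left", "forward": "backward", "backward": "forward"}
    return MotionContrastTriplet(
        video_id=video_id,
        attribute="direction",
        anchor=f"In which direction is the {subject} moving?",
        positive=f"The {subject} is moving {direction}.",
        negative=f"The {subject} is moving {opposites[direction]}.",
    )


# Example: a clip annotated with a car moving left yields a left/right contrast pair.
triplet = build_direction_triplet("clip_0001", "car", "left")
print(triplet.positive, "|", triplet.negative)
```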

πŸ“ Abstract
We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
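To make the training choice concrete: GRPO removes the learned value baseline of PPO-style methods and instead standardizes each response's reward against the other responses sampled for the same prompt. The sketch below shows only that group-relative advantage step, assuming a scalar rule-based reward per sampled answer; it illustrates the general algorithm, not the paper's implementation.

```python
# Minimal sketch of the group-relative advantage used in GRPO, assuming one
# scalar reward per sampled response. Illustrative only, not the paper's code.
import numpy as np


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize rewards within a group of responses sampled for the same prompt.

    GRPO replaces a learned value baseline with the group mean and standard
    deviation, so each response's advantage is its reward relative to its group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: four sampled answers to one motion-contrast question, rewarded 1 if
# the answer matches the annotated motion attribute and 0 otherwise.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # positive for correct answers, negative otherwise
```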
Problem

Research questions and friction points this paper is trying to address.

spatio-temporal consistency
visual language models
motion contrast
temporal reasoning
autonomous driving
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReMoT
motion contrast triplets
Group Relative Policy Optimization
spatio-temporal consistency
visual language models
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Cong Wan
Xi'an Jiaotong University
AIGC, 3D, diffusion
Zeyu Guo
The Ohio State University
Theoretical computer science
Jiangyang Li
Xi'an Jiaotong University
SongLin Dong
Faculty of Computility Microelectronics, Shenzhen University of Advanced Technology
Yifan Bai
Alibaba DAMO Academy
Embodied Intelligence, Autonomous Driving, Visual Generation, AI for Medicine
Lin Peng
Xi'an Jiaotong University
Zhiheng Ma
Faculty of Computility Microelectronics, Shenzhen University of Advanced Technology
Yihong Gong
Xi'an Jiaotong University
Multimedia content analysis, Machine learning, Pattern recognition