Motion-o: Trajectory-Grounded Video Reasoning

📅 2026-03-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of explicit object-motion trajectory modeling in existing video reasoning methods, which leaves motion patterns in temporal observations difficult to verify. It formalizes Spatial-Temporal-Trajectory (STT) reasoning for the first time, introducing an explicit trajectory representation and a Motion Chain of Thought (MCoT) reasoning pathway. To provide strong supervision, the authors construct a trajectory-annotated dataset with dense trajectory-level bounding box tracks and propose a motion-aware training mechanism, built on a vision-evidence-based reward function, that requires no architectural modifications. Experiments demonstrate that the approach significantly improves spatio-temporal localization and trajectory prediction while remaining fully compatible with existing video understanding frameworks, validating the critical role of motion reasoning in evidence-driven video comprehension.
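The "vision-evidence-based reward function" mentioned above is not detailed on this page. As a minimal sketch only, a trajectory-level reward could score the overlap between the model's predicted bounding box track and the annotated track; the mean-IoU formulation below is an assumption, not the paper's actual reward.

```python
# Minimal sketch of a trajectory-level reward, assuming boxes are
# (x1, y1, x2, y2) tuples aligned frame-by-frame between the model's
# predicted track and the annotated reference track. Mean IoU is one
# plausible vision-evidence signal; the paper's formulation may differ.

def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def trajectory_reward(pred_track, ref_track):
    """Mean per-frame IoU over a track; 0 for empty or mismatched tracks."""
    if len(pred_track) != len(ref_track) or not ref_track:
        return 0.0
    ious = [box_iou(p, r) for p, r in zip(pred_track, ref_track)]
    return sum(ious) / len(ious)
```

An RL-style trainer could then use this scalar to reinforce reasoning traces whose claimed trajectories actually match the visual evidence.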

📝 Abstract
Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about how objects move between observations: no prior work has articulated motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset that expands sparse keyframe supervision via augmentation into denser bounding box tracks, yielding a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories explicit through discrete <motion/> tags, each summarizing per-object direction, speed, and scale-of-velocity change, to connect grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence, all while requiring no architectural modifications. Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at https://github.com/ostadabbas/Motion-o.
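To make the MCoT idea concrete, the sketch below shows one way a <motion/>-tagged reasoning trace could be serialized and parsed. The attribute names (obj, direction, speed, scale) and the surrounding prose are hypothetical; the abstract only specifies that each tag summarizes per-object direction, speed, and scale-of-velocity change.

```python
# Sketch of parsing a Motion Chain of Thought trace, assuming <motion/>
# tags carry per-object motion attributes. The tag schema shown here is
# an illustration, not the paper's actual format.
import re

mcot_trace = (
    "The ball appears at the left edge in frame 3. "
    '<motion obj="ball" direction="right" speed="fast" scale="shrinking"/> '
    "By frame 12 it is near the goal, moving away from the camera."
)

TAG = re.compile(r'<motion\s+([^/>]*)/>')
ATTR = re.compile(r'(\w+)="([^"]*)"')

def parse_motion_tags(text):
    """Extract each <motion/> tag as a dict of its attributes."""
    return [dict(ATTR.findall(m.group(1))) for m in TAG.finditer(text)]

print(parse_motion_tags(mcot_trace))
# [{'obj': 'ball', 'direction': 'right', 'speed': 'fast', 'scale': 'shrinking'}]
```

Keeping motion summaries in discrete, machine-parseable tags is what makes the trajectory claims verifiable: a checker can compare each parsed tag against the grounded bounding box track rather than against free-form prose.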
Problem

Research questions and friction points this paper is trying to address.

video reasoning
object trajectory
motion patterns
spatio-temporal grounding
trajectory understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

trajectory-grounded reasoning
motion-centric video understanding
Motion Chain of Thought
visual language models
spatio-temporal grounding