What Happens Next? Anticipating Future Motion by Generating Point Trajectories

📅 2025-09-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses motion forecasting from a single image: predicting dense future trajectories of scene objects directly from a static image, without access to auxiliary physical quantities such as object velocities or applied forces. The authors propose a conditional generative model that outputs a *trajectory grid*, avoiding the redundant pixel-level modeling typical of video generation and synthesizing structured motion fields end to end. The approach captures scene-wide dynamics and motion uncertainty. The model closely follows modern video generation architectures and is trained jointly on synthetic physics-based simulations and real-world scenes. The reported results show more accurate and diverse predictions than prior regression- and generation-based methods on simulated data, along with promising accuracy on real-world intuitive physics datasets. The method is also shown to be effective in downstream applications such as robotics.

📝 Abstract

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.
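To make the idea of a "dense trajectory grid" concrete, the following is a minimal sketch of one way such an output could be stored: a regular grid of query points in the image, each tracked over a number of future time steps. The array layout `(T, Gh, Gw, 2)`, the function names, and the constant-velocity stand-in for a model are all assumptions made for illustration; the abstract does not specify the paper's actual tensor layout or architecture.

```python
import numpy as np

def make_query_grid(height, width, stride):
    """Return a (Gh, Gw, 2) array of (x, y) pixel coordinates, one every `stride` pixels."""
    ys = np.arange(stride // 2, height, stride, dtype=np.float32)
    xs = np.arange(stride // 2, width, stride, dtype=np.float32)
    grid_x, grid_y = np.meshgrid(xs, ys)  # both have shape (len(ys), len(xs))
    return np.stack([grid_x, grid_y], axis=-1)

def constant_velocity_grid(query, velocity, num_steps):
    """Hypothetical stand-in for a learned model: every point moves with the same velocity.

    Returns a trajectory grid of shape (T, Gh, Gw, 2), where entry [t, i, j]
    is the (x, y) position of grid point (i, j) at future step t.
    """
    steps = np.arange(num_steps, dtype=np.float32).reshape(-1, 1, 1, 1)
    return query[None] + steps * np.asarray(velocity, dtype=np.float32)

# A 64x96 image sampled every 8 pixels, with all points moving 2 px/step downward.
query = make_query_grid(height=64, width=96, stride=8)
traj = constant_velocity_grid(query, velocity=(0.0, 2.0), num_steps=5)
print(query.shape, traj.shape)
```

The point of the layout is that the output holds only two numbers per tracked point per time step, rather than a full RGB frame per step, which is the contrast with pixel-level video generation that the abstract draws.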

Problem

Research questions and friction points this paper is trying to address.

Anticipating object motion from single images

Generating dense trajectory grids instead of pixels

Overcoming limitations of video generators in motion forecasting

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates dense trajectory grids for motion

Models scene-wide dynamics and uncertainty

Directly outputs motion instead of pixels
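Because a generative model can be sampled several times for the same image, accuracy and diversity can be measured separately. The sketch below shows two common ways to do this for a set of sampled trajectory grids: a best-of-K error against one observed future, and a mean pairwise distance between samples. Both function names and the `(K, T, Gh, Gw, 2)` layout are assumptions for illustration and are not taken from the paper's evaluation protocol.

```python
import numpy as np

def best_of_k_error(samples, target):
    """Lowest mean point-wise Euclidean error among K sampled futures.

    samples: (K, T, Gh, Gw, 2) sampled trajectory grids
    target:  (T, Gh, Gw, 2) observed trajectory grid
    """
    dists = np.linalg.norm(samples - target[None], axis=-1)  # (K, T, Gh, Gw)
    per_sample = dists.mean(axis=(1, 2, 3))                  # (K,)
    return float(per_sample.min())

def sample_diversity(samples):
    """Mean point-wise Euclidean distance over all unordered pairs of samples."""
    k = samples.shape[0]
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += np.linalg.norm(samples[i] - samples[j], axis=-1).mean()
            pairs += 1
    return float(total / pairs) if pairs else 0.0

# Three toy "samples": the target itself and two copies shifted by fixed offsets.
target = np.zeros((3, 2, 2, 2), dtype=np.float32)
offsets = np.array([[0, 0], [3, 4], [6, 8]], dtype=np.float32)
samples = target[None] + offsets[:, None, None, None, :]

print(best_of_k_error(samples, target))
print(sample_diversity(samples))
```

A deterministic regressor corresponds to K = 1 with zero diversity, so a pair of measures like these is one way to compare it against a sampler that produces several plausible futures.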
