AI Summary
This work addresses the computational inefficiency of conventional video synthesis methods in long-horizon motion generation by proposing an efficient generative framework based on highly compressed motion embeddings. The approach learns an implicit motion representation from large-scale trajectory data, achieving a temporal compression ratio of 64×, and constructs a conditional flow-matching model within this compressed latent space to flexibly respond to text prompts or spatial perturbations. Notably, it is the first method to directly model dynamics in the compressed embedding space, substantially improving both generation efficiency and controllability. Experiments demonstrate that the generated motions surpass state-of-the-art video generation models and specialized motion synthesis approaches in terms of realism, diversity, and computational efficiency.
Abstract
Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
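To make the two ingredients above concrete, here is a minimal sketch of what a 64x temporal compression implies for sequence length and of a generic conditional flow-matching training objective. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the probability path, the network, or how conditions are encoded, so the straight-line interpolation path and the names `latent_length`, `flow_matching_pair`, `flow_matching_loss`, and `velocity_fn` are all hypothetical.

```python
import numpy as np


def latent_length(num_frames: int, compression: int = 64) -> int:
    """Latent time steps left after temporal compression (ceiling division).

    Assumption: the final partial window is padded rather than dropped.
    """
    return -(-num_frames // compression)


def flow_matching_pair(x1: np.ndarray, rng: np.random.Generator):
    """Sample a training triple (x_t, t, target) for flow matching.

    Assumption: a straight-line path from Gaussian noise x0 to a data
    latent x1, i.e. x_t = (1 - t) * x0 + t * x1, with velocity x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)
    # One time value per batch element, broadcast over the remaining axes.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return x_t, t, target


def flow_matching_loss(velocity_fn, x1, cond, rng) -> float:
    """Mean squared error between predicted and target velocity.

    velocity_fn(x_t, t, cond) stands in for a learned network; cond stands
    in for an encoded text prompt or poke (both hypothetical here).
    """
    x_t, t, target = flow_matching_pair(x1, rng)
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean((pred - target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = 256
    steps = latent_length(frames)            # 256 frames -> 4 latent steps
    x1 = rng.standard_normal((8, steps, 16))  # batch of toy motion latents
    cond = np.zeros((8, 32))                  # placeholder condition vectors

    def zero_velocity(x_t, t, cond):
        # Untrained stand-in that ignores its inputs.
        return np.zeros_like(x_t)

    print(steps, flow_matching_loss(zero_velocity, x1, cond, rng))
```

In this sketch the model only ever sees the short latent sequence, which is where the efficiency gain over full video synthesis would come from: sampling integrates the learned velocity field over a handful of latent steps instead of generating every frame.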