🤖 AI Summary
Contemporary video generation models often produce dynamically incoherent or physically implausible motion due to a disconnect between motion logic and physical laws. To address this, we propose a two-stage framework that decouples motion reasoning from image synthesis: (1) a scene-aware encoder generates semantic motion representations—e.g., segmentation or depth maps—that explicitly encode physical causal relationships; (2) a physics-constrained, conditional diffusion model synthesizes high-fidelity videos conditioned on these representations. Our approach enables the first controllable causal modeling of complex dynamical processes—including domino toppling and vehicle reaction behaviors. Evaluated on Physion and autonomous driving simulation datasets, our method achieves significant improvements in motion-physical consistency metrics. Both qualitative and quantitative results demonstrate accurate modeling of real-world physical evolutions such as collision, rolling, and tracking.
📝 Abstract
Numerous recent video generation models, also known as world models, have demonstrated the ability to generate plausible real-world videos. However, many studies have shown that these models often produce motion that lacks logical or physical coherence. In this paper, we revisit video generation models and find that single-stage approaches struggle to produce high-quality results while maintaining coherent motion reasoning. To address this issue, we propose **Motion Dreamer**, a two-stage video generation framework. In Stage I, the model generates an intermediate motion representation, such as a segmentation map or depth map, from the input image and motion conditions, focusing solely on the motion itself. In Stage II, the model uses this intermediate motion representation as a condition to generate a high-detail video. By decoupling motion reasoning from high-fidelity video synthesis, our approach enables more accurate and physically plausible motion generation. We validate the effectiveness of our approach on the Physion dataset and in autonomous driving scenarios. For example, given a single push, our model can synthesize the sequential toppling of a set of dominoes. Similarly, by varying the movements of the ego vehicle, our model can produce different reactions in other vehicles. Our work opens new avenues for creating models that reason about physical interactions in a more coherent and realistic manner. Our webpage is available at https://envision-research.github.io/MotionDreamer/.
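The two-stage decoupling described above can be sketched as a minimal pipeline. This is an illustrative skeleton only, not the authors' implementation: the class names (`MotionReasoner`, `VideoSynthesizer`), the dictionary-based stand-ins for maps and frames, and the function signatures are all hypothetical placeholders for the learned Stage I and Stage II models.

```python
class MotionReasoner:
    """Stage I (placeholder): predicts an intermediate motion
    representation (e.g., per-frame segmentation or depth maps)
    from an input image and a motion condition."""

    def predict(self, image, motion_condition, num_frames):
        # A real model would be a learned, scene-aware network;
        # here each "map" is just a dict tagging the frame index
        # and the conditioning signal.
        return [
            {"frame": t, "condition": motion_condition}
            for t in range(num_frames)
        ]


class VideoSynthesizer:
    """Stage II (placeholder): a conditional diffusion model that
    renders high-fidelity RGB frames conditioned on the Stage I
    motion representation."""

    def generate(self, image, motion_representation):
        # A real model would iteratively denoise video latents;
        # here we emit one stand-in frame per intermediate map.
        return [
            {"rgb_for_frame": rep["frame"], "source": image}
            for rep in motion_representation
        ]


def motion_dreamer(image, motion_condition, num_frames=16):
    """Decoupled pipeline: reason about motion first (Stage I),
    then synthesize pixels conditioned on that reasoning (Stage II)."""
    motion_rep = MotionReasoner().predict(image, motion_condition, num_frames)
    return VideoSynthesizer().generate(image, motion_rep)
```

The key design point the abstract argues for is visible in the structure: Stage II never sees the raw motion condition directly, only the intermediate representation, so motion reasoning and appearance synthesis are optimized as separate problems.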