🤖 AI Summary
Video diffusion models (VDMs) generate high-fidelity videos but often suffer from distorted dynamics and temporal ordering errors because they lack physical commonsense reasoning. To address this, we propose a physics-aware two-stage image-to-video generation framework. In the first stage, a vision-language model (VLM) performs cross-frame-consistent, coarse-grained motion planning via chain-of-thought reasoning augmented with physical knowledge. In the second stage, a controllable noise injection mechanism embeds physical constraints directly into the VDM sampling process, preserving visual fidelity while ensuring physical plausibility. Our approach is the first to integrate physics-driven chain-of-thought reasoning into VLM-based motion modeling and enables co-optimization of planning and generation. Experiments demonstrate significant improvements over state-of-the-art methods on multiple physical consistency benchmarks, particularly in dynamics accuracy and event temporal logic.
📝 Abstract
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the community's attention to their potential as world simulators. However, despite these capabilities, VDMs often fail to produce physically plausible videos because they inherently lack an understanding of physics, which results in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. Because the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: https://madaoer.github.io/projects/physically_plausible_video_generation.
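
The idea of pairing a rough motion plan with added noise can be illustrated with a small, purely hypothetical sketch. The abstract does not specify how trajectories are represented or where the noise enters the VDM, so everything below is an assumption made for illustration: the function names `interpolate_trajectory` and `relax_trajectory`, the `noise_ratio` and `scale` parameters, and the choice to perturb 2D positions directly are not taken from the paper.

```python
import random


def interpolate_trajectory(waypoints, num_frames):
    """Linearly interpolate coarse (x, y) waypoints to one position per frame.

    Hypothetical stand-in for a coarse motion plan; not the paper's planner.
    """
    if len(waypoints) < 2 or num_frames < 2:
        raise ValueError("need at least two waypoints and two frames")
    segments = len(waypoints) - 1
    positions = []
    for frame in range(num_frames):
        t = frame / (num_frames - 1) * segments  # position along the waypoint list
        i = min(int(t), segments - 1)            # index of the segment start
        frac = t - i                             # fraction travelled within the segment
        (x0, y0), (x1, y1) = waypoints[i], waypoints[i + 1]
        positions.append((x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)))
    return positions


def relax_trajectory(positions, noise_ratio, scale=1.0, seed=0):
    """Perturb each position with zero-mean Gaussian noise.

    noise_ratio in [0, 1] controls how loosely the plan is followed:
    0 keeps the plan exactly, larger values allow more deviation.
    This is an assumed illustration, not the paper's noise mechanism.
    """
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be in [0, 1]")
    rng = random.Random(seed)  # seeded so results are reproducible
    sigma = noise_ratio * scale
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for x, y in positions]


if __name__ == "__main__":
    plan = interpolate_trajectory([(0.0, 0.0), (10.0, 20.0)], num_frames=5)
    for planned, relaxed in zip(plan, relax_trajectory(plan, noise_ratio=0.5)):
        print(planned, relaxed)
```

With `noise_ratio=0.0` the plan is followed exactly, while larger values leave more room for deviation. That is the trade-off the abstract describes between a rough plan and finer generated detail, shown here in a toy setting only.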