🤖 AI Summary
Existing video diffusion models struggle to generate long-duration 4D dynamic scenes that are both physically consistent and spatiotemporally coherent. To address this, the authors propose Phys4D, the first framework to explicitly integrate physical consistency modeling into the 4D generation pipeline. It employs a three-stage progressive training strategy: first, large-scale pseudo-supervised pretraining establishes foundational geometry and motion priors; second, physics-aware fine-tuning leverages simulation data to enforce physical plausibility; and third, simulation-guided reinforcement learning corrects residual physical violations. The authors also introduce a comprehensive 4D world consistency evaluation suite encompassing geometric fidelity, motion stability, and long-term physical realism. Experiments demonstrate that Phys4D significantly enhances spatiotemporal detail and physical coherence while preserving strong generative capabilities, outperforming existing appearance-driven approaches across all metrics.
📝 Abstract
Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present \textbf{Phys4D}, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts \textbf{a three-stage training paradigm} that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a suite of \textbf{4D world consistency evaluations} that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/
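The three-stage paradigm above can be sketched schematically. This is a minimal illustration of the stage ordering only; every function name and the dictionary-based "model" are invented placeholders, as the abstract does not specify an implementation or API.

```python
# Hypothetical sketch of the three-stage Phys4D training pipeline described in
# the abstract. All identifiers here are placeholders, not the paper's code.

def pretrain_pseudo_supervised(model, videos):
    """Stage 1: bootstrap geometry and motion priors from large-scale
    pseudo-supervised data."""
    model["stages"].append("pseudo_supervised_pretraining")
    return model

def finetune_physics_grounded(model, sim_data):
    """Stage 2: supervised fine-tuning on simulation-generated data to
    enforce temporally consistent 4D dynamics."""
    model["stages"].append("physics_grounded_sft")
    return model

def refine_with_simulation_rl(model, simulator):
    """Stage 3: simulation-grounded reinforcement learning to correct
    residual physical violations."""
    model["stages"].append("simulation_grounded_rl")
    return model

def train_phys4d(model, videos, sim_data, simulator):
    # Stages are applied progressively, each building on the previous one.
    model = pretrain_pseudo_supervised(model, videos)
    model = finetune_physics_grounded(model, sim_data)
    return refine_with_simulation_rl(model, simulator)

model = train_phys4d({"stages": []}, videos=[], sim_data=[], simulator=None)
print(model["stages"])
```

The point of the sketch is the progressive structure: later stages assume the representations learned by earlier ones, which is why the three stages are sequential rather than interchangeable.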