🤖 AI Summary
Existing video diffusion models struggle to generate long-duration 4D dynamic scenes that are both physically consistent and spatiotemporally coherent. To address this, the authors propose Phys4D, the first framework to explicitly integrate physical consistency modeling into the 4D generation pipeline. It employs a three-stage progressive training strategy: first, large-scale pseudo-supervised pretraining establishes foundational geometry and motion priors; second, physics-aware fine-tuning leverages simulation data to enforce physical plausibility; and third, simulation-guided reinforcement learning corrects residual physical violations. The authors also introduce a comprehensive 4D world consistency evaluation suite encompassing geometric fidelity, motion stability, and long-term physical realism. Experiments demonstrate that Phys4D significantly enhances spatiotemporal detail and physical coherence while preserving strong generative capabilities, outperforming existing appearance-driven approaches across all metrics.
📝 Abstract
Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present \textbf{Phys4D}, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts \textbf{a three-stage training paradigm} that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a suite of \textbf{4D world consistency evaluations} that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/
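The three-stage paradigm above can be sketched schematically. This is a minimal illustration of the stage ordering only; every function name and the dictionary-based "model" are invented placeholders, as the abstract does not specify an implementation or API.

```python
# Hypothetical sketch of the three-stage Phys4D training pipeline described in
# the abstract. All identifiers here are placeholders, not the paper's code.

def pretrain_pseudo_supervised(model, videos):
    """Stage 1: bootstrap geometry and motion priors from large-scale
    pseudo-supervised data."""
    model["stages"].append("pseudo_supervised_pretraining")
    return model

def finetune_physics_grounded(model, sim_data):
    """Stage 2: supervised fine-tuning on simulation-generated data to
    enforce temporally consistent 4D dynamics."""
    model["stages"].append("physics_grounded_sft")
    return model

def refine_with_simulation_rl(model, simulator):
    """Stage 3: simulation-grounded reinforcement learning to correct
    residual physical violations."""
    model["stages"].append("simulation_grounded_rl")
    return model

def train_phys4d(model, videos, sim_data, simulator):
    # Stages are applied progressively, each building on the previous one.
    model = pretrain_pseudo_supervised(model, videos)
    model = finetune_physics_grounded(model, sim_data)
    return refine_with_simulation_rl(model, simulator)

model = train_phys4d({"stages": []}, videos=[], sim_data=[], simulator=None)
print(model["stages"])
```

The point of the sketch is the progressive structure: later stages assume the representations learned by earlier ones, which is why the three stages are sequential rather than interchangeable.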