🤖 AI Summary
Current video generation models often produce physically inconsistent outputs in large-scale or complex physical scenes, primarily because they respond isotropically to physical prompts and lack fine-grained alignment with local physical cues. To address this, we propose a progressive physics alignment framework featuring a two-stage Mixture-of-Physics-Experts (MoPE) mechanism: semantic experts capture macroscopic physical laws, while refinement experts enforce token-level dynamic physical constraints. Additionally, we introduce a text-driven physics prior extraction and cross-modal alignment strategy that transfers the physical reasoning capabilities of vision-language models into the generative process. Evaluated on multiple physics-aware video generation benchmarks, our method significantly outperforms state-of-the-art approaches, yielding substantial improvements in realism, dynamic coherence, and physical consistency.
📝 Abstract
Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
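The two-stage MoPE routing described above can be illustrated with a minimal sketch: a sequence-level gate (Semantic Experts) conditioned on the text prompt, followed by a per-token gate (Refinement Experts) over the resulting features. All dimensions, weights, and gating choices here are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper)
T, D = 6, 8            # number of video tokens, feature dim
E_sem, E_ref = 3, 4    # number of semantic / refinement experts

text_emb = rng.normal(size=D)        # stand-in for the prompt embedding
tokens = rng.normal(size=(T, D))     # stand-in for video token features

# Stage 1: Semantic Experts -- one sequence-level gate from the text,
# so every token shares the same mixture of semantic-level experts.
W_gate_sem = rng.normal(size=(D, E_sem))
sem_gate = softmax(text_emb @ W_gate_sem)            # (E_sem,)
sem_experts = rng.normal(size=(E_sem, D, D))
sem_out = sum(g * (tokens @ W) for g, W in zip(sem_gate, sem_experts))

# Stage 2: Refinement Experts -- a separate gate per token,
# allowing anisotropic, token-level physical refinement.
W_gate_ref = rng.normal(size=(D, E_ref))
ref_gate = softmax(sem_out @ W_gate_ref)             # (T, E_ref)
ref_experts = rng.normal(size=(E_ref, D, D))
expert_outs = np.stack([sem_out @ W for W in ref_experts])  # (E_ref, T, D)
ref_out = np.einsum('te,etd->td', ref_gate, expert_outs)    # (T, D)

print(ref_out.shape)  # (6, 8)
```

The key structural contrast is in the gates: stage one produces a single mixture shared across the sequence, while stage two assigns each token its own expert weighting, which is what lets local dynamics diverge from the global physical prior.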