ProPhy: Progressive Physical Alignment for Dynamic World Simulation

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video generation models often produce physically inconsistent outputs in large-scale or complex physical scenes, primarily because they respond only isotropically to global physical prompts and lack fine-grained alignment with local physical cues. To address this, we propose a progressive physics alignment framework built on a two-stage Mixture-of-Physics-Experts (MoPE) mechanism: Semantic Experts capture macroscopic physical laws, while Refinement Experts enforce token-level dynamic physical constraints. Additionally, we introduce a text-driven physics prior extraction and cross-modal alignment strategy that transfers the physical reasoning capabilities of vision-language models into the generative process. Evaluated on multiple physics-aware video generation benchmarks, our method significantly outperforms state-of-the-art approaches, yielding substantial improvements in realism, dynamic coherence, and physical consistency.

📝 Abstract
Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
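The two-stage MoPE mechanism in the abstract is a gated mixture of experts applied in sequence: Semantic Experts first produce a coarse, semantic-level physics representation, which Refinement Experts then refine at the token level. The sketch below is a minimal pure-Python illustration of that routing pattern only; all class and function names are our own, and the paper's actual experts are learned neural modules, not fixed linear maps.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of gate scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class PhysicsExpert:
    """Toy expert: a fixed linear map over a feature vector (hypothetical
    stand-in for a learned expert network)."""
    def __init__(self, weights):
        self.weights = weights  # one row of weights per output dimension

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]

def mope_layer(x, experts, gate_weights):
    """One MoPE stage: score each expert, softmax the scores, and return
    the probability-weighted sum of expert outputs."""
    scores = [sum(g * v for g, v in zip(gw, x)) for gw in gate_weights]
    probs = softmax(scores)
    out = [0.0] * len(experts[0](x))
    for p, expert in zip(probs, experts):
        out = [o + p * yi for o, yi in zip(out, expert(x))]
    return out

def prophy_forward(token, semantic_experts, refinement_experts, g1, g2):
    """Two-stage pass: Semantic Experts infer coarse physics first, then
    Refinement Experts adjust the result at the token level."""
    h = mope_layer(token, semantic_experts, g1)
    return mope_layer(h, refinement_experts, g2)
```

With uniform gates the output is just the average of the expert outputs at each stage; in the real model the gates are learned, so different tokens can be routed to different physics experts.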
Problem

Research questions and friction points this paper is trying to address.

Generated videos are often physically inconsistent, especially in large-scale or complex dynamic scenes
Generated content is poorly aligned with localized physical cues
The physical reasoning of vision-language models is not exploited during generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Physical Alignment Framework for physics-aware conditioning
Two-stage Mixture-of-Physics-Experts mechanism for fine-grained dynamics
Physical alignment strategy transferring VLM reasoning to refinement experts
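Transferring VLM physical reasoning into the Refinement Experts is typically realized as a feature-alignment (distillation) objective: the generator's physics features are pulled toward frozen VLM features for the same content. The paper does not give its exact loss here, so the sketch below shows one common choice under that assumption, a cosine-similarity alignment loss; the function name is hypothetical.

```python
import math

def cosine_alignment_loss(student_feats, teacher_feats):
    """1 minus the mean cosine similarity between paired feature vectors.
    Minimizing this pulls the student's (Refinement Experts') features
    toward the frozen teacher's (VLM's) features -- a standard
    distillation-style objective, assumed here for illustration."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    sims = [cos(s, t) for s, t in zip(student_feats, teacher_feats)]
    return 1.0 - sum(sims) / len(sims)
```

Because cosine similarity ignores magnitude, this loss aligns feature directions only, which keeps the generator free to scale its own representations.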
👥 Authors
Zijun Wang
Shenzhen Campus of Sun Yat-sen University; Peng Cheng Laboratory
Panwen Hu
Mohamed bin Zayed University of Artificial Intelligence
Jing Wang
Shenzhen Campus of Sun Yat-sen University
Terry Jingchen Zhang
ETH Zurich
(Multimodal) Reasoning · AI Safety · Actionable Interpretability · AI4Science · Astrophysics
Yuhao Cheng
Lenovo Research
Long Chen
Lenovo Research
Yiqiang Yan
Lenovo
Zutao Jiang
Peng Cheng Laboratory
Hanhui Li
Sun Yat-sen University
Deep Learning · Computer Vision
Xiaodan Liang
Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS
Computer Vision · Embodied AI · Machine Learning