PhyWorld: Physics-Faithful World Model for Video Generation

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video generation models often lack temporal coherence and adherence to physical laws, which limits their usefulness as high-fidelity simulators for training physical AI. To address this, the work proposes a two-stage post-training approach. First, flow matching fine-tuning improves inter-frame visual and motion consistency. Second, Direct Preference Optimization (DPO) over physics preference pairs explicitly aligns generated dynamics with basic physical principles; this is reportedly the first use of such pairs in video generation. The method reaches an average score of 0.769 on VBench, compared with 0.756 or below for state-of-the-art baselines, and 3.09 on a newly curated physical-faithfulness benchmark, compared with 2.99 for the strongest baseline.
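
To make the first stage concrete, here is a minimal sketch of a flow matching training loss. It assumes the common linear (rectified-flow style) interpolation path between noise and data; the source does not state which path, conditioning scheme, or architecture PhyWorld uses, so `flow_matching_loss`, `velocity_fn`, and the toy latent shapes are all hypothetical.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity.

    x0: noise samples, x1: data samples, both of shape (batch, ...).
    t:  per-sample times in [0, 1], shape (batch,).
    Assumes a linear path x_t = (1 - t) * x0 + t * x1 (an assumption,
    not confirmed by the source).
    """
    t_b = t.reshape(-1, *([1] * (x1.ndim - 1)))  # broadcast t over non-batch axes
    x_t = (1.0 - t_b) * x0 + t_b * x1            # point on the interpolation path
    target = x1 - x0                             # constant velocity along that path
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))

# Toy "video latents": batch of 4 clips, 8 frames, 3 features per frame.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8, 3))
x0 = rng.normal(size=x1.shape)
t = rng.uniform(size=4)

# A model that always predicts zero velocity, and an oracle that predicts the target.
loss_zero = flow_matching_loss(lambda x_t, t: np.zeros_like(x_t), x0, x1, t)
loss_oracle = flow_matching_loss(lambda x_t, t: x1 - x0, x0, x1, t)
```

In a continuation setting, `velocity_fn` would presumably also receive the conditioning frames; that detail is omitted here because the source does not describe it.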

📝 Abstract

World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training. In the first stage, we improve video-to-video continuation with flow matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines. PhyWorld also improves physical plausibility, reaching an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.
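
The second stage can be illustrated with the generic DPO objective applied to one preference pair per sample: a physically plausible ("preferred") continuation and an implausible ("rejected") one. This is a sketch under stated assumptions, not the paper's implementation: the abstract does not say how log-likelihoods are obtained for a video model, and `dpo_loss`, `beta`, and the example values are hypothetical.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss averaged over a batch of preference pairs.

    logp_w / logp_l:         policy log-likelihoods of preferred / rejected samples.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    beta:                    strength of the implicit KL regularization (assumed value).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Illustrative log-likelihoods for two preference pairs (made-up numbers).
ref_w = np.array([-10.0, -12.0])
ref_l = np.array([-11.0, -12.5])

# Policy identical to the reference: the margin is zero, so the loss is log(2).
loss_equal = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# Policy that raises the likelihood of the preferred videos: the loss drops.
loss_better = dpo_loss(ref_w + 1.0, ref_l, ref_w, ref_l)
```

The loss decreases only when the policy favors the preferred sample more strongly than the reference does, which is how preference pairs can steer generation toward physically plausible outputs without a separate reward model.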

Problem

Research questions and friction points this paper is trying to address.

world model

video generation

physical faithfulness

physics simulation

temporal coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

physics-faithful

video generation

world model