PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning

📅 2025-10-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video generation models produce visually realistic output but often violate physical laws, which limits their usefulness as "world models". To address this, we propose PhysMaster, built around PhysEncoder, a module that encodes physical priors (such as the relative positions and potential interactions of objects) from the single input image of an image-to-video task and injects them into generation as an extra condition. Because appearance-based supervision does not capture physical correctness, we train PhysEncoder with reinforcement learning with human feedback: feedback from the generation model is used to optimize the physical representation end-to-end with Direct Preference Optimization (DPO). The approach is designed as a generic, plug-in solution. It is validated on a simple proxy task and shown to generalize to wide-ranging physical scenarios.

📝 Abstract

Current video generation models can produce visually realistic videos, but they often fail to adhere to physical laws, which limits their ability to generate physically plausible videos and to serve as "world models". To address this issue, we propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness. Specifically, PhysMaster is based on the image-to-video task, where the model is expected to predict physically plausible dynamics from the input image. Since the input image provides physical priors such as the relative positions and potential interactions of objects in the scene, we devise PhysEncoder to encode physical information from it as an extra condition that injects physical knowledge into the video generation process. The lack of proper supervision on the model's physical performance beyond mere appearance motivates applying reinforcement learning with human feedback to PhysEncoder's physical representation learning, which leverages feedback from generation models to optimize the physical representation with Direct Preference Optimization (DPO) in an end-to-end manner. PhysMaster provides a feasible solution for improving the physics-awareness of PhysEncoder, and thus of video generation, demonstrating its ability on a simple proxy task and its generalizability to wide-ranging physical scenarios. This implies that PhysMaster, which unifies solutions for various physical processes via representation learning in the reinforcement learning paradigm, can act as a generic, plug-in solution for physics-aware video generation and broader applications.
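To make the DPO step concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, written in plain Python. It assumes scalar log-likelihoods are available for a preferred and a rejected sample under both the trained model and a frozen reference model. The function names, argument names, and the default `beta` are illustrative assumptions; the abstract does not say how PhysMaster adapts DPO to a video generation model, so this shows only the generic loss, not the paper's implementation.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_win: float, logp_lose: float,
             ref_logp_win: float, ref_logp_lose: float,
             beta: float = 0.1) -> float:
    """Generic DPO loss for one (preferred, rejected) pair.

    logp_*     : log-likelihoods under the model being trained (assumed inputs)
    ref_logp_* : log-likelihoods under a frozen reference model (assumed inputs)
    beta       : strength of the implicit KL regularization (illustrative default)
    """
    # Implicit reward of each sample: how much more likely the trained
    # model makes it compared with the reference model.
    reward_win = logp_win - ref_logp_win
    reward_lose = logp_lose - ref_logp_lose
    margin = beta * (reward_win - reward_lose)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)
```

When the trained model matches the reference, the margin is zero and the loss equals log 2. The loss falls below that value as the model shifts probability toward the preferred sample relative to the reference, and rises above it when the shift goes toward the rejected sample.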

Problem

Research questions and friction points this paper is trying to address.

Enhancing video generation models' adherence to physical laws

Learning physical representations from images using reinforcement learning

Improving physics-awareness in video generation via human feedback optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning optimizes physical representation learning

PhysEncoder injects physical knowledge into the video generation process

Direct Preference Optimization enables end-to-end representation training

🔎 Similar Papers

Video-Driven Graph Network-Based Simulators