Improving the Physics of Video Generation with VJEPA-2 Reward Signal

📅 2025-10-22
🤖 AI Summary
Contemporary video generation models achieve high visual fidelity but often produce physically implausible videos; the Physics IQ benchmark shows that visual realism does not imply physical understanding. To address this, the report couples the MAGI-1 video generation model with VJEPA-2, a self-supervised video world model, and uses VJEPA-2 as a reward signal to guide the generation process. The approach relies on the dynamics priors learned during self-supervised pretraining rather than on manual annotations or explicit physics engines. On the Physics IQ benchmark, it improves physics plausibility by about 6%. Key contributions are: (1) the use of VJEPA-2 as a reward signal for physics-aware video generation; and (2) empirical evidence that general-purpose, self-supervised video representations transfer to physical plausibility in generative tasks.

📝 Abstract
This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State-of-the-art video generative models exhibit severely limited physical understanding and often produce implausible videos. The Physics IQ benchmark has shown that visual realism does not imply physics understanding. Yet intuitive physics understanding has been shown to emerge from self-supervised learning (SSL) pretraining on natural videos. In this report, we investigate whether we can leverage SSL-based video world models to improve the physics plausibility of video generative models. In particular, we build on top of the state-of-the-art video generative model MAGI-1 and couple it with the recently introduced Video Joint Embedding Predictive Architecture 2 (VJEPA-2) to guide the generation process. We show that by leveraging VJEPA-2 as a reward signal, we can improve the physics plausibility of state-of-the-art video generative models by ~6%.
Problem

Research questions and friction points this paper is trying to address.

Improving physics plausibility in video generation
Leveraging SSL-based world models for guidance
Using VJEPA-2 reward to enhance physical understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses VJEPA-2 reward signal for video generation
Couples MAGI-1 model with VJEPA-2 architecture
Leverages SSL-based world models to guide generation
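
This page does not say how the reward is computed or how it is applied to MAGI-1. As a rough illustration of the general idea of using a predictive world model as a reward, the sketch below scores candidate continuations by how well a latent predictor anticipates them and keeps the highest-scoring one (best-of-N selection). Everything in it is an assumption made for illustration: `encode`, `predict_future`, `plausibility_reward`, `select_best`, and `constant_clip` are hypothetical toy stand-ins, not VJEPA-2 or MAGI-1 APIs, and the selection strategy is not necessarily the one used in the report.

```python
import numpy as np


def encode(frames):
    """Hypothetical stand-in for a frozen video encoder.

    frames: array of shape (T, H, W). Returns one 2-D embedding per frame
    (mean and standard deviation of pixel intensity).
    """
    flat = frames.reshape(frames.shape[0], -1)
    return np.stack([flat.mean(axis=1), flat.std(axis=1)], axis=1)


def predict_future(context_emb, horizon):
    """Hypothetical stand-in for a latent predictor.

    Linearly extrapolates the last observed step in embedding space.
    """
    velocity = context_emb[-1] - context_emb[-2]
    steps = np.arange(1, horizon + 1)[:, None]
    return context_emb[-1] + steps * velocity


def plausibility_reward(context, candidate):
    """Negative mean squared error between predicted and observed embeddings.

    A higher value means the candidate is less 'surprising' to the predictor.
    """
    predicted = predict_future(encode(context), candidate.shape[0])
    observed = encode(candidate)
    return -float(np.mean((predicted - observed) ** 2))


def select_best(context, candidates):
    """Best-of-N selection: return the index of the highest-reward candidate."""
    rewards = [plausibility_reward(context, c) for c in candidates]
    return int(np.argmax(rewards)), rewards


def constant_clip(levels, size=4):
    """Toy clip: one uniform frame per brightness level."""
    return np.stack([np.full((size, size), v) for v in levels])


# Toy data: brightness rises steadily in the context clip.
context = constant_clip([0.0, 0.1, 0.2, 0.3])
smooth = constant_clip([0.4, 0.5, 0.6])   # continues the trend
erratic = constant_clip([0.9, 0.1, 0.7])  # breaks the trend

best, rewards = select_best(context, [smooth, erratic])
print("selected candidate:", best)
```

In a real system the toy encoder and predictor would be replaced by a pretrained world model, and the reward could equally be used to steer sampling rather than to rank finished candidates; the sketch only shows the scoring-and-selection pattern.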