Improving the Physics of Video Generation with VJEPA-2 Reward Signal

📅 2025-10-22

🤖 AI Summary

Contemporary video generation models achieve high visual fidelity but often produce physically implausible videos, as revealed by the Physics IQ benchmark, which indicates a critical disconnect from physical understanding. To address this, we propose a physics-consistency enhancement method grounded in a self-supervised video world model: for the first time, we integrate VJEPA-2 as a differentiable reward signal into the MAGI-1 generation framework and optimize the generative process via reinforcement learning, enabling implicit acquisition of physical principles. Our approach avoids reliance on manual annotations or explicit physics engines and instead leverages the dynamics priors encoded by self-supervised pretraining. On the Physics IQ benchmark, it improves physical plausibility by roughly 6% over strong baselines. Key contributions are: (1) the first use of VJEPA-2 as a reward for physics-aware video generation; and (2) empirical validation that general-purpose, self-supervised video representations transfer effectively to physical reasoning in generative tasks.

📝 Abstract

This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State-of-the-art video generative models exhibit severely limited physical understanding and often produce implausible videos. The Physics IQ benchmark has shown that visual realism does not imply physics understanding. Yet intuitive physics understanding has been shown to emerge from self-supervised learning (SSL) pretraining on natural videos. In this report, we investigate whether we can leverage SSL-based video world models to improve the physics plausibility of video generative models. In particular, we build on top of the state-of-the-art video generative model MAGI-1 and couple it with the recently introduced Video Joint Embedding Predictive Architecture 2 (VJEPA-2) to guide the generation process. We show that by leveraging VJEPA-2 as a reward signal, we can improve the physics plausibility of state-of-the-art video generative models by ~6%.
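
The abstract does not say how the reward is computed or how it steers generation, so the following is only a minimal sketch of one way a world-model reward could be used, not the report's method. It assumes the reward is the negative latent prediction error of a predictive model, and that the reward is applied by picking the best of several candidate videos. `toy_encode` and `toy_predict` are hypothetical stand-ins for a real encoder and predictor such as VJEPA-2; frames are toy lists of floats.

```python
from typing import Callable, List, Sequence

Frame = List[float]   # toy stand-in for one video frame
Video = List[Frame]   # a video is an ordered list of frames
Latent = List[float]  # embedding of a single frame


def toy_encode(frame: Frame) -> Latent:
    """Hypothetical encoder: identity map in place of a real video encoder."""
    return list(frame)


def toy_predict(context: Sequence[Latent]) -> Latent:
    """Hypothetical predictor: linear extrapolation from the last two latents."""
    prev, last = context[-2], context[-1]
    return [2.0 * b - a for a, b in zip(prev, last)]


def plausibility_reward(
    video: Video,
    encode: Callable[[Frame], Latent] = toy_encode,
    predict: Callable[[Sequence[Latent]], Latent] = toy_predict,
    context_len: int = 2,
) -> float:
    """Assumed reward: negative mean squared error between the predicted and
    the actual latent of each frame, given the preceding context frames."""
    if len(video) <= context_len:
        raise ValueError("video must be longer than the context window")
    latents = [encode(frame) for frame in video]
    errors = []
    for t in range(context_len, len(latents)):
        predicted = predict(latents[t - context_len:t])
        actual = latents[t]
        squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
        errors.append(squared / len(actual))
    # Lower prediction error ("surprise") means a higher reward.
    return -sum(errors) / len(errors)


def select_best(candidates: Sequence[Video],
                reward_fn: Callable[[Video], float] = plausibility_reward) -> Video:
    """Assumed guidance scheme: keep the candidate with the highest reward."""
    return max(candidates, key=reward_fn)
```

In this toy setting, a video whose frames move at constant velocity is perfectly predictable and receives the highest possible reward, while a video with abrupt jumps is penalized. With a real system, the candidates would come from the generator (here, MAGI-1) and the encoder and predictor from the pretrained world model.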

Problem

Research questions and friction points this paper is trying to address.

Improving physics plausibility in video generation

Leveraging SSL-based world models for guidance

Using VJEPA-2 reward to enhance physical understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses VJEPA-2 reward signal for video generation

Couples MAGI-1 model with VJEPA-2 architecture

Leverages SSL-based world models to guide generation
