Inference-time Physics Alignment of Video Generative Models with Latent World Models

📅 2026-01-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video generation models often violate fundamental physical laws, which limits their practical use. This work frames improving physical plausibility as an inference-time alignment problem and proposes, for the first time, a training-free approach that uses the physics prior of a latent world model (VJEPA-2) as a reward signal. By combining multi-trajectory denoising with compute-aware expansion strategies, the method steers generation toward physically coherent outcomes. Evaluated on image-conditioned, multiframe-conditioned, and text-conditioned video generation, the approach substantially improves physical realism, as confirmed by a human preference study. Notably, it won first place in the ICCV 2025 Perception Test PhysicsIQ Challenge with a final score of 62.64%, surpassing the previous state of the art by 7.42%.

📝 Abstract

State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving the physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, which allows test-time compute to be scaled for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, as validated by a human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve the physics plausibility of video generation, beyond this specific instantiation or parameterization.
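To make the idea of reward-guided search over multiple denoising trajectories concrete, here is a minimal toy sketch in Python. It is not the authors' WMReward implementation: the denoiser, the reward, the prune-and-refill rule, and every name in it (`toy_denoise_step`, `toy_reward`, `search_trajectories`, `keep`) are assumptions made up for illustration. A real system would presumably run a video diffusion model and score candidates with a latent world model such as VJEPA-2; here a short list of floats stands in for a video latent, and a smoothness score stands in for the physics reward.

```python
import random
from typing import Callable, List

Sample = List[float]  # toy stand-in for a video latent (one value per "frame")


def toy_denoise_step(x: Sample, rng: random.Random, noise: float) -> Sample:
    """Hypothetical denoiser: shrink values toward 0 and add fresh noise."""
    return [0.5 * v + rng.gauss(0.0, noise) for v in x]


def toy_reward(x: Sample) -> float:
    """Hypothetical reward: higher (closer to 0) when consecutive values
    change smoothly. A real system would use a world-model score instead."""
    return -sum((b - a) ** 2 for a, b in zip(x, x[1:]))


def search_trajectories(
    n_candidates: int,
    n_steps: int,
    keep: int,
    reward: Callable[[Sample], float],
    seed: int = 0,
    length: int = 8,
) -> Sample:
    """Run several toy denoising trajectories in parallel; after each step,
    keep the `keep` highest-reward candidates and refill the pool from them."""
    if not 1 <= keep <= n_candidates:
        raise ValueError("keep must be between 1 and n_candidates")
    rng = random.Random(seed)
    # Start every candidate from independent Gaussian noise.
    candidates = [
        [rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(n_candidates)
    ]
    for step in range(n_steps):
        # Noise level anneals linearly and reaches 0 at the final step.
        noise = 1.0 - (step + 1) / n_steps
        candidates = [toy_denoise_step(x, rng, noise) for x in candidates]
        # Rank by reward, keep the best, and copy survivors to refill the pool.
        candidates.sort(key=reward, reverse=True)
        survivors = candidates[:keep]
        candidates = [list(survivors[i % keep]) for i in range(n_candidates)]
    return max(candidates, key=reward)


if __name__ == "__main__":
    best = search_trajectories(n_candidates=8, n_steps=5, keep=2, reward=toy_reward)
    print("reward of selected sample:", toy_reward(best))
```

In this sketch, raising `n_candidates` is the knob that spends more test-time compute: more trajectories are explored per step before the reward prunes them. How the paper actually allocates compute across trajectories is not specified in the abstract, so the prune-and-refill rule above should be read as one plausible choice, not a description of the method.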

Problem

Research questions and friction points this paper is trying to address.

video generative models

physics plausibility

inference-time alignment

latent world models

physics violation

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time alignment

physics plausibility

latent world model