🤖 AI Summary
Existing video diffusion models generate visually plausible content but frequently violate Newtonian mechanics, exhibiting phenomena such as object levitation, anomalous acceleration, and inconsistent collisions, which reveals a critical deficiency in physical plausibility. To address this, we propose NewtonRewards, the first measurable-proxy-based, physics-guided post-training framework: it employs optical flow as a velocity proxy and appearance features as a mass proxy, and introduces reward functions enforcing constant acceleration and mass conservation to explicitly model Newtonian dynamics. Crucially, our method requires no human annotations or vision-language model (VLM) feedback; it leverages only frozen auxiliary models to extract physical signals. We also introduce NewtonBench-60K, a large-scale, manually curated physical reasoning benchmark. Evaluation shows substantial improvements in physical plausibility, motion smoothness, and temporal coherence, with strong out-of-distribution generalization.
📝 Abstract
Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws: objects float, accelerations drift, and collisions behave inconsistently, revealing a persistent gap between visual realism and physical realism. We propose $\texttt{NewtonRewards}$, the first physics-grounded post-training framework for video generation based on $\textit{verifiable rewards}$. Instead of relying on human or VLM feedback, $\texttt{NewtonRewards}$ extracts $\textit{measurable proxies}$ from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate $\texttt{NewtonRewards}$ on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, $\texttt{NewtonBench-60K}$. Across all primitives, in both visual and physics metrics, $\texttt{NewtonRewards}$ consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.
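The two rewards described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a frozen optical-flow estimator has produced per-frame flow fields (whose spatial mean serves as the velocity proxy) and a frozen encoder has produced per-frame appearance features (the mass proxy); the function names, tensor shapes, and penalty forms (jerk magnitude for the constant-acceleration constraint, feature drift for mass conservation) are illustrative assumptions.

```python
import numpy as np

def newtonian_kinematic_reward(flows: np.ndarray) -> float:
    """Sketch of a constant-acceleration reward.

    flows: (T, H, W, 2) optical flow from a frozen estimator
    (shapes and reduction are illustrative assumptions).
    """
    v = flows.mean(axis=(1, 2))     # (T, 2): spatial-mean flow as velocity proxy
    a = np.diff(v, axis=0)          # (T-1, 2): finite-difference acceleration
    jerk = np.diff(a, axis=0)       # (T-2, 2): change in acceleration
    # Constant acceleration implies zero jerk; penalize its magnitude.
    return -float(np.mean(np.linalg.norm(jerk, axis=-1)))

def mass_conservation_reward(feats: np.ndarray) -> float:
    """Sketch of a mass-conservation reward that penalizes drift in
    per-frame appearance features (mass proxy), discouraging degenerate
    solutions where content fades or vanishes.

    feats: (T, D) appearance features from a frozen encoder (assumed shape).
    """
    drift = np.diff(feats, axis=0)  # (T-1, D): frame-to-frame feature change
    return -float(np.mean(np.linalg.norm(drift, axis=-1)))
```

Under this sketch, a clip whose velocity proxy grows linearly in time (constant acceleration, as in free fall) attains the maximal kinematic reward of zero, while erratic motion is penalized in proportion to its jerk.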