🤖 AI Summary
Existing video diffusion models generate visually plausible content but frequently violate Newtonian mechanics, exhibiting phenomena such as object levitation, anomalous acceleration, and inconsistent collisions, which reveals a critical deficiency in physical plausibility. To address this, we propose NewtonRewards, the first measurable-proxy-based, physics-guided post-training framework: it employs optical flow as a velocity proxy and appearance features as a mass proxy, and introduces reward functions enforcing constant acceleration and mass conservation to explicitly model Newtonian dynamics. Crucially, our method requires no human annotations or vision-language model (VLM) feedback; it leverages only frozen auxiliary models to extract physical signals. We also introduce NewtonBench-60K, a large-scale, manually curated physical reasoning benchmark. Evaluation shows substantial improvements in physical plausibility, motion smoothness, and temporal coherence, with strong out-of-distribution generalization.
📝 Abstract
Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws: objects float, accelerations drift, and collisions behave inconsistently, revealing a persistent gap between visual realism and physical realism. We propose $\texttt{NewtonRewards}$, the first physics-grounded post-training framework for video generation based on $\textit{verifiable rewards}$. Instead of relying on human or VLM feedback, $\texttt{NewtonRewards}$ extracts $\textit{measurable proxies}$ from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate $\texttt{NewtonRewards}$ on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, $\texttt{NewtonBench-60K}$. Across all primitives, in both visual and physics metrics, $\texttt{NewtonRewards}$ consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.
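The two rewards described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a frozen optical-flow estimator has produced per-frame flow fields (whose spatial mean serves as the velocity proxy) and a frozen encoder has produced per-frame appearance features (the mass proxy); the function names, tensor shapes, and penalty forms (jerk magnitude for the constant-acceleration constraint, feature drift for mass conservation) are illustrative assumptions.

```python
import numpy as np

def newtonian_kinematic_reward(flows: np.ndarray) -> float:
    """Sketch of a constant-acceleration reward.

    flows: (T, H, W, 2) optical flow from a frozen estimator
    (shapes and reduction are illustrative assumptions).
    """
    v = flows.mean(axis=(1, 2))     # (T, 2): spatial-mean flow as velocity proxy
    a = np.diff(v, axis=0)          # (T-1, 2): finite-difference acceleration
    jerk = np.diff(a, axis=0)       # (T-2, 2): change in acceleration
    # Constant acceleration implies zero jerk; penalize its magnitude.
    return -float(np.mean(np.linalg.norm(jerk, axis=-1)))

def mass_conservation_reward(feats: np.ndarray) -> float:
    """Sketch of a mass-conservation reward that penalizes drift in
    per-frame appearance features (mass proxy), discouraging degenerate
    solutions where content fades or vanishes.

    feats: (T, D) appearance features from a frozen encoder (assumed shape).
    """
    drift = np.diff(feats, axis=0)  # (T-1, D): frame-to-frame feature change
    return -float(np.mean(np.linalg.norm(drift, axis=-1)))
```

Under this sketch, a clip whose velocity proxy grows linearly in time (constant acceleration, as in free fall) attains the maximal kinematic reward of zero, while erratic motion is penalized in proportion to its jerk.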