🤖 AI Summary
Existing video generation methods often suffer from a lack of physical consistency, manifesting as object drift, implausible collisions, and unrealistic material responses. This work proposes a controllable video generation framework that achieves physically plausible synthesis without relying on simulators or geometric reconstruction during inference. By leveraging a large-scale dataset of physics-simulated videos, the approach combines ControlNet fine-tuning conditioned on pixel-aligned physical attribute maps with differentiable reward optimization guided by a vision-language model (VLM), enabling continuous, interpretable, and precise control over physical properties such as friction and elasticity. Integrating physics-supervised fine-tuning with VLM-based feedback for the first time, the method substantially outperforms strong baselines on the Physics-IQ benchmark, and human evaluations confirm its superior physical realism and controllability in generated videos.
📝 Abstract
Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes, without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.
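The abstract does not say how the pixel-aligned physical property maps are encoded. As a rough illustration only, the sketch below assumes per-object segmentation masks and one scalar value per object and property, and paints each value into a channel at that object's pixels. The function name `build_property_map`, the channel layout, and the zero background are all hypothetical choices, not PhyCo's actual format.

```python
import numpy as np


def build_property_map(masks, properties, names=("friction", "restitution")):
    """Rasterize per-object scalars into a (C, H, W) conditioning map.

    masks:      (N, H, W) boolean array, one mask per object (assumed input).
    properties: list of N dicts mapping property name -> float.
    names:      property channels to emit, in order (hypothetical layout).
    """
    masks = np.asarray(masks, dtype=bool)
    _, height, width = masks.shape
    # Background pixels stay at 0; this is an assumption, not the paper's choice.
    prop_map = np.zeros((len(names), height, width), dtype=np.float32)
    for mask, props in zip(masks, properties):
        for channel, name in enumerate(names):
            # Write this object's scalar at every pixel the object covers.
            prop_map[channel][mask] = props[name]
    return prop_map


# Two toy objects on a 4x4 frame.
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True   # object 0: top-left block
masks[1, 2:, 2:] = True   # object 1: bottom-right block
properties = [
    {"friction": 0.2, "restitution": 0.9},
    {"friction": 0.8, "restitution": 0.1},
]
prop_map = build_property_map(masks, properties)
```

A map like this has the same spatial layout as a video frame, which is what would let a ControlNet-style branch consume it alongside the image features. Varying a scalar such as `friction` for one object changes only that object's pixels in the corresponding channel.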