🤖 AI Summary
Existing video generation models produce high-fidelity videos but often lack physical plausibility and 3D controllability. To address this, we propose a physics-anchored image-to-video generation framework. At its core is a generative physics network, realized as a diffusion model, that explicitly models multi-material dynamics—including elastic bodies, granular media (e.g., sand), plasticine, and rigid bodies—and synthesizes physically consistent 3D point trajectories that drive controllable video synthesis. A spatiotemporal attention module captures inter-particle interactions, and we jointly optimize trajectory plausibility and visual quality via a composite loss incorporating physics-based constraints. Trained on 550K synthetic samples, our approach surpasses state-of-the-art methods in both physical plausibility and visual fidelity. It enables fine-grained dynamic editing guided by physical parameters (e.g., elasticity, friction) and external forces (e.g., gravity, impact), offering fine control over physically grounded video generation.
📝 Abstract
Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl
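The spatiotemporal attention block above factorizes attention over particles (within a frame, emulating inter-particle interactions) and over frames (per particle, capturing dynamics). A minimal NumPy sketch of this factorization is shown below; the tensor shapes, the use of plain self-attention without learned projections, and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis;
    # leading axes are treated as batch dimensions.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def spatiotemporal_block(x):
    """x: (T, N, D) trajectory features -- T frames, N particles, D dims.
    Spatial attention mixes particles within each frame; temporal
    attention then mixes frames for each particle."""
    # Spatial: attend over the N particles at each time step (batched over T).
    x = x + attention(x, x, x)                # (T, N, D)
    # Temporal: attend over the T frames for each particle (batched over N).
    xt = np.swapaxes(x, 0, 1)                 # (N, T, D)
    xt = xt + attention(xt, xt, xt)
    return np.swapaxes(xt, 0, 1)              # back to (T, N, D)

traj = np.random.randn(8, 64, 16)             # 8 frames, 64 particles
out = spatiotemporal_block(traj)
print(out.shape)                              # (8, 64, 16)
```

In a real trajectory diffusion model, each attention would use learned query/key/value projections and be conditioned on the physics parameters and applied forces; the factorized particle-then-frame ordering shown here is one common design choice for keeping attention cost manageable.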