Learning to Generate Object Interactions with Physics-Guided Video Diffusion

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models struggle to produce physically plausible object interactions and lack physics-grounded control mechanisms. This paper proposes KineMask, a physics-guided video diffusion model (VDM) trained with a two-stage strategy that gradually removes future motion supervision provided through object masks, enabling transfer from synthetic training scenes to real-world imagery. Given a single input image and a specified initial object velocity, the model generates videos with inferred motions and future object interactions. KineMask further integrates low-level rigid-body motion control with high-level text conditioning via predictive scene descriptions, injecting explicit physical priors into the VDM without increasing model size. Experiments demonstrate significant improvements in the physical plausibility and dynamic fidelity of generated interaction videos, and ablations confirm the complementary roles of motion control and textual conditioning, enabling controllable synthesis of complex dynamical phenomena, including collisions, rolling, and bouncing, while preserving semantic alignment.
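A minimal PyTorch-style sketch of one training step under this two-stage recipe is given below. All helper names and tensor layouts (`vdm.add_noise`, `vdm.num_timesteps`, the batch fields) are assumptions for illustration, not the authors' released code: stage 1 supervises with object masks on every frame, while stage 2 drops future-frame masks so the model must infer the motion itself.

```python
import torch

def training_step(vdm, batch, stage, mask_drop_prob=0.5):
    """One denoising training step (sketch; names are assumptions).

    batch contains:
      frames   - (B, T, C, H, W) target video clip
      masks    - (B, T, 1, H, W) per-frame object masks from the simulator
      velocity - (B, 2) initial object velocity condition
      image    - (B, C, H, W) first frame used as the image condition
    """
    frames, masks = batch["frames"], batch["masks"]

    if stage == 1:
        # Stage 1: full future-motion supervision. The model sees
        # object masks for every frame of the clip.
        cond_masks = masks
    else:
        # Stage 2: gradually remove future supervision. Keep the mask
        # on frame 0 and drop later frames with some probability, so
        # the model learns to infer the future motion on its own.
        keep = torch.rand(masks.shape[:2]) > mask_drop_prob  # (B, T)
        keep[:, 0] = True                                    # always keep frame 0
        cond_masks = masks * keep[..., None, None, None]

    # Standard diffusion objective: predict the noise added to the clip.
    noise = torch.randn_like(frames)
    t = torch.randint(0, vdm.num_timesteps, (frames.shape[0],))
    noisy = vdm.add_noise(frames, noise, t)

    pred = vdm(noisy, t,
               image=batch["image"],
               velocity=batch["velocity"],
               masks=cond_masks)
    return torch.nn.functional.mse_loss(pred, noise)
```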

📝 Abstract
Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.
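The "single image and a specified object velocity" condition could be realized in several ways; the sketch below shows one plausible encoding, a first-frame flow map painted over the object mask. This is an illustrative assumption, not the paper's documented interface.

```python
import torch

def velocity_condition(mask0, velocity, num_frames):
    """Rasterize an initial velocity as a conditioning volume (sketch).

    mask0    - (1, H, W) binary mask of the controlled object at frame 0
    velocity - (2,) initial (vx, vy) in pixels per frame
    returns  - (T, 2, H, W) conditioning volume: the velocity painted
               over the object region at frame 0, zero elsewhere
    """
    _, h, w = mask0.shape
    cond = torch.zeros(num_frames, 2, h, w)
    cond[0] = mask0 * velocity[:, None, None]  # broadcast (2,) -> (2, H, W)
    return cond
```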
Problem

Research questions and friction points this paper is trying to address.

Generating physically plausible object interactions in video
Providing physics-grounded control mechanisms, which current video generation models lack
Enabling realistic rigid-body control, interactions, and effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Physics-guided video diffusion for controllable object interactions
Two-stage training that gradually removes future motion supervision provided via object masks
Integration of low-level motion control with high-level text conditioning (see the sampling sketch below)
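The sketch below illustrates how the two conditioning levels might be combined at sampling time: classifier-free guidance is applied over the text condition, while the motion condition (image and velocity) is kept in both branches so the physical control is never guided away. All method names (`vdm.denoise`, `vdm.scheduler.step`, `vdm.decode`) and the latent shape are hypothetical stand-ins, not a confirmed API.

```python
import torch

@torch.no_grad()
def sample_with_dual_conditioning(vdm, image, velocity, text_emb,
                                  num_steps=50, guidance_scale=7.5):
    """Combine low-level motion control with high-level text
    conditioning (e.g. a predictive scene description such as
    "the ball rolls and knocks over the box"). Sketch only."""
    x = torch.randn(1, vdm.num_frames, 4, 64, 64)  # assumed latent video shape
    for t in vdm.scheduler.timesteps(num_steps):
        # Guidance acts only on the text branch; image and velocity
        # conditions appear in both predictions.
        eps_cond = vdm.denoise(x, t, image=image, velocity=velocity, text=text_emb)
        eps_uncond = vdm.denoise(x, t, image=image, velocity=velocity, text=None)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x = vdm.scheduler.step(eps, t, x)
    return vdm.decode(x)
```

Keeping the motion condition out of the guidance difference is a deliberate choice in this sketch: it lets the text steer high-level semantics while the specified velocity stays a hard constraint rather than a preference that guidance can amplify or suppress.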