🤖 AI Summary
This work addresses the limited generalizability of current data-driven physics simulation methods to real-world scenarios, primarily due to the absence of particle-level state annotations in real videos. To overcome this challenge, we propose the first differentiable particle dynamics model that requires no particle-level supervision and can be trained end-to-end directly on unlabeled real videos. Our approach integrates a dense particle representation based on Gaussian splatting, neural dynamics modeling, and rendering-based supervision to jointly learn the evolution of particle positions and orientations, thereby eliminating the need for heuristic sampling strategies. Evaluated on a newly curated dataset comprising approximately 500 diverse real-world videos of object interactions, our method demonstrates robust motion prediction capabilities in complex, realistic settings.
📝 Abstract
Data-driven learning approaches for physics simulation, sometimes referred to as world models, have emerged as promising alternatives to traditional physics simulators due to their differentiable nature. Prior work has demonstrated impressive results in predicting the motions of rigid and non-rigid objects in complex scenes involving multiple interacting bodies. However, these models are typically trained in simulated environments because obtaining perfect state information, such as complete scene point clouds and point correspondences over time, is challenging in real-world settings. This reliance on synthetic data can limit their applicability when the sim-to-real gap is large. In this work, we aim to overcome these limitations by introducing a novel framework for training neural object dynamics models directly from unlabeled real-world videos. Specifically, we propose to learn a particle-based dynamics model compatible with a Gaussian splatting framework, which operates on dense particles derived from Gaussians (i.e., particles with scales and rotations) and predicts their position and rotation changes over time. The model is trained via rendering supervision, enabling learning from real-world videos without requiring particle-level labeled states. Our model operates directly on dense Gaussians without relying on heuristically subsampled anchor points. To enable this study, we also present a real-world dataset consisting of about 500 videos capturing diverse object interactions.
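As a rough illustration of the state update described above, the sketch below applies one step of predicted position and rotation changes to a set of Gaussian particles. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`quat_multiply`, `step_particles`, `photometric_loss`), the `(w, x, y, z)` quaternion convention, the left-multiplied rotation update, and the mean-squared-error image loss are all hypothetical choices made for this example. The neural dynamics network and the differentiable Gaussian splatting renderer are omitted, and the deltas are simply passed in as arrays.

```python
import numpy as np


def quat_multiply(q, r):
    """Hamilton product of quaternions stored as (w, x, y, z), batched over rows."""
    w1, x1, y1, z1 = np.moveaxis(q, -1, 0)
    w2, x2, y2, z2 = np.moveaxis(r, -1, 0)
    return np.stack(
        [
            w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
            w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
            w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
            w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
        ],
        axis=-1,
    )


def step_particles(positions, rotations, delta_pos, delta_rot):
    """Apply one predicted update to N Gaussian particles.

    positions: (N, 3) particle centers
    rotations: (N, 4) unit quaternions, (w, x, y, z)
    delta_pos: (N, 3) predicted position change (assumed to come from a dynamics model)
    delta_rot: (N, 4) predicted rotation change as a quaternion (assumed convention)
    """
    # Normalize the predicted rotation change so it is a valid rotation.
    delta_rot = delta_rot / np.linalg.norm(delta_rot, axis=-1, keepdims=True)
    new_positions = positions + delta_pos
    # Compose the change with the current orientation (left multiplication is an assumption).
    new_rotations = quat_multiply(delta_rot, rotations)
    new_rotations = new_rotations / np.linalg.norm(new_rotations, axis=-1, keepdims=True)
    return new_positions, new_rotations


def photometric_loss(rendered, target):
    """Mean squared error between a rendered frame and a video frame (assumed loss)."""
    return float(np.mean((rendered - target) ** 2))
```

In a rendering-supervised setup of this kind, the updated particles would be rendered with a differentiable Gaussian splatting renderer, and a loss such as `photometric_loss` against the observed video frame would be backpropagated into the dynamics model, so no particle-level labels are needed.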