🤖 AI Summary
Existing methods struggle to accurately model physical environments in data-sparse, contact-rich dynamic scenes. This work proposes a differentiable, physics-driven rigid-body world model that, for the first time, employs a unified Gaussian representation to jointly model visual appearance and collision geometry. Integrated with an end-to-end differentiable physics engine, the model learns complex physical dynamics directly from sparse video sequences. The approach supports inference of physical properties and outperforms existing methods in both simulated and real-world scenarios, exhibiting strong generalization. It has also been successfully applied to synthetic data generation and real-time model-predictive control.
📝 Abstract
Developing world models that understand complex physical interactions is essential for advancing robotic planning and simulation. However, existing methods often struggle to accurately model the environment under conditions of data scarcity and complex contact-rich dynamic motion. To address these challenges, we propose ContactGaussian-WM, a differentiable physics-grounded rigid-body world model capable of learning intricate physical laws directly from sparse and contact-rich video sequences. Our framework consists of two core components: (1) a unified Gaussian representation for both visual appearance and collision geometry, and (2) an end-to-end differentiable learning framework that differentiates through a closed-form physics engine to infer physical properties from sparse visual observations. Extensive simulations and real-world evaluations demonstrate that ContactGaussian-WM outperforms state-of-the-art methods in learning complex scenarios, exhibiting robust generalization capabilities. Furthermore, we showcase the practical utility of our framework in downstream applications, including data synthesis and real-time MPC.