๐ค AI Summary
Challenged by difficulties in dynamical modeling, weak state estimation, and sparse-view observations for deformable objects (e.g., ropes, cloth, plush toys) in RGB-D video, this paper proposes an end-to-end neural dynamics framework. Methodologically, it introduces a novel particle-grid hybrid representation: particles capture local deformation, while a 3D voxelized grid ensures global spatial continuity; combined with Gaussian splatting rendering and neural ODEs, it enables action-conditioned 3D video generation and digital twin modeling. For the first time, category-level generalizable dynamics modeling is achieved from only single- or dual-view RGB-D inputsโwithout requiring object-instance priors. Evaluated on diverse soft-body objects, our approach significantly outperforms existing learning-based models and physics simulators, improving motion prediction accuracy by 32% under sparse-view settings, and successfully enabling goal-directed robotic manipulation planning.
๐ Abstract
Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splattings for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects -- such as ropes, cloths, stuffed animals, and paper bags -- from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at https://kywind.github.io/pgnd .