🤖 AI Summary
This work addresses intuitive physical reasoning for embodied robots by proposing a method that generates diverse, stable, stackable 3D structures from a single target silhouette image. Conventional approaches rely on explicit physics simulation or geometric modeling; we instead introduce, for the first time, a conditional diffusion model for stable structure synthesis, enabling end-to-end generation of block configurations. Our method combines silhouette encoding, implicit stability constraints, and a discrete structural representation, and learns the mapping from a silhouette to physically viable stacking configurations without requiring real-valued physical gradients or explicit dynamics modeling. Evaluations in simulation and on a real robotic manipulator show strong generalization: generated structures achieve an assembly success rate above 85%, improving robots' ability to reason about physics and to build structures autonomously from visual input alone.
📝 Abstract
Humans naturally acquire intuition about how rigid objects interact, and whether they remain stable, by observing and interacting with the world. This intuition governs how we routinely arrange objects in our environment, allowing us to build complex structures from simple, everyday objects. Robotic agents, on the other hand, traditionally require an explicit model of the world that includes the detailed geometry of each object and an analytical model of the environment dynamics; such models are difficult to scale and preclude generalization. Robots would instead benefit from an awareness of intuitive physics that lets them reason similarly about stable interactions among objects in their environment. Toward that goal, we propose StackGen, a diffusion model that generates diverse, stable configurations of building blocks matching a target silhouette. To demonstrate the method's capability, we evaluate it in simulation and deploy it in a real-world setting, using a robotic arm to assemble structures generated by the model.