🤖 AI Summary
Autonomous robots operating in unstructured real-world environments require cross-scene physical reasoning capabilities to enable zero-shot generalization in manipulation planning without retraining. To address this, we propose an end-to-end embodied physical reasoning framework that jointly integrates: (1) 3D Gaussian Splatting for scene reconstruction, (2) SAM-driven object segmentation, (3) LLaVA-guided material and semantic understanding, and (4) differentiable physics simulation (PhysX/Isaac Gym) for joint optimization. Our approach establishes the first unified multimodal model integrating geometry, semantics, material properties, and dynamics—enabling object-centric planning and physics-consistency verification. Evaluated on billiard-style manipulation and quadrotor landing tasks, it achieves sim-to-real zero-shot transfer: real-world success rates improve by 42%, and physics-consistent planning reaches 91.3%.
📝 Abstract
Autonomous robots must reason about the physical consequences of their actions to operate effectively in unstructured, real-world environments. We present Scan, Materialize, Simulate (SMS), a unified framework that combines 3D Gaussian Splatting for accurate scene reconstruction, visual foundation models for semantic segmentation, vision-language models for material property inference, and physics simulation for reliable prediction of action outcomes. By integrating these components, SMS enables generalizable physical reasoning and object-centric planning without the need to re-learn foundational physical dynamics. We empirically validate SMS in a billiards-inspired manipulation task and a challenging quadrotor landing scenario, demonstrating robust performance on both simulated domain transfer and real-world experiments. Our results highlight the potential of bridging differentiable rendering for scene reconstruction, foundation models for semantic understanding, and physics-based simulation to achieve physically grounded robot planning across diverse settings.