🤖 AI Summary
This work addresses the scarcity of real-world interaction data, which limits the scaling of embodied intelligence, by constructing high-fidelity, editable, and physically consistent simulation environments from only multi-view environment videos and off-the-shelf assets. The approach combines 3D Gaussian Splatting (3DGS) reconstruction with generative models that recover a physically realistic representation, and it aligns the scale of the reconstructed scene with the real world using a precision calibration target, without requiring expensive sensors, precise robot calibration, or depth measurements. Vision-Language-Action (VLA) models trained on data generated by this framework achieve strong zero-shot performance on downstream tasks, matching or even surpassing models trained on real-world data.
📝 Abstract
The scalability of embodied intelligence is fundamentally constrained by the scarcity of real-world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics-driven world modeling and simulation framework that enables efficient generation of high-fidelity embodied training data using only multi-view environment videos and off-the-shelf assets. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), capturing fine-grained geometry and appearance from video. We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target, enabling accurate scale alignment between the reconstructed scene and the real world. Together, these components provide a unified, editable, and physically grounded world model. Vision-Language-Action (VLA) models trained on our simulated data achieve strong zero-shot performance on downstream tasks, matching or even surpassing results obtained with real-world data, highlighting the potential of reconstruction-driven world modeling for scalable and practical embodied intelligence training.
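The scale-alignment step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the calibration target or the alignment procedure, so everything here is an assumption: the target is taken to be a square of known edge length whose four corners have already been located in the reconstructed scene's arbitrary units, and the function names `estimate_metric_scale` and `rescale_gaussians` are hypothetical, not part of the authors' framework.

```python
import numpy as np


def estimate_metric_scale(corners_recon, edge_length_m):
    """Return the factor that maps reconstruction units to meters.

    Assumption: corners_recon holds the four corners of a square target,
    listed in order around the square, in reconstruction coordinates.
    """
    corners = np.asarray(corners_recon, dtype=float)
    # Edge vectors between consecutive corners, wrapping last -> first.
    edges = np.roll(corners, -1, axis=0) - corners
    mean_edge = np.linalg.norm(edges, axis=1).mean()
    return edge_length_m / mean_edge


def rescale_gaussians(means, scales, s):
    """Apply a uniform scale s to Gaussian centers and linear per-axis extents.

    Assumption: scales are stored linearly (not as logarithms).
    """
    means = np.asarray(means, dtype=float)
    scales = np.asarray(scales, dtype=float)
    return means * s, scales * s


# Hypothetical target: a square with 2.0-unit sides in the reconstruction
# that measures 0.10 m in the real world.
corners = [[0, 0, 0], [2, 0, 0], [2, 2, 0], [0, 2, 0]]
s = estimate_metric_scale(corners, edge_length_m=0.10)
metric_means, metric_scales = rescale_gaussians(
    means=[[2.0, 0.0, 0.0]], scales=[[0.4, 0.4, 0.2]], s=s
)
```

Averaging over all four edges, rather than using a single pair of corners, reduces sensitivity to error in any one corner location. A uniform scale leaves Gaussian orientations and colors unchanged, so only centers and extents need updating.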