Mirage2Matter: A Physically Grounded Gaussian World Model from Video

📅 2026-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of real-world interaction data that constrains the scaling of embodied intelligence by proposing a method that constructs high-fidelity, editable, and physically consistent simulation environments from ordinary multi-view videos alone. The approach integrates 3D Gaussian Splatting with generative physics models and achieves precise scale alignment between simulation and reality via a precision calibration target, eliminating the need for depth sensors or complex robot calibration procedures. Vision-Language-Action (VLA) models trained on data generated by this framework demonstrate strong zero-shot performance on downstream tasks, matching or even surpassing models trained on real-world data.

📝 Abstract
The scalability of embodied intelligence is fundamentally constrained by the scarcity of real-world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics-driven world modeling and simulation framework that enables efficient generation of high-fidelity embodied training data using only multi-view environment videos and off-the-shelf assets. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), seamlessly capturing fine-grained geometry and appearance from video. We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target, enabling accurate scale alignment between the reconstructed scene and the real world. Together, these components provide a unified, editable, and physically grounded world model. Vision Language Action (VLA) models trained on our simulated data achieve strong zero-shot performance on downstream tasks, matching or even surpassing results obtained with real-world data, highlighting the potential of reconstruction-driven world modeling for scalable and practical embodied intelligence training.
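The scale-alignment step described in the abstract can be illustrated with a minimal sketch: a calibration target of known physical size, detected in the 3DGS reconstruction, yields a uniform factor that converts reconstruction units to metres. This is an illustrative reading, not the authors' code; the square-target geometry, function names, and values are hypothetical.

```python
import numpy as np

def scale_factor(target_corners_recon: np.ndarray, edge_length_m: float) -> float:
    """Metres per reconstruction unit, estimated from the mean edge length
    of a square calibration target's corners (4x3 array, ordered around
    the square). Hypothetical helper, for illustration only."""
    # Vector from each corner to the next (wrapping around), then edge lengths.
    edges = np.linalg.norm(
        np.roll(target_corners_recon, -1, axis=0) - target_corners_recon, axis=1
    )
    return edge_length_m / edges.mean()

def align_gaussians(means: np.ndarray, scales: np.ndarray, s: float):
    """Apply a uniform metric rescale to Gaussian centres and extents."""
    return means * s, scales * s

# Example: a 0.2 m target reconstructed with edge length 2.0 units -> s = 0.1.
corners = np.array([[0, 0, 0], [2, 0, 0], [2, 2, 0], [0, 2, 0]], dtype=float)
s = scale_factor(corners, 0.2)
means, scales = align_gaussians(np.ones((5, 3)), np.full((5, 3), 0.5), s)
```

A uniform similarity transform like this preserves the relative geometry of the reconstruction while grounding it in real-world units, which is what downstream physics simulation requires.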
Problem

Research questions and friction points this paper is trying to address.

embodied intelligence
world modeling
simulation
3D reconstruction
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting
physically grounded simulation
world modeling
embodied intelligence
zero-shot transfer
Zhengqing Gao
MBZUAI, AI2Robotics, The University of Sydney, Carnegie Mellon University, The University of Melbourne
Ziwen Li
MBZUAI, AI2Robotics, The University of Sydney, Carnegie Mellon University, The University of Melbourne
Xin Wang
MBZUAI, AI2Robotics, The University of Sydney, Carnegie Mellon University, The University of Melbourne
Jiaxin Huang
Washington University in St. Louis
Natural Language Processing, Large Language Models, Machine Learning
Zhenyang Ren
MBZUAI, AI2Robotics, The University of Sydney, Carnegie Mellon University, The University of Melbourne
Mingkai Shao
MBZUAI, AI2Robotics, The University of Sydney, Carnegie Mellon University, The University of Melbourne
Hanlue Zhang
MBZUAI, AI2Robotics, The University of Sydney, Carnegie Mellon University, The University of Melbourne
Tianyu Huang
MBZUAI, AI2Robotics, The University of Sydney, Carnegie Mellon University, The University of Melbourne
Yongkang Cheng
Mohamed bin Zayed University of Artificial Intelligence
Motion Capture, Motion Generation, Embodied AI
Yandong Guo
College of Electronic Science and Engineering, Nanjing University of Posts and Telecommunications
Electronic transport in nanostructures
Runqi Lin
PhD student, The University of Sydney
Machine Learning, AI Safety, Trustworthy ML, Adversarial Robustness
Yuanyuan Wang
MBZUAI, AI2Robotics, The University of Sydney, Carnegie Mellon University, The University of Melbourne
Tongliang Liu
Director, Sydney AI Centre, University of Sydney & Mohamed bin Zayed University of AI
Machine Learning, Learning with Noisy Labels, Trustworthy Machine Learning
Kun Zhang
Carnegie Mellon University & Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Causal discovery and inference, machine learning, representation learning
Mingming Gong
University of Melbourne & Mohamed bin Zayed University of Artificial Intelligence
Causal Inference, Machine Learning, Computer Vision