GWM: Towards Scalable Gaussian World Models for Robotic Manipulation

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current image-based world models lack robust modeling of 3D geometry and physical principles, limiting their reliability for robotic manipulation. To address this, we propose the Gaussian World Model (GWM), the first generative world model to incorporate differentiable Gaussian primitives into spatiotemporal scene propagation. GWM unifies a latent diffusion Transformer (DiT) with a 3D variational autoencoder to enable fine-grained, action-conditioned future scene prediction. Trained via self-supervised future prediction, it supports end-to-end joint optimization for both imitation learning and model-based reinforcement learning. Evaluated on simulation and real-robot platforms, GWM significantly outperforms state-of-the-art methods in multi-task manipulation prediction accuracy and policy performance. Moreover, it demonstrates superior data scalability and strong 3D semantic consistency—ensuring coherent geometric and physical reasoning across predicted futures.

📝 Abstract
Training robot policies within a learned world model has become popular due to the inefficiency of real-world interaction. Established image-based world models and policies have shown prior success, but they lack the robust geometric information required for consistent spatial and physical understanding of the three-dimensional world, even when pre-trained on internet-scale video sources. To this end, we propose a novel branch of world model named the Gaussian World Model (GWM) for robotic manipulation, which reconstructs the future state by inferring the propagation of Gaussian primitives under the effect of robot actions. At its core is a latent Diffusion Transformer (DiT) combined with a 3D variational autoencoder, enabling fine-grained scene-level future-state reconstruction with Gaussian Splatting. GWM not only enhances the visual representation of imitation learning agents through self-supervised future prediction training, but also serves as a neural simulator that supports model-based reinforcement learning. Both simulated and real-world experiments show that GWM precisely predicts future scenes conditioned on diverse robot actions, and that it can be further used to train policies that outperform the state of the art by impressive margins, demonstrating the initial data-scaling potential of 3D world models.
Problem

Research questions and friction points this paper is trying to address.

Develops a 3D Gaussian world model for robotic manipulation tasks
Addresses the lack of geometric information in image-based world models
Enables future-state reconstruction through Gaussian primitive propagation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian World Model with Diffusion Transformer
3D variational autoencoder for scene reconstruction
Self-supervised future prediction training
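To make the core idea behind the bullets above concrete, here is a deliberately minimal toy sketch of action-conditioned propagation of Gaussian primitives with a self-supervised future-prediction loss. It is an illustration only, not the paper's method: GWM uses a latent DiT with a 3D VAE and Gaussian Splatting, whereas this sketch substitutes a hypothetical linear transition model; the names (`init_scene`, `propagate`, `W`) and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_GAUSSIANS = 64   # primitives per scene (assumed)
ACTION_DIM = 7     # e.g. 6-DoF end-effector delta + gripper (assumed)

def init_scene(n=N_GAUSSIANS):
    """Random Gaussian primitives: [x, y, z, log_sx, log_sy, log_sz, opacity]."""
    return rng.normal(size=(n, 7))

def propagate(scene, action, W):
    """One action-conditioned transition step.

    W is a hypothetical learned matrix mapping the action to a per-primitive
    residual update, shared across all primitives (stand-in for the DiT)."""
    residual = action @ W          # (ACTION_DIM,) -> (7,) state residual
    return scene + residual        # broadcast over all primitives

def prediction_loss(pred_scene, true_scene):
    """Self-supervised objective: regress the next observed scene."""
    return float(np.mean((pred_scene - true_scene) ** 2))

scene_t = init_scene()
action_t = rng.normal(size=ACTION_DIM)
W = rng.normal(scale=0.01, size=(ACTION_DIM, 7))

scene_t1_pred = propagate(scene_t, action_t, W)
loss = prediction_loss(scene_t1_pred, scene_t)  # placeholder target for the demo
print(scene_t1_pred.shape, loss >= 0.0)
```

In the actual model, the transition operates in the latent space of the 3D VAE via diffusion denoising, and the predicted Gaussians are rendered with Gaussian Splatting so the loss can be taken against observed images; the sketch above only conveys the "propagate primitives under an action, supervise on the future" structure.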
Authors
Guanxing Lu
Tsinghua University
VLA · RL · Robotics · 3D Vision
Baoxiong Jia
Ph.D. in Computer Science, UCLA
Computer Vision · Artificial Intelligence
Puhao Li
Ph.D. Student, Tsinghua University
Computer Vision · Robotics · Machine Learning
Yixin Chen
State Key Laboratory of General Artificial Intelligence, BIGAI
Ziwei Wang
School of Electrical and Electronic Engineering, Nanyang Technological University
Yansong Tang
Tsinghua University
Siyuan Huang
State Key Laboratory of General Artificial Intelligence, BIGAI