GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

📅 2026-05-20

🤖 AI Summary
Standard action-imitation training gives vision-language-action (VLA) policies limited explicit supervision on 3D geometry, dense visual structure, and short-horizon temporal dynamics, which hampers high-precision physical manipulation. This work proposes GaussianDream, a feed-forward 3D Gaussian world-model plug-in that, during training, jointly reconstructs the current 3D Gaussian state and predicts future ones. These states provide spatio-temporal supervision signals that can be decoded into dense RGB, depth, and pseudo-3D scene flow. At test time the method requires neither rendering nor planning; only a compact spatio-temporal prefix conditions action generation. It reports success rates of 98.4% on LIBERO, 52.6% on RoboCasa Human-50, and 50.0% on real-world robot tasks, which the abstract describes as strong and highly competitive.
📝 Abstract
Vision-language-action (VLA) policies have advanced language-conditioned robotic manipulation by transferring semantic priors from pretrained vision-language models to action generation. Yet, standard action-imitation training often provides limited explicit supervision for 3D geometry, dense visual structure, and short-horizon environment evolution, which are critical for physically precise manipulation. We introduce GaussianDream, a feed-forward 3D Gaussian world-model plug-in that turns robot trajectories into structured spatial-temporal supervision. The key idea is to couple current Gaussian reconstruction with horizon-conditioned future Gaussian prediction during training, forcing a compact spatio-temporal prefix to be decodable into renderable 3D Gaussian states. This enables dense RGB rendering, depth, and pseudo 3D scene-flow supervision without requiring test-time Gaussian decoding. At inference, GaussianDream discards all auxiliary decoding heads and retains only the learned prefix to condition action generation, avoiding rendering, video rollout, or additional planning during closed-loop control. Experiments on LIBERO, RoboCasa Human-50, and real-robot tasks demonstrate strong and highly competitive performance, achieving 98.4% average success on LIBERO, 52.6% on RoboCasa Human-50, and 50.0% in real-world evaluation.
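The abstract's central pattern, auxiliary decoding heads that shape a shared prefix during training and are dropped at inference, can be illustrated with a toy sketch. This is not the authors' implementation: `PrefixPolicy`, `mse`, the `aux_weight` value, and the lambda "networks" are hypothetical stand-ins, and plain lists of floats replace images, Gaussian states, and actions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


@dataclass
class PrefixPolicy:
    """Toy policy: an encoder yields a prefix, and heads decode from it.

    All callables are hypothetical placeholders, not the paper's networks.
    """
    encode: Callable[[Vector], Vector]                 # observation -> prefix
    action_head: Callable[[Vector], Vector]            # prefix -> action
    aux_heads: Dict[str, Callable[[Vector], Vector]]   # training-only decoders

    def training_loss(self, obs: Vector, action_target: Vector,
                      aux_targets: Dict[str, Vector],
                      aux_weight: float = 0.5) -> float:
        prefix = self.encode(obs)
        loss = mse(self.action_head(prefix), action_target)
        # Auxiliary terms push the prefix to stay decodable into dense signals
        # (stand-ins here for RGB, depth, or scene-flow targets).
        for name, head in self.aux_heads.items():
            loss += aux_weight * mse(head(prefix), aux_targets[name])
        return loss

    def act(self, obs: Vector) -> Vector:
        # Inference path: only the encoder and action head run.
        return self.action_head(self.encode(obs))


# Count how often the auxiliary head is evaluated.
calls = {"aux": 0}


def counting_depth_head(prefix: Vector) -> Vector:
    calls["aux"] += 1
    return [2.0 * x for x in prefix]


policy = PrefixPolicy(
    encode=lambda obs: [x + 1.0 for x in obs],
    action_head=lambda prefix: [sum(prefix)],
    aux_heads={"depth": counting_depth_head},
)

loss = policy.training_loss([0.0, 1.0], [3.0], {"depth": [2.0, 3.0]})
action = policy.act([0.0, 1.0])
print(loss, action, calls["aux"])
```

In this toy setup the auxiliary head is evaluated once, inside `training_loss`, and never inside `act`, which mirrors the claim that no decoding or rendering is needed during closed-loop control.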
Problem

Research questions and friction points this paper is trying to address.

3D geometry
dense visual structure
environment evolution
robotic manipulation
spatio-temporal supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting
World Model
Vision-Language-Action
Spatio-Temporal Prediction
Robot Manipulation
Zijian Zhang
Tuojing Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences
Yuqing Jiang
Tuojing Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences
Qian Cheng
University of Leeds
sustainable development, colour science
Si Liu
Fred Hutchinson Cancer Center
Genomics, Biostatistics, Anomaly Detection, Open Category Detection
Ding Zhao
Carnegie Mellon University
Trustworthy AI, AI safety, reinforcement learning, autonomous vehicles, robotics
Ping Luo
National University of Defense Technology
distributed computing
Weitao Zhou
Tsinghua University
Autonomous Driving, Reinforcement Learning
Haibao Yu
The University of Hong Kong, Tuojing Intelligence