GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) policies struggle with high-precision physical manipulation because action-imitation training provides little explicit supervision on 3D geometry, dense visual structure, and temporal dynamics. This work proposes GaussianDream, a feed-forward 3D Gaussian world-model plug-in that, during training, jointly reconstructs the current 3D Gaussian state and predicts future ones. The resulting spatio-temporal supervision signals are decodable into dense RGB, depth, and pseudo 3D scene flow. The method requires neither rendering nor planning at test time; only a compact spatio-temporal prefix is retained to condition action generation. It reports average success rates of 98.4% on LIBERO, 52.6% on RoboCasa Human-50, and 50.0% on real-robot tasks, results the authors describe as strong and highly competitive.

📝 Abstract

Vision-language-action (VLA) policies have advanced language-conditioned robotic manipulation by transferring semantic priors from pretrained vision-language models to action generation. Yet, standard action-imitation training often provides limited explicit supervision for 3D geometry, dense visual structure, and short-horizon environment evolution, which are critical for physically precise manipulation. We introduce GaussianDream, a feed-forward 3D Gaussian world-model plug-in that turns robot trajectories into structured spatial-temporal supervision. The key idea is to couple current Gaussian reconstruction with horizon-conditioned future Gaussian prediction during training, forcing a compact spatio-temporal prefix to be decodable into renderable 3D Gaussian states. This enables dense RGB rendering, depth, and pseudo 3D scene-flow supervision without requiring test-time Gaussian decoding. At inference, GaussianDream discards all auxiliary decoding heads and retains only the learned prefix to condition action generation, avoiding rendering, video rollout, or additional planning during closed-loop control. Experiments on LIBERO, RoboCasa Human-50, and real-robot tasks demonstrate strong and highly competitive performance, achieving 98.4% average success on LIBERO, 52.6% on RoboCasa Human-50, and 50.0% in real-world evaluation.
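The train/inference asymmetry described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyPrefixPolicy`, the `AuxWeights` loss weights, the layer sizes, and the L1 losses are all hypothetical, and the auxiliary heads here map the prefix straight to small target vectors instead of decoding and rendering 3D Gaussians. The sketch only shows the structural idea that auxiliary RGB, depth, and flow heads shape the prefix during training and can be dropped afterwards without changing the action path.

```python
import random
from dataclasses import dataclass


@dataclass
class AuxWeights:
    # Hypothetical loss weights; the source does not state any values.
    rgb: float = 1.0
    depth: float = 1.0
    flow: float = 1.0


def l1(pred, target):
    """Mean absolute error between two equal-length lists of floats."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def linear(weights, x):
    """Apply a bias-free dense layer given as a list of weight rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ToyPrefixPolicy:
    """Toy stand-in: observation -> compact prefix -> action.

    Auxiliary heads read the same prefix, but only inside training_loss.
    """

    def __init__(self, obs_dim=6, prefix_dim=4, action_dim=2, aux_dim=3, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        self.encoder = init(prefix_dim, obs_dim)
        self.action_head = init(action_dim, prefix_dim)
        # Assumed auxiliary targets, named after the supervision in the abstract.
        self.aux_heads = {
            name: init(aux_dim, prefix_dim) for name in ("rgb", "depth", "flow")
        }

    def prefix(self, obs):
        return linear(self.encoder, obs)

    def act(self, obs):
        # Inference path: touches only the encoder and the action head.
        return linear(self.action_head, self.prefix(obs))

    def training_loss(self, obs, expert_action, aux_targets, weights=AuxWeights()):
        if self.aux_heads is None:
            raise RuntimeError("auxiliary heads were discarded")
        z = self.prefix(obs)
        # Action-imitation term plus weighted auxiliary terms on the same prefix.
        loss = l1(linear(self.action_head, z), expert_action)
        for name, head in self.aux_heads.items():
            loss += getattr(weights, name) * l1(linear(head, z), aux_targets[name])
        return loss

    def discard_aux_heads(self):
        # Mirrors dropping the decoding heads before deployment.
        self.aux_heads = None


policy = ToyPrefixPolicy()
obs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
targets = {"rgb": [0.0] * 3, "depth": [0.0] * 3, "flow": [0.0] * 3}
train_loss = policy.training_loss(obs, [0.0, 0.0], targets)
action_before = policy.act(obs)
policy.discard_aux_heads()
action_after = policy.act(obs)
```

In this toy setup the action computed after `discard_aux_heads()` is identical to the one computed before, because `act` never reads the auxiliary heads. In a real system the benefit would come from gradients of the auxiliary losses flowing into the shared encoder during training, which this sketch does not perform.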

Problem

Research questions and friction points this paper is trying to address.

3D geometry

dense visual structure

environment evolution

robotic manipulation

spatio-temporal supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting

World Model

Vision-Language-Action

Spatio-Temporal Prediction

Robot Manipulation
