🤖 AI Summary
This work addresses the challenge of enabling robots to accurately predict how objects in 3D environments respond to low-level actions from a single in-the-wild image, without requiring demonstrations. The authors propose PointWorld, the first large-scale pre-trained 3D world model that unifies states and actions into geometry-aware 3D point flows, enabling end-to-end prediction of pixel-aligned 3D displacements within a shared spatial representation. By integrating RGB-D inputs, hybrid real-and-simulated data (2 million trajectories), multitask visuomotor learning, and model-predictive control, PointWorld represents robot actions as 3D point flows, facilitating cross-embodiment generalization. A single pre-trained model, deployed on a real Franka robot without fine-tuning or demonstrations, successfully performs diverse tasks—including rigid-body pushing, manipulation of deformable and articulated objects, and tool use—with an inference time of 0.1 seconds.
📝 Abstract
Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or a few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel 3D displacements in response to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on the physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With real-time (0.1 s) inference, PointWorld can be efficiently integrated into the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training, all from a single image captured in the wild. Project website at https://point-world.github.io/.
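To make the MPC integration concrete, the sketch below shows how a point-flow world model can drive sampling-based control: candidate action sequences are rolled through the model, the predicted point positions are scored against a goal point cloud, and the first action of the best sequence is executed. All names here (`predict_point_flow`, `mpc_step`) and the toy dynamics are hypothetical stand-ins; the paper's actual model is a learned network and its planner details are not specified in the abstract.

```python
import numpy as np

def predict_point_flow(points, actions):
    """Stand-in for the pre-trained 3D world model: maps current 3D points
    and a candidate action sequence to predicted future point positions.
    (Hypothetical toy dynamics -- each action step translates the scene
    points by a small fraction of the commanded end-effector delta.)"""
    flow = np.sum(actions, axis=0) * 0.01  # accumulated displacement, shape (3,)
    return points + flow                   # broadcast over all N points

def mpc_step(points, goal_points, horizon=5, num_samples=64, rng=None):
    """One step of sampling-based MPC: sample candidate action sequences,
    roll each through the world model, score predicted points against the
    goal point cloud, and return the first action of the best sequence."""
    rng = rng or np.random.default_rng(0)
    best_cost, best_action = np.inf, None
    for _ in range(num_samples):
        # Candidate sequence of low-level end-effector deltas (hypothetical action space).
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 3))
        predicted = predict_point_flow(points, actions)
        # Cost: mean per-point distance between predicted and goal positions.
        cost = np.mean(np.linalg.norm(predicted - goal_points, axis=-1))
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]
    return best_action, best_cost
```

Because state and action live in the same 3D space, the planner's cost is just a geometric distance between predicted and desired point clouds, which is what makes the single-checkpoint, demonstration-free deployment described above plausible.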