Robot Learning from a Physical World Model

📅 2025-11-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Video generation models produce pixel-level motions that lack physical consistency, leading to inaccurate robotic manipulation. Method: This paper proposes a zero-shot generalization framework that synergistically integrates visual generation with physical-world reconstruction. Given a language instruction and a single input image, it first generates a task-conditioned video demonstration; it then reconstructs an object-level physical scene to map implicit visual actions into dynamically feasible, executable trajectories; finally, it employs object-centric residual reinforcement learning to refine control policies in simulation, all without requiring any real-robot interaction data. Contribution/Results: Experiments demonstrate significant improvements in precision across diverse real-world manipulation tasks. To the authors' knowledge, this is the first approach that enables physically plausible, zero-shot robotic manipulation driven purely by vision-based generation.
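A minimal sketch of how the three stages described in the summary could fit together. Every name here (generate_task_conditioned_video, reconstruct_physical_scene, train_residual_policy, PhysicalScene) is a hypothetical placeholder standing in for the paper's components, not a released PhysWorld interface:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

# Hypothetical scaffold of the pipeline summarized above. The stage functions
# are stubs standing in for (1) a video generation model, (2) an object-level
# physical-scene reconstructor, and (3) a residual-RL trainer run in simulation.

@dataclass
class PhysicalScene:
    objects: List[Any] = field(default_factory=list)  # reconstructed object geometry / poses
    physics: dict = field(default_factory=dict)       # assumed physical parameters (mass, friction, ...)

def generate_task_conditioned_video(image: Any, instruction: str) -> List[Any]:
    """Stub: a video generation model would return a frame sequence here."""
    return [image]

def reconstruct_physical_scene(video: List[Any]) -> PhysicalScene:
    """Stub: recover an object-level physical scene from the generated frames."""
    return PhysicalScene()

def train_residual_policy(scene: PhysicalScene, video: List[Any]) -> Callable:
    """Stub: object-centric residual RL refines the video-derived motion in simulation."""
    return lambda obs, t: None  # would return the trained control policy

def physworld_style_pipeline(image: Any, instruction: str) -> Callable:
    video = generate_task_conditioned_video(image, instruction)  # stage 1: visual demonstration
    scene = reconstruct_physical_scene(video)                    # stage 2: physical world model
    return train_residual_policy(scene, video)                   # stage 3: executable policy
```
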

๐Ÿ“ Abstract
We introduce PhysWorld, a framework that enables robot learning from video generation through physical world modeling. Recent video generation models can synthesize photorealistic visual demonstrations from language commands and images, offering a powerful yet underexplored source of training signals for robotics. However, directly retargeting pixel motions from generated videos to robots neglects physics, often resulting in inaccurate manipulations. PhysWorld addresses this limitation by coupling video generation with physical world reconstruction. Given a single image and a task command, our method generates task-conditioned videos and reconstructs the underlying physical world from the videos, and the generated video motions are grounded into physically accurate actions through object-centric residual reinforcement learning with the physical world model. This synergy transforms implicit visual guidance into physically executable robotic trajectories, eliminating the need for real robot data collection and enabling zero-shot generalizable robotic manipulation. Experiments on diverse real-world tasks demonstrate that PhysWorld substantially improves manipulation accuracy compared to previous approaches. Visit the project webpage (https://pointscoder.github.io/PhysWorld_Web/) for details.
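The abstract's "object-centric residual reinforcement learning" suggests composing a nominal, video-derived action with a small learned correction. Below is a minimal illustrative sketch of that composition, assuming a nominal object trajectory retargeted from the generated video and a bounded residual learned in simulation; the class, parameters, and linear residual map are hypothetical simplifications, not the paper's implementation:

```python
import numpy as np

class ObjectCentricResidualPolicy:
    """Illustrative residual controller (hypothetical, not PhysWorld's code):
    the nominal action at step t comes from the object trajectory retargeted
    from the generated video; a small learned residual corrects it so the
    motion stays physically feasible in the reconstructed simulation."""

    def __init__(self, base_trajectory, obs_dim, residual_scale=0.05, seed=0):
        self.base_trajectory = np.asarray(base_trajectory)  # (T, action_dim) nominal actions
        action_dim = self.base_trajectory.shape[1]
        rng = np.random.default_rng(seed)
        # Stand-in for a small policy network; in practice these weights would
        # be trained with RL inside the reconstructed scene.
        self.W = 0.01 * rng.standard_normal((action_dim, obs_dim))
        self.residual_scale = residual_scale                 # keeps corrections small

    def act(self, obs, t):
        base = self.base_trajectory[t]                       # video-derived nominal action
        residual = np.tanh(self.W @ np.asarray(obs))         # bounded learned correction
        return base + self.residual_scale * residual


# Usage: a 10-step nominal trajectory in a 7-DoF action space with 16-d observations.
policy = ObjectCentricResidualPolicy(np.zeros((10, 7)), obs_dim=16)
action = policy.act(np.zeros(16), t=0)
```

Bounding the residual keeps the executed motion close to the generated demonstration while still leaving room for physics-driven corrections, which matches the paper's goal of grounding visual guidance into dynamically feasible trajectories.
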
Problem

Research questions and friction points this paper is trying to address.

Converting video-generated motions into physically accurate robotic manipulations
Eliminating the need for real robot data collection during training
Enabling zero-shot generalizable robotic manipulation through physical grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates task-conditioned videos from single images
Reconstructs physical world models from generated videos
Uses object-centric residual reinforcement learning for actions
Authors

Jiageng Mao, University of Southern California (Robotics, Computer Vision)
Sicheng He, USC
Hao-Ning Wu, Google DeepMind
Yang You, Postdoc, Stanford University (3D vision, computer graphics, computational geometry)
Shuyang Sun, Google DeepMind (Computer Vision, Pattern Recognition, Machine Learning)
Zhicheng Wang, Google DeepMind
Yanan Bao, Google DeepMind (Machine Learning, Data Mining, Green Communications)
Huizhong Chen, Google DeepMind
Leonidas J. Guibas, Google DeepMind, Stanford
V. Guizilini, Toyota Research Institute
Howard Zhou, Google DeepMind
Yue Wang, USC