AI Summary
Robot manipulation suffers from scarce demonstration data and a heavy reliance on action-specific pretraining. Method: This paper proposes the Physical Autoregressive Model (PAR), the first approach to directly transfer the physical-world knowledge embedded in video pretraining to robot control. PAR introduces learnable physical tokens that jointly represent visual frames and robot actions, uses a DiT-based de-tokenizer to decode them as continuous tokens (mitigating quantization error), and combines a causal mask with inverse kinematics, parallel training, and KV-caching to improve performance and inference efficiency. Crucially, PAR requires no action pretraining: its understanding of environment-robot dynamics comes from video pretraining alone. Results: On the ManiSkill benchmark, PAR achieves a 100% success rate on PushCube and matches action-pretrained baselines on other tasks. It also predicts future video accurately and generates action trajectories that are tightly aligned with the predicted frames.
Abstract
The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.
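To make the roles of the causal mask and the KV-cache concrete, the following is a minimal sketch in pure Python, not the paper's implementation. It assumes a physical token is simply a frame feature vector concatenated with an action vector, uses single-head attention with identity projections (query, key, and value are all the token itself), and leaves out the DiT-based de-tokenizer and the inverse-kinematics part of the mask. The names `make_physical_token`, `full_pass`, and `cached_pass` are hypothetical and chosen for illustration only.

```python
import math


def make_physical_token(frame_feat, action):
    """Hypothetical physical token: frame features concatenated with an action."""
    return list(frame_feat) + list(action)


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def causal_mask(n):
    """mask[i][j] is True when step i may attend to step j, i.e. j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_pass(tokens):
    """Recompute attention for every step from scratch, using the causal mask."""
    mask = causal_mask(len(tokens))
    outputs = []
    for i, query in enumerate(tokens):
        visible = [t for j, t in enumerate(tokens) if mask[i][j]]
        outputs.append(attend(query, visible, visible))
    return outputs


def cached_pass(tokens):
    """Process one step at a time, reusing cached keys and values."""
    cache_k, cache_v, outputs = [], [], []
    for token in tokens:
        cache_k.append(token)  # identity key projection (assumption)
        cache_v.append(token)  # identity value projection (assumption)
        outputs.append(attend(token, cache_k, cache_v))
    return outputs


# Four toy timesteps: 2-D frame feature plus 2-D action per step.
tokens = [
    make_physical_token([0.1 * t, 0.2], [0.0, 1.0 - 0.1 * t])
    for t in range(4)
]
full = full_pass(tokens)
cached = cached_pass(tokens)
max_diff = max(
    abs(a - b) for row_f, row_c in zip(full, cached) for a, b in zip(row_f, row_c)
)
print("max difference between full and cached passes:", max_diff)
```

Because the mask prevents any step from attending to later steps, the keys and values of earlier steps never change as the sequence grows. That is why the cached pass reproduces the full recomputation while doing only one attention call per new token at inference time.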