Physical Autoregressive Model for Robotic Manipulation without Action Pretraining

Date: 2025-08-13

AI Summary
Problem: Robot manipulation suffers from scarce demonstration data and a heavy reliance on action-specific pretraining. Method: This paper proposes the Physical Autoregressive Model (PAR), which transfers the physical-world knowledge embedded in autoregressive video pretraining to robot control. PAR introduces physical tokens that combine visual frames and robot actions, uses a DiT-based de-tokenizer to decode them as continuous tokens, and adds a causal mask with inverse kinematics, parallel training, and KV-caching to improve performance and inference efficiency. Crucially, PAR requires no action pretraining: its understanding of physical dynamics comes from video pretraining. Results: On the ManiSkill benchmark, PAR achieves a 100% success rate on PushCube and matches action-pretrained baselines on other tasks. It also predicts future video accurately, with tightly aligned action trajectories.
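The KV-caching mentioned in the summary can be illustrated with a toy example. The following is a minimal sketch, not PAR's implementation: it assumes single-head dot-product attention with identity projections (each token acts as its own query, key, and value), and the names `KVCache`, `attend`, and `full_causal_attention` are hypothetical. It shows that storing one key/value pair per step yields the same outputs as recomputing causal attention over the whole prefix.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


class KVCache:
    """Hypothetical cache: keeps past keys/values so each step only adds one pair."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append the new key/value, then attend over everything seen so far.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def full_causal_attention(queries, keys, values):
    """Reference: rebuild the visible prefix from scratch at every position."""
    return [
        attend(queries[t], keys[: t + 1], values[: t + 1])
        for t in range(len(queries))
    ]


# Toy tokens (assumption: identity projections, so q = k = v = token).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
incremental = [cache.step(tok, tok, tok) for tok in tokens]
reference = full_causal_attention(tokens, tokens, tokens)
```

In this sketch the cached path performs one attention call per new token over stored keys and values, which is the property that makes autoregressive rollout cheaper than re-encoding the full sequence at every step.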


Abstract
The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.
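To make the idea of a causal mask over physical tokens more concrete, here is a minimal sketch under stated assumptions. The abstract does not specify the mask layout, so the following is one plausible reading rather than the authors' design: each step contributes a block of frame tokens followed by action tokens, every token sees all earlier steps, and within a step the action tokens may attend to the frame tokens (an inverse-kinematics-style dependency, where the action is inferred from the observed frame) but not the other way around. The function name `build_physical_mask` and its parameters are hypothetical.

```python
def build_physical_mask(num_steps, frame_tokens_per_step, action_tokens_per_step=1):
    """Return (mask, layout); mask[i][j] is True if token i may attend to token j.

    Assumed sequence layout: [F_0..., A_0..., F_1..., A_1..., ...].
    """
    layout = []  # one (step, kind) entry per token
    for step in range(num_steps):
        layout += [(step, "frame")] * frame_tokens_per_step
        layout += [(step, "action")] * action_tokens_per_step

    n = len(layout)
    mask = [[False] * n for _ in range(n)]
    for i, (step_i, kind_i) in enumerate(layout):
        for j, (step_j, kind_j) in enumerate(layout):
            if step_j < step_i:
                # Earlier physical tokens are fully visible.
                mask[i][j] = True
            elif step_j == step_i:
                # Same step: frames see only frames; actions see frames and actions.
                mask[i][j] = kind_i == "action" or kind_j == "frame"
            # Later steps stay masked (False), which keeps the model causal.
    return mask, layout


# Two steps, two frame tokens and one action token per step:
# token order is F0 F0 A0 F1 F1 A1.
mask, layout = build_physical_mask(num_steps=2, frame_tokens_per_step=2)
```

Because the mask is fixed for a given sequence length, all steps of a demonstration can be trained in parallel in one forward pass, while inference still proceeds one physical token at a time.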
Problem

Research questions and friction points this paper is trying to address.

Develops a Physical Autoregressive Model (PAR) for robotic manipulation without action pretraining
Leverages video pretraining to understand physical dynamics and predict actions
Improves performance with DiT-based de-tokenizer and causal mask techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Physical tokens combine frames and actions
DiT-based de-tokenizer reduces quantization errors
Causal mask with inverse kinematics, parallel training, and KV-cache improve performance and efficiency
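The quantization error that continuous tokens avoid can be shown with a small numeric example. This is an illustrative sketch only, assuming a scalar action in [-1, 1] and uniform binning; it is not the paper's tokenizer or de-tokenizer, and `quantize` and `worst_case_error` are hypothetical helpers. A discrete token can only return its bin center, so the reconstruction error can reach half a bin width, whereas a continuous token carries the value itself.

```python
def quantize(x, num_bins, lo=-1.0, hi=1.0):
    """Map x to the center of its uniform bin, as a discrete token would."""
    width = (hi - lo) / num_bins
    index = min(int((x - lo) / width), num_bins - 1)  # clamp x == hi into last bin
    return lo + (index + 0.5) * width


def worst_case_error(num_bins, samples=1001, lo=-1.0, hi=1.0):
    """Largest |quantize(x) - x| over an evenly spaced grid on [lo, hi]."""
    grid = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    return max(abs(quantize(x, num_bins, lo, hi) - x) for x in grid)


coarse = worst_case_error(num_bins=8)   # bin width 0.25
fine = worst_case_error(num_bins=64)    # bin width 0.03125
```

Adding bins shrinks the error but grows the vocabulary, which is one motivation for decoding frames and actions as continuous values instead.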