Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

📅 2026-02-18

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of real-world data in robotic manipulation by introducing PhysGen, a framework that uses a pretrained video generation model as an implicit physics simulator. PhysGen models the dynamic interaction between environments and actions through autoregressive video generation, using a multimodal continuous representation that unifies visual observations and continuous actions into shared physical tokens. This enables knowledge transfer from purely vision-based pretraining to robotic control without requiring action-labeled pretraining data. By integrating causal masking, inverse kinematics, and Lookahead Multi-Token Prediction (L-MTP), PhysGen outperforms OpenVLA and WorldVLA by 13.8% and 8.8%, respectively, on the Libero and ManiSkill benchmarks. In real-world settings it matches the performance of large action-pretrained models such as π₀, and it does particularly well on physically challenging tasks such as grasping transparent objects.

📝 Abstract

The scarcity of large-scale robotic data has motivated the repurposing of foundation models from other modalities for policy learning. In this work, we introduce PhysGen (Learning Physics from Pretrained Video Generation Models), a scalable continuous and sequential world interaction framework that leverages autoregressive video generation to solve robotic manipulation tasks. By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. We introduce a multimodal continuous representation that unifies video and action into shared physical tokens, bridging the gap between discrete video generation and continuous robotic control. This approach enables the seamless transfer of implicit physical knowledge, such as object permanence and dynamics, from video pretraining to downstream manipulation. To ensure efficient convergence, we incorporate causal masking, inverse kinematics, Lookahead Multi-Token Prediction (L-MTP), and key-value (KV) caching. Experimental results on the Libero and ManiSkill benchmarks demonstrate that PhysGen consistently outperforms robust baselines, surpassing OpenVLA and WorldVLA by margins of 13.8% and 8.8%, respectively. Notably, in real-world scenarios, PhysGen matches the performance of large-scale action-pretrained models like $\pi_0$ without requiring prior action-specific pretraining, demonstrating superior capability in physically complex tasks such as grasping transparent objects. These findings validate the potential of extracting physical intuition from pretrained video generators to facilitate generalizable robotic manipulation.
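
The abstract names causal masking as one ingredient but gives no implementation details. As a generic illustration only (this is not PhysGen's code), a minimal NumPy sketch of causally masked attention over a token sequence might look like the following. The function names, the sequence length of 6, and the token dimension of 8 are all assumptions made for the example.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Boolean mask: entry [i, j] is True when position i may attend to j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention with a causal mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Future positions get -inf so they receive zero weight after the softmax.
    scores = np.where(causal_mask(len(q)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical input: 6 tokens (e.g. interleaved frame and action tokens), dim 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
out = masked_attention(tokens, tokens, tokens)

# Changing a later token must not change earlier outputs.
perturbed = tokens.copy()
perturbed[5] += 1.0
out_perturbed = masked_attention(perturbed, perturbed, perturbed)
```

Because each position attends only to itself and earlier positions, the first five rows of `out` and `out_perturbed` agree. The same property is what allows keys and values for past tokens to be cached and reused during autoregressive generation.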

Problem

Research questions and friction points this paper is trying to address.

robotic manipulation

physics learning

pretrained video models

multimodal representation

physical intuition

Innovation

Methods, ideas, or system contributions that make the work stand out.

pretrained video models

multimodal continuous representation

physics simulation