Walk through Paintings: Egocentric World Models from Internet Priors

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key challenge in embodied intelligence: controllable and physically consistent future video prediction. The authors propose EgoWM, a method that repurposes a general-purpose pretrained video diffusion model into an action-conditioned world model without training from scratch. By injecting action commands through lightweight conditioning layers, EgoWM leverages the model's inherent world priors to generate high-fidelity, egocentric future predictions. The approach supports embodiments spanning 3 to 25 degrees of freedom and introduces a Structural Consistency Score (SCS) to evaluate physical plausibility. Experiments demonstrate that EgoWM improves SCS by up to 80% over prior navigation world models, reduces inference latency to one-sixth of the previous state of the art, and generalizes to unseen environments, including the interiors of paintings.

📝 Abstract
What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks, requiring only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80 percent over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.
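The abstract describes injecting motor commands into a frozen video diffusion backbone through "lightweight conditioning layers" but does not specify the mechanism. One common pattern for this kind of adapter is FiLM-style modulation, where a small learned map turns the action vector into per-channel scale and shift parameters. The sketch below is purely illustrative under that assumption; the function name `film_conditioning` and all shapes are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def film_conditioning(features, action, W_gamma, W_beta):
    """Modulate frozen-backbone features with an action vector.

    features: (C, H, W) feature map from the pretrained diffusion model
    action:   (A,) motor command, e.g. 3-DoF velocity or 25-DoF joint angles
    W_gamma, W_beta: (C, A) weights of the lightweight adapter
    """
    gamma = W_gamma @ action  # per-channel scale, shape (C,)
    beta = W_beta @ action    # per-channel shift, shape (C,)
    return (1.0 + gamma)[:, None, None] * features + beta[:, None, None]

C, H, W, A = 8, 4, 4, 3                # small sizes for illustration
features = rng.standard_normal((C, H, W))
action = np.array([0.5, 0.0, 0.1])     # e.g. (v_x, v_y, yaw rate)

# Zero-initializing the adapter leaves the backbone's output untouched
# at the start of fine-tuning, a common trick for adding control
# signals without disturbing the pretrained priors.
W_gamma = np.zeros((C, A))
W_beta = np.zeros((C, A))
out = film_conditioning(features, action, W_gamma, W_beta)
assert np.allclose(out, features)      # identity before any training
```

With zero initialization the conditioned model starts as the unmodified video prior, so fine-tuning only has to learn how actions perturb the prediction, which is consistent with the paper's claim of needing only modest fine-tuning rather than training from scratch.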
Problem

Research questions and friction points this paper is trying to address.

egocentric world models
action-conditioned prediction
video diffusion models
physical consistency
embodied AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Egocentric World Model
video diffusion model
action-conditioned prediction
Structural Consistency Score
internet-scale priors
Authors
Anurag Bagchi, Carnegie Mellon University
Zhipeng Bao, Carnegie Mellon University
Homanga Bharadhwaj, Research Scientist, FAIR - AI at Meta
Yu-Xiong Wang, University of Illinois Urbana-Champaign
P. Tokmakov, Toyota Research Institute
Martial Hebert, Carnegie Mellon University