EgoExo-WM: Unlocking Exo Video for Ego World Models

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two limits on egocentric world models: scarce egocentric training data and the partial observability of human physical actions from a first-person view. The authors propose drawing on abundant in-the-wild exocentric video instead. They extract structured body pose from exocentric video as the action representation and, guided by a human kinematics prior, convert the exocentric video into egocentric video. The converted data is then used to train whole-body action-conditioned egocentric world models. Experiments show significant improvements in both prediction quality and downstream planning, where the model infers the sequence of body poses needed to reach a visual goal state.
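One way to picture the pose side of this conversion is as a change of reference frame: 3D joints estimated in a world (or exocentric camera) frame are re-expressed relative to the wearer's head. The sketch below is only an illustration under that assumption and is not the authors' pipeline. The function names, the joint dictionary layout, and the yaw-only head rotation are all hypothetical, and the kinematics prior is not modelled.

```python
import math

def transpose(m):
    """Transpose a 3x3 matrix given as nested lists."""
    return [[m[j][i] for j in range(3)] for i in range(3)]

def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def yaw_rotation(angle_rad):
    """Head-to-world rotation for a head turned about the vertical (z) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def to_head_frame(joints_world, head_rotation, head_position):
    """Re-express joints in the head frame: p_head = R^T (p_world - t).

    Assumes head_rotation maps head-local coordinates to world coordinates
    and head_position is the head origin in world coordinates.
    """
    r_inv = transpose(head_rotation)  # the inverse of a rotation is its transpose
    return {
        name: mat_vec(r_inv, [p[k] - head_position[k] for k in range(3)])
        for name, p in joints_world.items()
    }

# Hypothetical example: head at (1, 2, 1.7) m, turned 90 degrees about z.
joints = {"right_wrist": [1.0, 2.5, 1.2]}
local = to_head_frame(joints, yaw_rotation(math.pi / 2), [1.0, 2.0, 1.7])
print(local["right_wrist"])  # approximately [0.5, 0.0, -0.5]
```

In this toy case the wrist ends up about 0.5 m along the head's local x-axis and 0.5 m below the head, regardless of where the exocentric camera stood.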
📝 Abstract
Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.
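The planning task described in the abstract, inferring a sequence of body poses that reaches a visual goal state, can be illustrated with a simple random-shooting planner. This is a minimal sketch and not the method from the paper: the abstract does not say which planner is used. `toy_world_model` is a hypothetical stand-in for a learned action-conditioned model, states are small vectors rather than images, and each "pose" is just a per-step delta.

```python
import math
import random

def toy_world_model(state, action):
    """Hypothetical stand-in for a learned model: next state = state + action."""
    return [s + a for s, a in zip(state, action)]

def rollout(state, actions):
    """Apply a sequence of actions through the world model; return the final state."""
    for action in actions:
        state = toy_world_model(state, action)
    return state

def plan_by_random_shooting(start, goal, horizon=4, num_candidates=256,
                            max_step=0.5, seed=0):
    """Sample random action sequences and keep the one ending closest to the goal."""
    rng = random.Random(seed)
    best_actions, best_cost = None, math.inf
    for _ in range(num_candidates):
        actions = [[rng.uniform(-max_step, max_step) for _ in start]
                   for _ in range(horizon)]
        cost = math.dist(rollout(start, actions), goal)  # distance to goal state
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

actions, cost = plan_by_random_shooting([0.0, 0.0], [1.0, 1.0])
print(len(actions), round(cost, 3))
```

With a real world model, the cost would compare the predicted future frame (or its embedding) to the goal image, and the sampled actions would be whole-body poses rather than 2D deltas.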
Problem

Research questions and friction points this paper is trying to address.

egocentric world models
exocentric video
action representation
partial observability
data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric world models
exocentric-to-egocentric translation
body pose representation
human kinematics prior
action-conditioned video prediction