EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the unrealistic assumptions of existing third-person-to-first-person (exocentric-to-egocentric) viewpoint translation methods, such as reliance on 2D cues, synchronized multi-view capture, and a known initial egocentric pose. The authors propose EgoWorld, a generalizable framework that requires neither an initial egocentric reference frame nor explicit camera pose estimation. The two-stage approach first builds a geometrically consistent 3D representation of the hand-object interaction by combining depth estimation, point cloud reconstruction, and reprojection into the egocentric view; a text-guided diffusion model then completes the sparse reprojection into a dense egocentric image. Given only a single third-person RGB image, a 3D hand pose, and a natural language description as external inputs, EgoWorld generalizes to novel objects, actions, scenes, and users. It attains state-of-the-art performance on the H2O and TACO benchmarks and demonstrates practical utility on unlabeled real-world data.

📝 Abstract
Egocentric vision is essential for both human and machine visual understanding, particularly in capturing the detailed hand-object interactions needed for manipulation tasks. Translating third-person views into first-person views significantly benefits augmented reality (AR), virtual reality (VR), and robotics applications. However, current exocentric-to-egocentric translation methods are limited by their dependence on 2D cues and synchronized multi-view settings, and by unrealistic assumptions such as requiring an initial egocentric frame and known relative camera poses at inference time. To overcome these challenges, we introduce EgoWorld, a novel two-stage framework that reconstructs an egocentric view from rich exocentric observations, including projected point clouds, 3D hand poses, and textual descriptions. Our approach reconstructs a point cloud from estimated exocentric depth maps, reprojects it into the egocentric perspective, and then applies diffusion-based inpainting to produce dense, semantically coherent egocentric images. Evaluated on the H2O and TACO datasets, EgoWorld achieves state-of-the-art performance and demonstrates robust generalization to new objects, actions, scenes, and subjects. Moreover, EgoWorld shows promising results even on unlabeled real-world examples.
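To make the stage-one geometry concrete, here is a minimal sketch assuming a pinhole camera model: it lifts an exocentric depth map to a point cloud and reprojects it into a hypothetical egocentric camera. This is not the authors' code; the intrinsics K and the exo-to-ego pose (R, t) are illustrative placeholders, and only depth is carried (a real pipeline would carry RGB per point the same way).

```python
# Minimal sketch of stage one (not the authors' code): lift an exocentric
# depth map to a point cloud, then reproject it into an egocentric camera.
# K, R, and t are hypothetical placeholders for the camera intrinsics and
# the exo-to-ego relative pose.
import numpy as np

def unproject(depth, K):
    """Lift an HxW depth map to an (N, 3) point cloud in camera coordinates."""
    H, W = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.mgrid[0:H, 0:W]
    z = depth.ravel()
    x = (u.ravel() - cx) * z / fx
    y = (v.ravel() - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def reproject(points, K, R, t, H, W):
    """Project points into a target camera, z-buffering to the nearest point.
    Pixels that no point lands on stay 0: the holes inpainting must fill."""
    cam = points @ R.T + t                 # transform into the ego camera frame
    cam = cam[cam[:, 2] > 1e-6]            # keep points in front of the camera
    uv = cam @ K.T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    img = np.full((H, W), np.inf)
    np.minimum.at(img, (v[ok], u[ok]), cam[ok, 2])   # nearest-depth z-buffer
    img[np.isinf(img)] = 0.0
    return img

# Example with synthetic inputs: identity rotation, small lateral shift.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
depth = np.full((480, 640), 1.5)                     # flat plane 1.5 m away
pts = unproject(depth, K)
sparse_ego = reproject(pts, K, np.eye(3), np.array([0.05, 0.0, 0.0]), 480, 640)
```

The zero pixels left by the sparse reprojection are exactly the holes the diffusion stage is then asked to fill.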
Problem

Research questions and friction points this paper is trying to address.

Translating third-person views to first-person views for AR/VR and robotics
Removing dependence on 2D cues, synchronized multi-view capture, and assumed initial egocentric frames
Reconstructing egocentric views from rich exocentric observations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconstructs a point cloud from estimated exocentric depth maps
Reprojects the point cloud into the egocentric perspective
Applies diffusion-based inpainting to produce dense, semantically coherent egocentric images
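The abstract specifies text-guided, diffusion-based inpainting for the completion stage but not a concrete model; as a hedged sketch, an off-the-shelf Stable Diffusion inpainting pipeline can stand in, with the sparse reprojection as the known pixels and a mask over the holes. The checkpoint, file names, and prompt below are assumptions, not the paper's.

```python
# Illustrative stand-in for stage two (not the paper's model): complete the
# sparse reprojected egocentric image with a text-guided inpainting diffusion
# model. File names and the prompt are hypothetical.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

sparse_ego = Image.open("reprojected_ego.png").convert("RGB")  # stage-one output
hole_mask = Image.open("hole_mask.png").convert("L")           # white = pixels to fill
prompt = "first-person view of two hands manipulating an object on a table"

ego_image = pipe(prompt=prompt, image=sparse_ego, mask_image=hole_mask).images[0]
ego_image.save("ego_view.png")
```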