EgoX: Egocentric Video Generation from a Single Exocentric Video

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating first-person videos from a single third-person video, where large pose discrepancies and limited viewpoint overlap cause severe geometric and content distortions. To tackle this, we propose a unified conditional modeling framework that integrates dual-view priors, a geometry-guided self-attention mechanism that explicitly enforces spatial consistency, and lightweight LoRA fine-tuning of a pre-trained spatio-temporal video diffusion model. Additionally, we introduce joint channel- and width-wise feature concatenation to enhance representational capacity. Evaluated across multiple unseen scenes and real-world environments, our method produces videos with high visual fidelity and strong geometric consistency. It demonstrates superior generalization and robustness compared to prior approaches, particularly under significant viewpoint and pose variations.
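The joint channel- and width-wise feature concatenation can be illustrated with a minimal sketch. The tensor layout (frames, channels, height, width) and the latent shapes below are assumptions for illustration, not the paper's actual configuration:

```python
import numpy as np

# Assumed latent layout: (frames, channels, height, width).
exo = np.random.randn(16, 4, 32, 32)        # exocentric video latent (assumed shape)
ego_prior = np.random.randn(16, 4, 32, 32)  # egocentric prior latent (assumed shape)

# Width-wise: place the two views side by side on one spatial canvas,
# so attention can relate tokens across views spatially.
width_concat = np.concatenate([exo, ego_prior], axis=3)    # -> (16, 4, 32, 64)

# Channel-wise: stack the two views per spatial location,
# fusing their features at every position.
channel_concat = np.concatenate([exo, ego_prior], axis=1)  # -> (16, 8, 32, 32)
```

The intuition behind combining both, as described in the summary, is that the width-wise path preserves each view as a distinct spatial region while the channel-wise path fuses them locally; using the two together plausibly increases representational capacity.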

📝 Abstract
Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width- and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.
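A minimal sketch of what a geometry-guided self-attention step could look like: a boolean relevance mask, derived from cross-view geometry, restricts each query token to geometrically plausible key tokens. The function name `geometry_guided_attention` and the mask construction are hypothetical; the paper's actual mechanism is not specified here:

```python
import numpy as np

def geometry_guided_attention(q, k, v, geo_mask):
    """Single-head attention where a geometry-derived boolean mask
    (True = key token is spatially relevant to the query token)
    suppresses attention to irrelevant regions. Shapes are assumed:
    q, k, v: (tokens, dim); geo_mask: (tokens, tokens)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Masked positions get a large negative score so softmax ignores them.
    scores = np.where(geo_mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In this sketch, an all-True mask recovers ordinary self-attention, while a tighter mask (e.g. derived from projecting exocentric geometry into the egocentric frame) enforces the spatial consistency the abstract describes.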
Problem

Research questions and friction points this paper is trying to address.

Generate egocentric videos from single exocentric inputs
Address extreme camera pose variations and minimal view overlap
Preserve visible content while synthesizing unseen regions consistently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pretrained video diffusion models via LoRA adaptation
Uses unified conditioning with exocentric and egocentric priors
Implements geometry-guided self-attention for coherence and fidelity
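The lightweight LoRA adaptation mentioned in the first bullet can be sketched as a frozen pretrained weight plus a trainable low-rank update. `LoRALinear`, the rank, and the initialization below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA-adapted linear layer: y = x W^T + (alpha/r) * x A^T B^T.
    W is the frozen pretrained weight; only the low-rank factors A, B train."""

    def __init__(self, weight, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = weight                                          # frozen (out, in)
        self.A = rng.normal(0.0, 0.02, (rank, weight.shape[1]))  # down-projection
        self.B = np.zeros((weight.shape[0], rank))               # up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        # Zero-initialized B means the layer starts exactly at the
        # pretrained behavior; fine-tuning only learns the low-rank delta.
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale
```

The appeal for video diffusion backbones is that only the small `A`/`B` factors are trained, so the pretrained spatio-temporal knowledge is preserved while adaptation stays cheap.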