🤖 AI Summary
This paper addresses the challenge of seamless viewpoint translation between egocentric (first-person) and exocentric (third-person) perspectives in video generation. To this end, we propose a multi-view joint modeling framework that introduces (1) an in-context perspective alignment mechanism to ensure temporal synchronization across viewpoints and (2) a collaborative position encoding scheme to enhance the spatial consistency of agents and scenes. We establish the first in-context learning framework tailored for multi-view video generation and release EgoExo-8K, a large-scale, multi-view benchmark dataset. Built on a video diffusion transformer, our approach synthesizes egocentric and exocentric videos simultaneously. Extensive experiments on both synthetic and real-world scenes demonstrate significant improvements in cross-view temporal coherence and visual fidelity, achieving state-of-the-art performance across multiple benchmarks. This work establishes a new paradigm for viewpoint transfer in embodied AI and world-modeling research.
📝 Abstract
Video diffusion models have recently achieved remarkable progress in realism and controllability. However, seamless video translation across perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To support this task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.
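The abstract names two components, In-Context Perspective Alignment and Collaborative Position Encoding, without spelling out their implementation. The sketch below is a rough illustration of the general idea only: corresponding ego and exo frame tokens share the same temporal position index (a stand-in for collaborative position encoding), and the two views are concatenated into one token sequence so self-attention spans viewpoints (a stand-in for in-context alignment). All class names, shapes, and layer choices are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only -- NOT the released WorldWander code.
# Assumed input: per-frame latent tokens of shape (batch, frames, dim) per view.
import torch
import torch.nn as nn


class JointEgoExoBlock(nn.Module):
    """Toy transformer block attending jointly over ego and exo frame tokens."""

    def __init__(self, dim: int = 256, heads: int = 8, num_frames: int = 16):
        super().__init__()
        self.time_pos = nn.Embedding(num_frames, dim)  # shared temporal indices
        self.view_emb = nn.Embedding(2, dim)           # 0 = ego, 1 = exo
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)

    def forward(self, ego: torch.Tensor, exo: torch.Tensor) -> torch.Tensor:
        b, f, d = ego.shape
        t = torch.arange(f, device=ego.device)

        # "Collaborative position encoding" (assumed form): corresponding ego
        # and exo frames receive the SAME temporal index, so frame t in one
        # view is positionally aligned with frame t in the other.
        ego = ego + self.time_pos(t) + self.view_emb.weight[0]
        exo = exo + self.time_pos(t) + self.view_emb.weight[1]

        # "In-context perspective alignment" (assumed form): both views are
        # concatenated into one token sequence, so self-attention in the
        # diffusion transformer spans viewpoints within a single context.
        joint = torch.cat([ego, exo], dim=1)           # (batch, 2*frames, dim)
        return self.block(joint)


if __name__ == "__main__":
    ego = torch.randn(2, 16, 256)                      # dummy ego latents
    exo = torch.randn(2, 16, 256)                      # dummy exo latents
    print(JointEgoExoBlock()(ego, exo).shape)          # torch.Size([2, 32, 256])
```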