WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of seamless viewpoint transition between egocentric (first-person) and exocentric (third-person) perspectives in video generation. To this end, we propose a novel multi-view joint modeling framework. Our method introduces (1) an in-context perspective alignment mechanism to ensure temporal synchronization across viewpoints, and (2) collaborative position encoding to enhance spatial consistency of agents and scenes. We establish the first context-learning framework tailored for multi-view video generation and release EgoExo-8K, a large-scale, multi-view benchmark dataset. Implemented atop a video diffusion Transformer, our approach enables simultaneous egocentric and exocentric video synthesis. Extensive experiments demonstrate significant improvements in cross-view temporal coherence and visual fidelity on both synthetic and real-world scenes, achieving state-of-the-art performance across multiple benchmarks. This work establishes a new paradigm for viewpoint transfer in embodied AI and world modeling research.

📝 Abstract
Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.
Problem

Research questions and friction points this paper is trying to address.

Bridging first-person and third-person video perspectives
Achieving seamless cross-view video translation
Enhancing synchronization and consistency in multi-perspective videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-context learning framework for perspective translation
Integrates perspective alignment and collaborative position encoding
Uses large-scale synchronized egocentric-exocentric dataset
👥 Authors
Quanjian Song (Show Lab, National University of Singapore)
Yiren Song (Ph.D. student, National University of Singapore)
Kelly Peng (University of California, Berkeley)
Yuan Gao (First Intelligence)
Mike Zheng Shou (Show Lab, National University of Singapore)