🤖 AI Summary
This paper addresses the challenge of seamless viewpoint translation between egocentric (first-person) and exocentric (third-person) perspectives in video generation. To this end, we propose a multi-view joint modeling framework that introduces (1) an in-context perspective alignment mechanism to ensure temporal synchronization across viewpoints and (2) a collaborative position encoding scheme to enhance the spatial consistency of agents and scenes. We establish the first in-context learning framework tailored for multi-view video generation and release EgoExo-8K, a large-scale, multi-view benchmark dataset. Built on a video diffusion transformer, our approach synthesizes egocentric and exocentric videos simultaneously. Extensive experiments on both synthetic and real-world scenes demonstrate significant improvements in cross-view temporal coherence and visual fidelity, achieving state-of-the-art performance across multiple benchmarks. This work establishes a new paradigm for viewpoint transfer in embodied AI and world-modeling research.
📝 Abstract
Video diffusion models have recently achieved remarkable progress in realism and controllability. However, seamless video translation across perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To support this task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.
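The abstract names two components, In-Context Perspective Alignment and Collaborative Position Encoding, without spelling out their implementation. The sketch below is a rough illustration of the general idea only: corresponding ego and exo frame tokens share the same temporal position index (a stand-in for collaborative position encoding), and the two views are concatenated into one token sequence so self-attention spans viewpoints (a stand-in for in-context alignment). All class names, shapes, and layer choices are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only -- NOT the released WorldWander code.
# Assumed input: per-frame latent tokens of shape (batch, frames, dim) per view.
import torch
import torch.nn as nn


class JointEgoExoBlock(nn.Module):
    """Toy transformer block attending jointly over ego and exo frame tokens."""

    def __init__(self, dim: int = 256, heads: int = 8, num_frames: int = 16):
        super().__init__()
        self.time_pos = nn.Embedding(num_frames, dim)  # shared temporal indices
        self.view_emb = nn.Embedding(2, dim)           # 0 = ego, 1 = exo
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)

    def forward(self, ego: torch.Tensor, exo: torch.Tensor) -> torch.Tensor:
        b, f, d = ego.shape
        t = torch.arange(f, device=ego.device)

        # "Collaborative position encoding" (assumed form): corresponding ego
        # and exo frames receive the SAME temporal index, so frame t in one
        # view is positionally aligned with frame t in the other.
        ego = ego + self.time_pos(t) + self.view_emb.weight[0]
        exo = exo + self.time_pos(t) + self.view_emb.weight[1]

        # "In-context perspective alignment" (assumed form): both views are
        # concatenated into one token sequence, so self-attention in the
        # diffusion transformer spans viewpoints within a single context.
        joint = torch.cat([ego, exo], dim=1)           # (batch, 2*frames, dim)
        return self.block(joint)


if __name__ == "__main__":
    ego = torch.randn(2, 16, 256)                      # dummy ego latents
    exo = torch.randn(2, 16, 256)                      # dummy exo latents
    print(JointEgoExoBlock()(ego, exo).shape)          # torch.Size([2, 32, 256])
```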