🤖 AI Summary
This work addresses video generation between synchronized third-person (exo) and first-person (ego) views, where the switch between viewpoints introduces spatio-temporal and geometric discontinuities. To this end, the authors propose Syn2Seq-Forcing, a framework that recasts cross-view video generation as a single continuous sequence modeling task. By constructing an interpolated signal between the source and target videos and applying diffusion sequence models such as Diffusion Forcing Transformers (DFoT), the method learns smooth inter-frame transitions directly, without relying on explicit pose interpolation, thereby mitigating temporal jitter at viewpoint switches. The authors report significant improvements from video interpolation alone, and present the approach as a general, flexible foundation for bidirectional Exo↔Ego synthesis.
📝 Abstract
Exo-to-Ego video generation aims to synthesize a first-person video from a synchronized third-person view and corresponding camera poses. While paired supervision is available, synchronized exo-ego data inherently introduces substantial spatio-temporal and geometric discontinuities, violating the smooth-motion assumptions of standard video generation benchmarks. We identify this synchronization-induced jump as the central challenge and propose Syn2Seq-Forcing, a sequential formulation that interpolates between the source and target videos to form a single continuous signal. By reframing Exo2Ego as sequential signal modeling rather than a conventional condition-output task, our approach enables diffusion-based sequence models, e.g., Diffusion Forcing Transformers (DFoT), to capture coherent transitions across frames more effectively. Empirically, we show that interpolating only the videos, without performing pose interpolation, already produces significant improvements, indicating that the dominant difficulty arises from spatio-temporal discontinuities. Beyond immediate performance gains, this formulation establishes a general and flexible framework capable of unifying both Exo2Ego and Ego2Exo generation within a single continuous sequence model, providing a principled foundation for future research in cross-view video synthesis.
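To make the idea of "a single continuous signal" concrete, the sketch below shows one possible reading of it. The abstract does not specify the interpolation scheme, so everything here is an assumption for illustration: the helper name `build_continuous_sequence`, the `num_bridge` parameter, and the choice of a linear pixel-space blend between the last exo frame and the first ego frame are all hypothetical and are not taken from the paper.

```python
import numpy as np


def build_continuous_sequence(exo, ego, num_bridge=4):
    """Hypothetical helper: join an exo clip and an ego clip into one sequence.

    exo, ego: arrays of shape (T, H, W, C) with matching frame shape.
    num_bridge: number of blended frames inserted between the two clips.
    Assumption: a linear blend in pixel space; the paper may use another scheme.
    """
    exo = np.asarray(exo, dtype=np.float32)
    ego = np.asarray(ego, dtype=np.float32)
    if exo.shape[1:] != ego.shape[1:]:
        raise ValueError("exo and ego frames must share the same (H, W, C)")
    if num_bridge < 0:
        raise ValueError("num_bridge must be non-negative")

    # Blend weights strictly between 0 and 1, so the bridge never
    # duplicates the last exo frame or the first ego frame.
    alphas = np.linspace(0.0, 1.0, num_bridge + 2)[1:-1].astype(np.float32)
    alphas = alphas[:, None, None, None]  # broadcast over (H, W, C)

    # Each bridge frame is a weighted mix of the two boundary frames.
    bridge = (1.0 - alphas) * exo[-1] + alphas * ego[0]

    # Source frames, then the transition, then target frames.
    return np.concatenate([exo, bridge, ego], axis=0)


if __name__ == "__main__":
    exo_clip = np.zeros((8, 4, 4, 3), dtype=np.float32)  # toy exo video
    ego_clip = np.ones((8, 4, 4, 3), dtype=np.float32)   # toy ego video
    sequence = build_continuous_sequence(exo_clip, ego_clip, num_bridge=3)
    print(sequence.shape)
```

Under this reading, a sequence model would see the exo frames, the transition, and the ego frames as one signal rather than as a condition and a separate output; reversing the order of the two clips would give the Ego2Exo direction.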