🤖 AI Summary
This work addresses two core challenges in generating ground-level street-view video from a single satellite image: geometric distortion and temporal incoherence. To this end, we propose an end-to-end framework that eliminates reliance on explicit elevation maps or hand-crafted geometric projections. Methodologically, we introduce a compact tri-plane scene-geometry representation coupled with ray-based pixel attention for robust cross-view geometric modeling; an epipolar-constrained temporal attention module that explicitly enforces inter-frame motion consistency; and a diffusion-based architecture that unifies the entire generation process. Evaluated on our newly constructed VIGOR++ dataset, our approach achieves significant improvements in geometric alignment accuracy, temporal coherence, and visual fidelity, enabling high-quality, long-sequence street-view video synthesis even in complex urban environments.
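To make the tri-plane plus ray-based pixel attention idea concrete, here is a minimal NumPy sketch. It is an illustrative assumption, not the paper's implementation: the plane names (`xy`, `xz`, `yz`), nearest-neighbor sampling, additive plane fusion, and dot-product attention are all simplifications chosen for clarity (a real system would use learned planes, bilinear sampling, and multi-head attention).

```python
import numpy as np

def sample_plane(plane, u, v):
    # Nearest-neighbor lookup on one feature plane; u, v in [0, 1].
    # (A real implementation would use bilinear interpolation.)
    H, W, C = plane.shape
    iu = np.clip((u * (W - 1)).astype(int), 0, W - 1)
    iv = np.clip((v * (H - 1)).astype(int), 0, H - 1)
    return plane[iv, iu]  # (..., C)

def triplane_features(planes, pts):
    # pts: (N, 3) points in the unit cube; feature = sum over the
    # three axis-aligned plane projections.
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    return (sample_plane(planes["xy"], x, y)
            + sample_plane(planes["xz"], x, z)
            + sample_plane(planes["yz"], y, z))  # (N, C)

def ray_pixel_attention(planes, origin, direction, query, n_samples=16):
    # For one ground-view pixel: march points along its ray, gather
    # tri-plane features, and softmax-attend with the pixel's query.
    t = np.linspace(0.05, 0.95, n_samples)[:, None]
    pts = np.clip(origin[None] + t * direction[None], 0.0, 1.0)
    feats = triplane_features(planes, pts)             # (S, C)
    logits = feats @ query / np.sqrt(query.shape[0])   # (S,)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ feats  # aggregated view-dependent feature, (C,)
```

The key point the sketch captures is that no height map is needed: the ray itself selects which parts of the scene volume each pixel attends to, and the attention weights play the role of a soft depth estimate.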
📝 Abstract
Generating continuous ground-level video from satellite imagery is a challenging task with significant potential for applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view images, often relying on auxiliary inputs like height maps or handcrafted projections, and fall short in producing temporally consistent sequences. In this paper, we propose SatDreamer360, a novel framework that generates geometrically and temporally consistent ground-view video from a single satellite image and a predefined trajectory. To bridge the large viewpoint gap, we introduce a compact tri-plane representation that encodes scene geometry directly from the satellite image. A ray-based pixel attention mechanism retrieves view-dependent features from the tri-plane, enabling accurate cross-view correspondence without requiring additional geometric priors. To ensure multi-frame consistency, we propose an epipolar-constrained temporal attention module that aligns features across frames using the known relative poses along the trajectory. To support evaluation, we introduce VIGOR++, a large-scale dataset for cross-view video generation, with dense trajectory annotations and high-quality ground-view sequences. Extensive experiments demonstrate that SatDreamer360 achieves superior performance in fidelity, coherence, and geometric alignment across diverse urban scenes.
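The epipolar-constrained temporal attention can also be sketched in a few lines. This is a hedged illustration under standard multi-view geometry, not the paper's code: with known relative pose (R, t) between two frames, a pixel in the current frame is restricted to attend only to features along its epipolar line in the previous frame. The intrinsics `K`, the sampling count `n`, and the single-head dot-product attention are illustrative assumptions.

```python
import numpy as np

def skew(v):
    # Cross-product matrix [v]_x such that skew(v) @ u == np.cross(v, u).
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental(K, R, t):
    # Fundamental matrix from relative pose: F = K^-T [t]_x R K^-1.
    Kinv = np.linalg.inv(K)
    return Kinv.T @ skew(t) @ R @ Kinv

def epipolar_attention(F, px, feat_prev, query, n=32):
    # px: homogeneous pixel (3,) in the current frame.
    # feat_prev: (H, W, C) feature map of the previous frame.
    H, W, C = feat_prev.shape
    a, b, c = F @ px  # epipolar line a*x + b*y + c = 0 in the previous frame
    xs = np.linspace(0, W - 1, n)
    if abs(b) > 1e-6:
        ys = -(a * xs + c) / b          # general line
    else:
        xs = np.full(n, -c / a)         # vertical line
        ys = np.linspace(0, H - 1, n)
    keep = (xs >= 0) & (xs <= W - 1) & (ys >= 0) & (ys <= H - 1)
    feats = feat_prev[ys[keep].astype(int), xs[keep].astype(int)]  # (M, C)
    logits = feats @ query / np.sqrt(C)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ feats  # pose-consistent temporal feature, (C,)
```

Constraining attention to the epipolar line is what ties temporal feature matching to the known camera trajectory: candidate correspondences that violate the relative pose are never scored at all, which is how inter-frame motion consistency is enforced by construction.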