🤖 AI Summary
This work addresses two core challenges in generating ground-level street-view video from a single satellite image: geometric distortion and temporal incoherence. To this end, we propose an end-to-end framework that eliminates reliance on explicit elevation maps or hand-crafted geometric projections. Methodologically, we introduce a compact tri-plane scene-geometry representation coupled with ray-based pixel attention for robust cross-view geometric modeling; an epipolar-constrained temporal attention module that explicitly enforces inter-frame motion consistency; and a diffusion-based architecture that unifies the entire generation process. Evaluated on our newly constructed VIGOR++ dataset, our approach achieves significant improvements in geometric alignment accuracy, temporal coherence, and visual fidelity, enabling high-quality, long-sequence street-view video synthesis even in complex urban environments.
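To make the tri-plane plus ray-based pixel attention idea concrete, here is a minimal NumPy sketch. It is an illustrative assumption, not the paper's implementation: the plane names (`xy`, `xz`, `yz`), nearest-neighbor sampling, additive plane fusion, and dot-product attention are all simplifications chosen for clarity (a real system would use learned planes, bilinear sampling, and multi-head attention).

```python
import numpy as np

def sample_plane(plane, u, v):
    # Nearest-neighbor lookup on one feature plane; u, v in [0, 1].
    # (A real implementation would use bilinear interpolation.)
    H, W, C = plane.shape
    iu = np.clip((u * (W - 1)).astype(int), 0, W - 1)
    iv = np.clip((v * (H - 1)).astype(int), 0, H - 1)
    return plane[iv, iu]  # (..., C)

def triplane_features(planes, pts):
    # pts: (N, 3) points in the unit cube; feature = sum over the
    # three axis-aligned plane projections.
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    return (sample_plane(planes["xy"], x, y)
            + sample_plane(planes["xz"], x, z)
            + sample_plane(planes["yz"], y, z))  # (N, C)

def ray_pixel_attention(planes, origin, direction, query, n_samples=16):
    # For one ground-view pixel: march points along its ray, gather
    # tri-plane features, and softmax-attend with the pixel's query.
    t = np.linspace(0.05, 0.95, n_samples)[:, None]
    pts = np.clip(origin[None] + t * direction[None], 0.0, 1.0)
    feats = triplane_features(planes, pts)             # (S, C)
    logits = feats @ query / np.sqrt(query.shape[0])   # (S,)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ feats  # aggregated view-dependent feature, (C,)
```

The key point the sketch captures is that no height map is needed: the ray itself selects which parts of the scene volume each pixel attends to, and the attention weights play the role of a soft depth estimate.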
📝 Abstract
Generating continuous ground-level video from satellite imagery is a challenging task with significant potential for applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view images, often relying on auxiliary inputs like height maps or handcrafted projections, and fall short in producing temporally consistent sequences. In this paper, we propose SatDreamer360, a novel framework that generates geometrically and temporally consistent ground-view video from a single satellite image and a predefined trajectory. To bridge the large viewpoint gap, we introduce a compact tri-plane representation that encodes scene geometry directly from the satellite image. A ray-based pixel attention mechanism retrieves view-dependent features from the tri-plane, enabling accurate cross-view correspondence without requiring additional geometric priors. To ensure multi-frame consistency, we propose an epipolar-constrained temporal attention module that aligns features across frames using the known relative poses along the trajectory. To support evaluation, we introduce VIGOR++, a large-scale dataset for cross-view video generation, with dense trajectory annotations and high-quality ground-view sequences. Extensive experiments demonstrate that SatDreamer360 achieves superior performance in fidelity, coherence, and geometric alignment across diverse urban scenes.
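The epipolar-constrained temporal attention can also be sketched in a few lines. This is a hedged illustration under standard multi-view geometry, not the paper's code: with known relative pose (R, t) between two frames, a pixel in the current frame is restricted to attend only to features along its epipolar line in the previous frame. The intrinsics `K`, the sampling count `n`, and the single-head dot-product attention are illustrative assumptions.

```python
import numpy as np

def skew(v):
    # Cross-product matrix [v]_x such that skew(v) @ u == np.cross(v, u).
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental(K, R, t):
    # Fundamental matrix from relative pose: F = K^-T [t]_x R K^-1.
    Kinv = np.linalg.inv(K)
    return Kinv.T @ skew(t) @ R @ Kinv

def epipolar_attention(F, px, feat_prev, query, n=32):
    # px: homogeneous pixel (3,) in the current frame.
    # feat_prev: (H, W, C) feature map of the previous frame.
    H, W, C = feat_prev.shape
    a, b, c = F @ px  # epipolar line a*x + b*y + c = 0 in the previous frame
    xs = np.linspace(0, W - 1, n)
    if abs(b) > 1e-6:
        ys = -(a * xs + c) / b          # general line
    else:
        xs = np.full(n, -c / a)         # vertical line
        ys = np.linspace(0, H - 1, n)
    keep = (xs >= 0) & (xs <= W - 1) & (ys >= 0) & (ys <= H - 1)
    feats = feat_prev[ys[keep].astype(int), xs[keep].astype(int)]  # (M, C)
    logits = feats @ query / np.sqrt(C)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ feats  # pose-consistent temporal feature, (C,)
```

Constraining attention to the epipolar line is what ties temporal feature matching to the known camera trajectory: candidate correspondences that violate the relative pose are never scored at all, which is how inter-frame motion consistency is enforced by construction.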