Feed-Forward Gaussian Splatting from Sparse Aerial Views

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the ghosting artifacts, melted facades, and stretched textures that commonly arise when urban scenes are reconstructed from sparse aerial views. It proposes AnyCity, an observation-grounded framework that first predicts an observation-supported geometry latent and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update, all within a single feed-forward pass at inference time. The gated residual selectively updates weakly constrained regions, while the observation-anchored latent keeps observed geometry separate from prior-generated content. During training, dense-to-sparse distillation, an aerial-adapted video diffusion prior, and observation-preserving objectives supervise the model, which decodes its refined representation into a 3D Gaussian scene. AnyCity shows consistent improvements over feed-forward baselines on synthetic, aerial-domain, UAV-textured, and real-world scenes, achieving coherent novel-view synthesis with second-level inference.

📝 Abstract

Reconstructing large-scale urban scenes from sparse aerial views is a crucial yet challenging task. Due to biased top-down and shallow-oblique camera poses, sparse aerial captures exhibit strong evidence imbalance: roofs and open regions are repeatedly observed, while facades, distant buildings, and occluded structures receive little multi-view support. Existing feed-forward 3D Gaussian Splatting methods directly regress a deterministic representation from sparse inputs, but this often leads to ghosting, melted facades, and stretched textures. Recent pseudo-view and video-based generative reconstruction methods use additional supervision or generative priors. However, they often lack a clear separation between observed geometry and prior-driven content, which can lead to plausible but inconsistent structures. We propose AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes. AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior provides fine-grained urban appearance cues through gated token conditioning. Observation-preserving objectives keep the refined representation consistent with input-supported geometry. At inference time, AnyCity reconstructs the final 3D Gaussian scene from sparse aerial views in a single feed-forward pass, achieving coherent urban novel-view synthesis with second-level inference. Experiments on synthetic, aerial-domain, UAV-textured, and real-world scenes show consistent improvements over feed-forward baselines.
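
The abstract does not give the exact form of the gated residual update or of the observation-preserving objectives, so the following is only a minimal NumPy sketch of one way such a mechanism could look. Every name here (`gated_residual_update`, `observation_preserving_penalty`, the per-element `support` score in [0, 1], and the choice of a sigmoid gate scaled by `1 - support`) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_residual_update(z_obs, residual, gate_logits, support):
    """Hypothetical gated residual update on a geometry latent.

    z_obs:       (N, D) observation-supported latent (assumed shape).
    residual:    (N, D) update predicted from completion tokens (assumed).
    gate_logits: (N,)   raw gate scores (assumed).
    support:     (N,)   multi-view support in [0, 1]; 1 = well observed (assumed).
    """
    # Assumption: the gate is suppressed where observations are strong,
    # so well-supported elements stay close to the observed latent.
    gate = sigmoid(gate_logits) * (1.0 - support)
    z_refined = z_obs + gate[:, None] * residual
    return z_refined, gate


def observation_preserving_penalty(z_refined, z_obs, support):
    """Hypothetical penalty: support-weighted squared drift from z_obs."""
    sq_dist = np.sum((z_refined - z_obs) ** 2, axis=1)
    return float(np.sum(support * sq_dist) / max(np.sum(support), 1e-8))


# Toy example: four latent elements, from fully observed to unobserved.
rng = np.random.default_rng(0)
z_obs = rng.normal(size=(4, 8))
residual = rng.normal(size=(4, 8))
gate_logits = np.zeros(4)
support = np.array([1.0, 0.9, 0.1, 0.0])

z_refined, gate = gated_residual_update(z_obs, residual, gate_logits, support)
penalty = observation_preserving_penalty(z_refined, z_obs, support)
```

In this toy setup the fully observed element (support 1.0) is left unchanged, while the unobserved element (support 0.0) receives the largest share of the residual. The penalty only charges drift in proportion to support, which is one simple way to express "keep the refined representation consistent with input-supported geometry".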

Problem

Research questions and friction points this paper is trying to address.

sparse aerial views

urban scene reconstruction

3D Gaussian Splatting

observation imbalance

generative reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian Splatting

Sparse Aerial Reconstruction

Generative Prior