AI Summary
This work addresses geometric inconsistency in zero-shot panoramic depth estimation by proposing the first framework built on the VGGT foundation model, reframing depth estimation as a multi-view 3D reprojection task. The approach introduces three components: uncertainty-guided adaptive projection, structure-saliency enhanced attention, and correlation-weighted 3D model correction. Together they enable end-to-end inference from panorama to 3D geometry and back to depth, all without any training. This design unifies local per-view reasoning with global geometric consistency. Extensive experiments show that the method outperforms both trained and training-free alternatives across diverse indoor and outdoor datasets and multiple resolutions, in both geometric coherence and depth accuracy.
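To make the "3D geometry and back to depth" step concrete, the following is a minimal sketch of reprojecting a reconstructed point cloud onto an equirectangular depth map. It is an illustration under stated assumptions, not the authors' implementation: the function name `reproject_to_panorama`, the axis convention (y up, z forward), and the nearest-point-per-pixel rule are all hypothetical choices made for this example.

```python
import math

def reproject_to_panorama(points, width, height):
    """Project 3D points onto an equirectangular depth map.

    Assumptions (not taken from the paper): points are (x, y, z) tuples in
    the panorama camera frame with y up and z forward; depth is the
    Euclidean range; the nearest point wins when several hit one pixel.
    """
    depth = [[math.inf] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the camera centre has no direction
        lon = math.atan2(x, z)   # longitude in [-pi, pi]
        lat = math.asin(y / r)   # latitude in [-pi/2, pi/2]
        # Map longitude to a column (wrapping at the seam) and latitude to a row.
        u = int((lon / (2 * math.pi) + 0.5) * width) % width
        v = min(int((0.5 - lat / math.pi) * height), height - 1)
        if r < depth[v][u]:
            depth[v][u] = r
    return depth

# Two points along the forward axis and one straight up.
pano = reproject_to_panorama([(0, 0, 5), (0, 0, 2), (0, 3, 0)], width=8, height=4)
```

Pixels that receive no point stay at `math.inf`; a real pipeline would need some hole-filling or splatting strategy, which this sketch leaves out.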
Abstract
This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
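As an illustration of module (iii), the sketch below fuses several candidate 3D points that overlap at the same location, weighting each by a correlation score. The abstract only states that overlapping points are reweighted using attention-inferred correlation scores; the softmax normalization, the `temperature` parameter, and the function name `fuse_overlapping_points` are assumptions introduced here for illustration.

```python
import math

def fuse_overlapping_points(candidates, scores, temperature=1.0):
    """Fuse overlapping 3D point candidates with score-based weights.

    Assumptions (not taken from the paper): `candidates` are (x, y, z)
    tuples from different views, `scores` are per-candidate correlation
    scores, and weights come from a softmax over the scores.
    """
    if not candidates or len(candidates) != len(scores):
        raise ValueError("need one score per candidate and at least one candidate")
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted average of each coordinate.
    fused = tuple(
        sum(w * p[i] for w, p in zip(weights, candidates)) for i in range(3)
    )
    return fused, weights

# Equal scores reduce to a plain average of the candidates.
fused, weights = fuse_overlapping_points([(0, 0, 2), (0, 0, 4)], [1.0, 1.0])
```

With equal scores the result is a plain average; as one score dominates, the fused point approaches that view's estimate, which is the behaviour a confidence-style reweighting is meant to give.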