🤖 AI Summary
This work addresses the limitations of existing panoramic stitching methods, which rely on pairwise feature matching and often fail to maintain multi-view geometric consistency in scenes with weak textures, large parallax, or repetitive patterns, leading to misalignment and distortion. To overcome these challenges, the authors propose a photogrammetry-driven global alignment framework that leverages estimated camera poses to align images in 3D space. They introduce a 3D-aware Transformer architecture that explicitly models multi-view geometric consistency through joint feature optimization and cross-view information aggregation. Key contributions include the first formulation of multi-view consistency for stitching in 3D space, a Transformer-based 3D-aware stitching network, and the first large-scale real-world panoramic stitching dataset. Experiments show that the proposed method significantly outperforms state-of-the-art approaches in both alignment accuracy and visual quality, with notably stronger robustness and consistency in challenging scenarios.
📝 Abstract
Prior panorama stitching approaches rely heavily on pairwise feature correspondences and cannot exploit geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Since multi-view geometric correspondences can be constructed directly in 3D space, where they are more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We design a novel Transformer-based architecture that achieves 3D awareness and aggregates global information across all views: it directly uses camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we construct a large-scale dataset of real-world scenes. Extensive experiments show that our method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.
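To make the pose-guided warping idea concrete, here is a minimal sketch of the classical building block it rests on: for panoramic capture with a (near-)fixed camera center, a view related to the reference by pure rotation `R` maps onto the reference image plane through the homography `H = K R^T K^{-1}`. This is a standard result from multi-view geometry, not the paper's exact formulation; the function names, the intrinsics `K`, and the toy values below are illustrative assumptions.

```python
import numpy as np

def rotation_homography(K, R):
    """Homography induced by a pure camera rotation.

    K : 3x3 camera intrinsics matrix (assumed shared by both views).
    R : 3x3 rotation of the source view relative to the reference view.
    Returns H mapping source pixels into the reference image plane,
    H = K @ R.T @ K^{-1}. Hypothetical sketch, not the paper's method.
    """
    return K @ R.T @ np.linalg.inv(K)

def warp_point(H, x, y):
    """Apply homography H to pixel (x, y) in homogeneous coordinates."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

# Toy intrinsics (focal length 800 px, principal point at 320, 240).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Sanity check: an identity rotation leaves every pixel fixed.
H = rotation_homography(K, np.eye(3))
print(warp_point(H, 100.0, 50.0))  # ≈ (100.0, 50.0)
```

In a full pipeline, such per-view warps would come from globally estimated camera poses (e.g. via structure-from-motion) rather than from pairwise matches, which is what lets alignment stay consistent across all views simultaneously.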