🤖 AI Summary
This work addresses the failure of monocular priors in panoramic stereo matching caused by data scarcity and spherical distortion. To overcome these challenges, the authors construct a large-scale synthetic panoramic stereo dataset and propose a monocular normal estimator operating in a heading-aligned coordinate system, which provides geometric priors under zero-shot settings. By leveraging heading-aligned normal priors, the method achieves cross-view consistency and robustness to distortion, effectively mitigating performance degradation due to mismatches between training and testing fields of view. Experiments demonstrate that the model outperforms existing approaches on out-of-domain datasets and successfully generalizes to real-world consumer-grade panoramic camera systems. The code and dataset will be publicly released.
📝 Abstract
Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical distortions. To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct a high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator that operates in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV mismatches. Extensive experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. Both the model and the dataset will be open-sourced.
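The claim that a top-bottom arrangement yields vertically aligned epipolar lines can be checked with a small geometric sketch. The code below is not taken from the paper: the axis convention (x right, y up, z forward), the pixel mapping, and the function names `project_equirect` and `top_bottom_projections` are all assumptions chosen for illustration. It projects one 3D point into two equirectangular cameras separated only along the vertical axis and shows that the point lands in the same image column in both views.

```python
import math

def project_equirect(point, width, height):
    """Project a 3D point to equirectangular pixel coordinates.

    Assumed convention (not from the paper): x right, y up, z forward,
    longitude 0 at the image center, latitude +90 degrees at the top row.
    """
    x, y, z = point
    lon = math.atan2(x, z)                 # azimuth around the vertical axis
    lat = math.atan2(y, math.hypot(x, z))  # elevation above the horizon
    u = (lon / (2 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v

def top_bottom_projections(point, baseline, width, height):
    """Project a point into a top camera at the origin and a bottom camera
    placed `baseline` units directly below it (hypothetical rig layout)."""
    x, y, z = point
    top = project_equirect((x, y, z), width, height)
    # In the bottom camera's frame the point sits `baseline` higher.
    bottom = project_equirect((x, y + baseline, z), width, height)
    return top, bottom

if __name__ == "__main__":
    (u_top, v_top), (u_bot, v_bot) = top_bottom_projections(
        (1.0, 0.5, 2.0), baseline=0.2, width=2048, height=1024
    )
    print("column difference:", abs(u_top - u_bot))
    print("row shift (top - bottom):", v_top - v_bot)
```

Because a purely vertical offset leaves x and z unchanged, the azimuth is identical in both views, so the correspondence search reduces to a 1D search along an image column. This is the property that lets perspective stereo architectures be reused on equirectangular pairs.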