🤖 AI Summary
This work addresses the limited robustness of image-driven stereo matching in zero-shot generalization from synthetic to real-world scenes, particularly in occluded, textureless, repetitive, and non-Lambertian regions, by using surface normals as a domain-invariant geometric prior. The authors propose GREATEN, which combines a Gated Contextual-Geometric Fusion (GCGF) module and a Specular-Transparent Augmentation (STA) strategy with three sparse attention designs: Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attention. Trained only on synthetic data, GREATEN-IGEV reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster benchmark, and 14.1% on KITTI-2015 relative to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. It also runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) stereo matching on Middlebury.
📝 Abstract
Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This generalization gap stems mainly from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, three sparse attention designs, namely Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attention, preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves strong Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
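The gating idea behind GCGF can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the gate's inputs, its parameterization, or how the filtered image features are combined with the geometric features. The function name `gated_fusion`, the linear gate over concatenated features, and the additive fusion are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(context, geometry, w_gate, b_gate):
    """Hypothetical gated fusion of image-context and normal-driven features.

    context  : (H, W, C) image (contextual) features
    geometry : (H, W, C) features derived from surface normals
    w_gate   : (2C, C) gate weights (assumed linear gate)
    b_gate   : (C,) gate bias
    Returns the fused features and the gate, both of shape (H, W, C).
    """
    # Assumption: the gate sees both feature sets at every pixel.
    joint = np.concatenate([context, geometry], axis=-1)  # (H, W, 2C)
    gate = sigmoid(joint @ w_gate + b_gate)                # values in (0, 1)
    # A gate near 0 suppresses the image cue at that pixel and channel,
    # a gate near 1 keeps it. Additive fusion is an assumption here.
    fused = gate * context + geometry
    return fused, gate


rng = np.random.default_rng(0)
ctx = rng.normal(size=(4, 5, 8))
geo = rng.normal(size=(4, 5, 8))
w = rng.normal(size=(16, 8)) * 0.1
fused, gate = gated_fusion(ctx, geo, w, np.zeros(8))
print(fused.shape, float(gate.min()), float(gate.max()))
```

In this sketch, a pixel whose gate saturates near zero falls back on the geometric features alone, which is the behavior one would want in a specular or transparent region where the image texture is misleading.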