Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

📅 2026-04-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limited robustness of image-driven stereo matching in zero-shot generalization from synthetic to real-world scenes, particularly in occluded, textureless, repetitive, and non-Lambertian (specular or transparent) regions, by using surface normals as a domain-invariant geometric prior. The authors propose GREATEN, which combines a Gated Contextual-Geometric Fusion (GCGF) module and a Specular-Transparent Augmentation (STA) strategy with three sparse attention designs: Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attention. Trained only on synthetic data, GREATEN-IGEV reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster benchmark, and 14.1% on KITTI-2015, relative to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. It also runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.

📝 Abstract
Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, three sparse attention designs, Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attention, preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster benchmark, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
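
The gating idea behind GCGF can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the module's equations, so the gate form (a sigmoid over a linear map of the concatenated features), the additive fusion, and all names (`gated_fusion`, `gate_weights`, `gate_bias`) are assumptions chosen purely for illustration. The sketch works on a single pixel's feature vectors in plain Python.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def gated_fusion(context, geometry, gate_weights, gate_bias):
    """Hypothetical gated fusion for one pixel (assumed form, not from the paper).

    context:      image-derived feature vector, length C
    geometry:     normal-derived feature vector, length C
    gate_weights: C x 2C matrix applied to the concatenation [context; geometry]
    gate_bias:    bias vector, length C
    Returns (fused, gate), both of length C.
    """
    joint = context + geometry  # list concatenation: [context; geometry]
    logits = matvec(gate_weights, joint)
    # Each gate entry lies in (0, 1): near 0 suppresses a context channel,
    # near 1 passes it through unchanged.
    gate = [sigmoid(z + b) for z, b in zip(logits, gate_bias)]
    filtered = [g * c for g, c in zip(gate, context)]
    # Assumed fusion rule: add the geometric features to the filtered context.
    fused = [f + n for f, n in zip(filtered, geometry)]
    return fused, gate


if __name__ == "__main__":
    ctx = [1.0, -2.0]
    geo = [0.5, 0.25]
    zero_weights = [[0.0] * 4 for _ in range(2)]
    fused, gate = gated_fusion(ctx, geo, zero_weights, [0.0, 0.0])
    print("gate:", gate)
    print("fused:", fused)
```

With zero weights and zero bias every gate entry is 0.5, so half of each context channel is kept. A strongly negative bias drives the gate toward 0 and the fused vector toward the geometric features alone, which mirrors the stated goal of suppressing unreliable image cues in favor of normal-driven geometry. In a trained network the gate parameters would be learned; here they are set by hand.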
Problem

Research questions and friction points this paper is trying to address.

Stereo Matching
Synthetic-to-Real Generalization
Domain Shift
Non-Lambertian Surfaces
Zero-Shot Transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

surface normals
Syn-to-Real generalization
gated fusion
sparse attention
non-Lambertian regions
Authors
Jiahao Li
Department of Computer Science, City University of Hong Kong, Hong Kong, China
Xinhong Chen
City University of Hong Kong
Natural Language Processing, Causality Mining, Autonomous Driving, Machine Learning
Zhengmin Jiang
Department of Computer Science, City University of Hong Kong, Hong Kong, China
Cheng Huang
Department of Computer Science, Southern Methodist University, Dallas, TX, 75205, USA
Yung-Hui Li
Hon Hai Research Institute
Jianping Wang
Fellow of IEEE, Fellow of AAIA, Chair Professor, City University of Hong Kong
Autonomous Driving, Edge Computing, Cloud Computing, Networking