🤖 AI Summary
To address low matching accuracy in weak-texture and reflective regions, the limited representational capacity of Transformers caused by attention-matrix low-rankness and quadratic complexity, insufficient focus on salient matching points, and slow inference, this paper proposes an efficient and robust stereo matching method. We introduce a Hadamard-product attention mechanism that reduces computational complexity from quadratic to linear; design a Dense Attention Kernel (DAK) to sharpen the contrast between relevant and irrelevant feature responses and mitigate low-rank degradation; propose a Multi-scale Kernel-Oriented Interaction (MKOI) module that restores spatial and channel interaction by interleaving large- and small-kernel convolutions; and adopt a recurrent Transformer architecture. On the reflective-region subset of the KITTI 2012 benchmark, our method ranked first among published methods at the time of submission, significantly improving matching accuracy in weak-texture and highly reflective scenes while maintaining real-time efficiency.
📝 Abstract
In light of advances in Transformer technology, recent research has proposed stereo Transformers as a solution to the binocular stereo matching challenge. However, constrained by the low-rank bottleneck and quadratic complexity of attention mechanisms, stereo Transformers still fail to demonstrate sufficient nonlinear expressiveness within a reasonable inference time. The lack of focus on key homonymous points leaves the representations of such methods vulnerable to challenging conditions, including reflections and weak textures. Furthermore, slow inference speed hinders practical application. To overcome these difficulties, we present the **H**adamard **A**ttention Recurrent Stereo **T**ransformer (HART), which incorporates the following components: 1) For faster inference, we present a Hadamard-product paradigm for the attention mechanism, achieving linear computational complexity. 2) We design a Dense Attention Kernel (DAK) to amplify the differences between relevant and irrelevant feature responses, allowing HART to focus on important details. DAK also converts zero elements to non-zero elements to mitigate the reduced expressiveness caused by the low-rank bottleneck. 3) To compensate for the spatial and channel interaction missing from the Hadamard product, we propose MKOI, which captures both global and local information through the interleaving of large- and small-kernel convolutions. Experimental results demonstrate the effectiveness of HART: in reflective areas, HART ranked **1st** on the KITTI 2012 benchmark among all published methods at the time of submission. Code is available at https://github.com/ZYangChen/HART.
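To make the complexity claim concrete, the sketch below contrasts standard softmax attention, whose score matrix is N×N and therefore O(N²) in sequence length, with an element-wise (Hadamard-product) variant that is O(N). This is a minimal NumPy illustration under assumptions: `np.exp` merely stands in for a dense kernel that maps zero responses to non-zero values, and the paper's actual DAK, MKOI, and recurrent structure are not reproduced here.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: forms an N x N score matrix, O(N^2)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (N, d)

def hadamard_attention(Q, K, V):
    """Hadamard-product paradigm (sketch): the N x N score matrix is
    replaced by element-wise products, so cost grows linearly in N.
    np.exp is a placeholder dense kernel (not the paper's DAK) that
    keeps zero elements from collapsing the response."""
    return np.exp(Q * K) * V                         # (N, d), O(N * d)

# Toy shapes: N tokens of dimension d per "view".
N, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, N, d))

out_soft = softmax_attention(Q, K, V)
out_had = hadamard_attention(Q, K, V)
print(out_soft.shape, out_had.shape)  # (8, 4) (8, 4)
```

Note that the Hadamard variant never materializes a pairwise token matrix, which is the source of the linear-complexity claim; the trade-off, as the abstract states, is lost spatial/channel interaction, which the paper's MKOI module is designed to restore.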