🤖 AI Summary
Existing learning-based stereo matching models exhibit poor zero-shot generalization under adverse weather conditions, primarily due to the scarcity of real-world degraded stereo data and insufficient feature discriminability. To address this, we propose a robust stereo matching framework for zero-shot domain adaptation. First, we introduce a diffusion-model-driven stereo-consistent weather simulation framework that synthesizes physically plausible, structurally aligned stereo image pairs under rain, fog, and snow. Second, we design a hybrid ConvNet-Transformer robust encoder that jointly leverages local detail modeling and global denoising capabilities, enhancing invariance to noise, low contrast, and blur. Experiments demonstrate substantial improvements in disparity estimation accuracy across unseen adverse weather conditions. Our method achieves state-of-the-art robustness in depth estimation, outperforming existing approaches on multiple benchmarks. This work provides a reliable zero-shot stereo perception solution for safety-critical applications such as autonomous driving.
📝 Abstract
Learning-based stereo matching models struggle in adverse weather conditions due to the scarcity of corresponding training data and the challenges in extracting discriminative features from degraded images. These limitations significantly hinder zero-shot generalization to out-of-distribution weather conditions. In this paper, we propose **RobuSTereo**, a novel framework that enhances the zero-shot generalization of stereo matching models under adverse weather by addressing both data scarcity and feature extraction challenges. First, we introduce a diffusion-based simulation pipeline with a stereo consistency module, which generates high-quality stereo data tailored for adverse conditions. By training stereo matching models on our synthetic datasets, we reduce the domain gap between clean and degraded images, significantly improving the models' robustness to unseen weather conditions. The stereo consistency module ensures structural alignment across synthesized image pairs, preserving geometric integrity and enhancing depth estimation accuracy. Second, we design a robust feature encoder that combines a specialized ConvNet with a denoising transformer to extract stable and reliable features from degraded images. The ConvNet captures fine-grained local structures, while the denoising transformer refines global representations, effectively mitigating the impact of noise, low visibility, and weather-induced distortions. This enables more accurate disparity estimation even under challenging visual conditions. Extensive experiments demonstrate that **RobuSTereo** significantly improves the robustness and generalization of stereo matching models across diverse adverse weather scenarios.
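The hybrid encoder described above can be illustrated with a minimal NumPy sketch: a convolutional branch extracts local edge-like features, the feature map is split into patch tokens, and a single self-attention pass refines them globally. All function names, the edge kernel, and the patch size here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def conv2d(x, k):
    """Valid 2D convolution of a single-channel image x with kernel k
    (stand-in for the ConvNet branch capturing local structure)."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def self_attention(tokens):
    """Single-head self-attention over patch tokens (stand-in for the
    denoising transformer's global refinement)."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ tokens  # each token becomes a weighted mix of all tokens

def hybrid_encode(img, patch=4):
    """Local conv features -> patch tokens -> global attention refinement."""
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    local = conv2d(img, sobel_x)
    # Crop so the feature map tiles evenly into non-overlapping patches.
    H = local.shape[0] - local.shape[0] % patch
    W = local.shape[1] - local.shape[1] % patch
    tokens = local[:H, :W].reshape(H // patch, patch, W // patch, patch)
    tokens = tokens.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    return self_attention(tokens)

# A 20x20 image yields an 18x18 conv map, cropped to 16x16 -> 16 tokens.
feats = hybrid_encode(np.random.default_rng(0).random((20, 20)))
print(feats.shape)  # (16, 16)
```

In the actual framework the two branches are learned jointly; this sketch only shows the data flow in which local convolutional features feed a global attention stage.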