🤖 AI Summary
Addressing the challenge of simultaneously achieving model lightweighting and zero-shot generalization in stereo matching, this paper proposes the first ultra-lightweight stereo depth estimation framework. Methodologically, we design a compact yet expressive backbone network, introduce a hybrid cost aggregation module, and establish a three-stage, million-scale training strategy (simulated, then synthetic, then real data) to enhance domain robustness. We empirically demonstrate that an ultra-light model with only 0.5M parameters and less than 1% of the FLOPs of state-of-the-art (SOTA) methods can achieve superior cross-domain generalization. The framework attains SOTA performance on the synthetic SceneFlow benchmark and on the major real-world benchmarks KITTI, ETH3D, and Middlebury, matching or even surpassing heavy non-prior-based models in accuracy. This result overturns the conventional trade-off between model efficiency and generalization capability.
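The summary above refers to cost aggregation over a matching cost volume, the core operation in learned stereo matching. As a generic illustration only (the paper's actual hybrid aggregation module is not detailed here, and all names and shapes below are illustrative assumptions), the following sketch builds a correlation cost volume from left/right feature maps and reads out disparity with a naive winner-take-all:

```python
import numpy as np

def correlation_cost_volume(feat_l, feat_r, max_disp):
    """Correlation cost volume: for each candidate disparity d, correlate
    left features with right features shifted by d pixels.
    feat_l, feat_r: (C, H, W) feature maps -> cost of shape (max_disp, H, W)."""
    C, H, W = feat_l.shape
    cost = np.zeros((max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        if d == 0:
            cost[0] = (feat_l * feat_r).mean(axis=0)
        else:
            # Left pixel x matches right pixel x - d; columns x < d stay zero.
            cost[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :-d]).mean(axis=0)
    return cost

def winner_take_all(cost):
    # Pick, per pixel, the disparity with the highest correlation score.
    # Stereo networks instead aggregate/regularize the volume before readout,
    # which is where a module like the paper's hybrid aggregation would sit.
    return cost.argmax(axis=0)
```

With distinctive per-pixel features, the winner-take-all readout recovers the true shift; in practice the raw volume is noisy in textureless regions, which is why learned aggregation matters.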
📝 Abstract
Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot generalization due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding that of state-of-the-art non-prior-based accurate methods while requiring less than 1% of their computational cost, setting a new standard for efficient stereo matching.