🤖 AI Summary
Existing stereo vision foundation models exhibit strong zero-shot generalization but suffer from high computational cost, hindering real-time deployment; lightweight models, while efficient, lack generalization and require domain-specific fine-tuning. This paper proposes the first stereo matching method that simultaneously achieves zero-shot robustness and real-time performance. The approach introduces a divide-and-conquer acceleration framework comprising: (1) knowledge distillation to compress a hybrid backbone network; (2) block-level neural architecture search to automatically optimize the cost aggregation module; and (3) structured pruning to streamline the iterative refinement module. The authors further construct a large-scale dataset of 1.4 million pseudo-labeled in-the-wild stereo pairs, generated via automated pseudo-labeling and synthetic augmentation. Experiments show that the method retains near-identical zero-shot accuracy compared to FoundationStereo while achieving a 10× speedup, reaching real-time frame rates and establishing new state-of-the-art performance among real-time methods.
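The distillation component in step (1) trains a compact student under a mix of ground-truth supervision and teacher pseudo-labels. A minimal sketch of such a blended objective, assuming an L1 loss on per-pixel disparities (the function name, the weighting scheme, and the loss form here are illustrative assumptions, not the paper's exact formulation):

```python
def distillation_loss(student, teacher, ground_truth, alpha=0.5):
    """Blend ground-truth supervision with teacher pseudo-labels.

    student, teacher, ground_truth: per-pixel disparity values (flat lists).
    alpha: weight on the ground-truth term; (1 - alpha) goes to the teacher.
    Assumed form for illustration; the paper's actual loss may differ.
    """
    n = len(student)
    supervised = sum(abs(s - g) for s, g in zip(student, ground_truth)) / n
    distilled = sum(abs(s - t) for s, t in zip(student, teacher)) / n
    return alpha * supervised + (1 - alpha) * distilled

# Toy example: two pixels of predicted disparity.
loss = distillation_loss([2.0, 4.0], teacher=[2.5, 3.5], ground_truth=[1.0, 5.0])
print(loss)  # 0.75
```

On unlabeled in-the-wild pairs (where no ground truth exists), only the teacher term would apply, which is exactly where the 1.4M pseudo-labeled dataset comes in.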
📝 Abstract
Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rates. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model runs over 10× faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo/