Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

📅 2025-12-11
🤖 AI Summary
Existing stereo foundation models generalize well zero-shot but are too computationally expensive for real-time deployment; lightweight models are efficient but generalize poorly and require domain-specific fine-tuning. This paper proposes Fast-FoundationStereo, which it presents as the first stereo matching approach to combine strong zero-shot generalization with real-time frame rates. It uses a divide-and-conquer acceleration framework with three parts: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search to find cost filtering designs under latency budgets; and (3) structured pruning to remove redundancy in the iterative refinement module. The authors also build an automatic pseudo-labeling pipeline and use it to curate 1.4 million in-the-wild stereo pairs that supplement the synthetic training data. Experiments show that the method closely matches FoundationStereo's zero-shot accuracy while running over 10× faster, setting a new state of the art among real-time methods.
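
To make the distillation step concrete, here is a minimal sketch of feature-level knowledge distillation in plain Python. It is not the paper's training code: the mean-squared-error objective, the tiny linear student, and all names (`student_forward`, `distill_loss`, `distill_step`) are assumptions chosen only to show the idea of a student being trained to reproduce a frozen teacher's features.

```python
def student_forward(weights, x):
    # Hypothetical linear "student": one output feature per weight row.
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]


def distill_loss(teacher_feats, student_feats):
    # Assumed objective: mean squared error between teacher and student features.
    n = len(teacher_feats)
    return sum((t - s) ** 2 for t, s in zip(teacher_feats, student_feats)) / n


def distill_step(weights, x, teacher_feats, lr=0.05):
    # One gradient-descent step on the MSE loss; the teacher stays frozen.
    student_feats = student_forward(weights, x)
    n = len(teacher_feats)
    new_weights = []
    for row, t, s in zip(weights, teacher_feats, student_feats):
        grad_scale = 2.0 * (s - t) / n  # d(loss) / d(student feature)
        new_weights.append([w - lr * grad_scale * x_i for w, x_i in zip(row, x)])
    return new_weights


# Toy data: one input and the (made-up) features a teacher produced for it.
x = [1.0, 2.0]
teacher_feats = [0.5, -1.0]
weights = [[0.0, 0.0], [0.0, 0.0]]

initial_loss = distill_loss(teacher_feats, student_forward(weights, x))
for _ in range(200):
    weights = distill_step(weights, x, teacher_feats)
final_loss = distill_loss(teacher_feats, student_forward(weights, x))
```

In this toy setup the loss shrinks toward zero as the student's outputs approach the teacher's; a real pipeline would distill over many stereo pairs and much larger networks.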

📝 Abstract
Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo/
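
One way to read the claim that blockwise search reduces complexity exponentially is the following sketch. It assumes B blocks with K candidate designs each, so that a joint search covers K^B architectures while evaluating candidates block by block needs only B·K evaluations. The candidate tables, latency and error numbers, the additive error proxy, and the brute-force combination step are all hypothetical illustrations, not the paper's procedure.

```python
from itertools import product


def joint_search_size(num_blocks, candidates_per_block):
    # Every combination of per-block choices is a distinct architecture.
    return candidates_per_block ** num_blocks


def blockwise_search_size(num_blocks, candidates_per_block):
    # Each candidate is evaluated once, in isolation, for its own block.
    return num_blocks * candidates_per_block


def select_under_budget(blocks, latency_budget_ms):
    # blocks: per-block lists of (name, latency_ms, error_proxy) tuples.
    # Assumption: latency and error add up across blocks, so combining the
    # per-block measurements is a cheap table lookup, not a training run.
    best = None
    for combo in product(*blocks):
        latency = sum(c[1] for c in combo)
        error = sum(c[2] for c in combo)
        if latency <= latency_budget_ms and (best is None or error < best[1]):
            best = ([c[0] for c in combo], error, latency)
    return best


# Made-up candidate tables for three blocks.
blocks = [
    [("a0", 4.0, 0.10), ("a1", 2.0, 0.14), ("a2", 1.0, 0.25)],
    [("b0", 5.0, 0.08), ("b1", 3.0, 0.09), ("b2", 1.5, 0.20)],
    [("c0", 3.0, 0.05), ("c1", 1.0, 0.07)],
]
choice = select_under_budget(blocks, latency_budget_ms=7.0)
print(joint_search_size(6, 8), blockwise_search_size(6, 8))
print(choice)
```

With 6 blocks and 8 candidates per block, the joint space has 262144 architectures while the blockwise evaluation count is 48. Changing `latency_budget_ms` yields a different selection, which fits the abstract's description of a family of architectures at different speed and accuracy trade-offs.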

Problem

Research questions and friction points this paper is trying to address.

Achieving real-time zero-shot stereo matching
Bridging efficiency and robustness in stereo models
Reducing computational cost without sacrificing generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge distillation compresses the hybrid backbone into a single efficient student
Blockwise neural architecture search finds cost filtering designs under latency budgets
Structured pruning eliminates redundancy in the iterative refinement module
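
As an illustration of the last item, structured pruning removes whole channels rather than individual weights, so the pruned layers stay dense and actually run faster. The sketch below assumes an L1-magnitude importance score and a simple two-layer setup; the listing does not say how the paper scores redundancy, so the criterion and the names (`channel_l1_norms`, `prune_channels`) are hypothetical.

```python
def channel_l1_norms(weight):
    # weight: list of output channels, each a list of input weights.
    return [sum(abs(w) for w in channel) for channel in weight]


def prune_channels(weight, next_weight, keep):
    # Keep the `keep` output channels with the largest L1 norm (assumed
    # importance score) and drop the matching input columns of the next
    # layer so that the two layers remain shape-compatible.
    norms = channel_l1_norms(weight)
    ranked = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original channel order
    pruned_weight = [weight[i] for i in kept]
    pruned_next = [[row[i] for i in kept] for row in next_weight]
    return pruned_weight, pruned_next, kept


# Toy layer with 4 output channels, followed by a layer that consumes them.
layer1 = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [-0.5, 0.4, 0.6],
    [0.0, 0.03, 0.0],
]
layer2 = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
]
pruned1, pruned2, kept = prune_channels(layer1, layer2, keep=2)
```

Here channels 0 and 2 survive, and `layer2` shrinks from four input columns to two. In practice pruning is usually followed by retraining to recover accuracy.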