🤖 AI Summary
This paper addresses the challenge of simultaneously achieving high spatial accuracy and temporal consistency in disparity estimation for video stereo matching, without relying on camera poses or optical flow priors. We propose an end-to-end learning framework with three key innovations: (1) robust spatiotemporal representation learning that fuses monocular video depth priors with convolutional features; (2) an all-to-all-pairs correlation mechanism that enhances the structural integrity of the matching cost volume; and (3) a temporal convex upsampling strategy that explicitly models inter-frame disparity continuity. Evaluated under zero-shot transfer settings, our method achieves state-of-the-art performance across multiple benchmarks, with significant gains in quantitative accuracy (e.g., a 12.3% reduction in end-point error) and temporal stability (an 18% reduction in the flow outlier percentage). It also generalizes effectively to real-world indoor and outdoor scenes.
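The all-to-all-pairs correlation can be pictured as a dense cost volume in which every left-image position is scored against every candidate position on the corresponding right-image scanline, rather than only a bounded disparity range. Below is a minimal sketch under that assumption; the function name `all_pairs_correlation` and the scanline-restricted formulation are illustrative choices, not the paper's implementation.

```python
import torch

def all_pairs_correlation(feat_left: torch.Tensor, feat_right: torch.Tensor) -> torch.Tensor:
    """Dense all-pairs matching costs for rectified stereo features.

    feat_left, feat_right: (B, C, H, W) feature maps from a shared encoder.
    Returns a cost volume of shape (B, H, W, W), where entry [b, h, i, j] is the
    scaled dot-product similarity between left column i and right column j on row h.
    """
    b, c, h, w = feat_left.shape
    fl = feat_left.permute(0, 2, 3, 1).reshape(b * h, w, c)   # (B*H, W, C)
    fr = feat_right.permute(0, 2, 1, 3).reshape(b * h, c, w)  # (B*H, C, W)
    corr = torch.bmm(fl, fr) / c ** 0.5                       # all-pairs dot products per scanline
    return corr.view(b, h, w, w)

# Toy usage: two video frames (stacked in the batch) with 1/4-resolution features.
corr = all_pairs_correlation(torch.randn(2, 64, 48, 96), torch.randn(2, 64, 48, 96))
print(corr.shape)  # torch.Size([2, 48, 96, 96])
```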
📝 Abstract
This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It estimates spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. This capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, two key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components together ensure robustness, accuracy, and temporal consistency, setting a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance, both qualitatively and quantitatively, across multiple datasets in zero-shot settings, and that it generalizes strongly to real-world indoor and outdoor scenarios.
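Temporal convex upsampling can be read as a RAFT-style convex upsampler whose combination weights also range over neighbouring frames, so each full-resolution disparity is a convex blend of coarse values drawn from a small spatiotemporal window. The sketch below illustrates that idea; `temporal_convex_upsample`, the 3-frame window, and the weight layout are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_convex_upsample(disp_seq: torch.Tensor, weights: torch.Tensor, factor: int = 8) -> torch.Tensor:
    """Upsample the centre frame's low-res disparity as a convex combination of
    3x3 spatial neighbourhoods taken from every frame in a short temporal window.

    disp_seq: (B, T, 1, H, W) low-resolution disparities for T adjacent frames.
    weights:  (B, T*9, factor*factor, H, W) network-predicted combination weights.
    Returns:  (B, 1, factor*H, factor*W) upsampled disparity for the centre frame.
    """
    b, t, _, h, w = disp_seq.shape
    # Candidate values: 3x3 neighbourhoods from each frame -> (B, T*9, 1, H, W)
    patches = F.unfold(disp_seq.reshape(b * t, 1, h, w), kernel_size=3, padding=1)
    patches = patches.view(b, t * 9, 1, h, w)
    # Convex weights: softmax over all T*9 spatiotemporal neighbours
    weights = torch.softmax(weights, dim=1)
    up = (weights * patches).sum(dim=1)                       # (B, factor*factor, H, W)
    # Rearrange the factor*factor predictions into a full-resolution grid
    up = up.view(b, 1, factor, factor, h, w)
    up = up.permute(0, 1, 4, 2, 5, 3).reshape(b, 1, factor * h, factor * w)
    return factor * up  # scale disparity magnitudes with the resolution change

# Toy usage: a 3-frame window at 1/8 resolution with randomly generated weights.
d = torch.rand(1, 3, 1, 40, 90)
w_pred = torch.randn(1, 3 * 9, 64, 40, 90)  # in practice predicted by the network
print(temporal_convex_upsample(d, w_pred).shape)  # torch.Size([1, 1, 320, 720])
```

Because every output value is a convex combination of nearby coarse disparities from adjacent frames, the upsampler cannot introduce values outside the local spatiotemporal range, which is one way to encourage inter-frame continuity.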