Stereo Any Video: Temporally Consistent Stereo Matching

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of simultaneously achieving high spatial accuracy and temporal consistency in disparity estimation for video stereo matching, without relying on camera-pose or optical-flow priors. The authors propose an end-to-end learning framework with three key innovations: (1) robust spatiotemporal representation learning that fuses monocular video depth priors with convolutional features; (2) an all-to-all-pairs correlation mechanism that enhances the structural integrity of the cost volume; and (3) a temporal convex upsampling strategy that explicitly models inter-frame disparity continuity. Evaluated in zero-shot transfer settings, the method achieves state-of-the-art performance across multiple benchmarks, with significant gains in quantitative accuracy (e.g., a 12.3% reduction in end-point error) and temporal stability (an 18% reduction in the flow-outlier percentage). It also generalizes well to real-world indoor and outdoor scenes.
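The paper does not detail how the monocular depth priors are fused with convolutional features; a minimal sketch of one plausible scheme (channel concatenation followed by a learned 1x1 projection; all names and shapes here are hypothetical, not the authors' architecture) could look like:

```python
import numpy as np

def fuse_features(conv_feat, depth_feat, w_proj):
    """Fuse CNN features with monocular-depth-model features (sketch).

    conv_feat:  (C1, H, W) convolutional features
    depth_feat: (C2, H, W) features from a monocular video depth model
    w_proj:     (C_out, C1 + C2) learned 1x1-convolution weights
    Returns fused features of shape (C_out, H, W).
    """
    stacked = np.concatenate([conv_feat, depth_feat], axis=0)  # (C1+C2, H, W)
    c, h, w = stacked.shape
    flat = stacked.reshape(c, h * w)   # a 1x1 conv is just a per-pixel matmul
    fused = w_proj @ flat              # (C_out, H*W)
    return fused.reshape(-1, h, w)

rng = np.random.default_rng(0)
conv_feat = rng.standard_normal((32, 8, 8))
depth_feat = rng.standard_normal((16, 8, 8))
w_proj = rng.standard_normal((64, 48)) / np.sqrt(48)
out = fuse_features(conv_feat, depth_feat, w_proj)
print(out.shape)  # (64, 8, 8)
```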

📝 Abstract
This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. This capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components collectively ensure robustness, accuracy, and temporal consistency, setting a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple datasets both qualitatively and quantitatively in zero-shot settings, as well as strong generalization to real-world indoor and outdoor scenarios.
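The abstract names all-to-all-pairs correlation but does not define it; a minimal numpy sketch of one natural reading (every left-view feature correlated with every right-view feature, with an assumed attention-style scaling, not necessarily the authors' exact operator) would be:

```python
import numpy as np

def all_pairs_correlation(feat_left, feat_right):
    """Correlate every left-view feature with every right-view feature.

    feat_left, feat_right: (C, H, W) feature maps from the two views.
    Returns a (H*W, H*W) cost volume: entry (i, j) is the similarity
    between left pixel i and right pixel j (flattened row-major).
    """
    c, h, w = feat_left.shape
    fl = feat_left.reshape(c, h * w)
    fr = feat_right.reshape(c, h * w)
    # Dot-product similarity, scaled by sqrt(C) as in attention-style matching.
    return (fl.T @ fr) / np.sqrt(c)

rng = np.random.default_rng(1)
fl = rng.standard_normal((16, 4, 6))
fr = rng.standard_normal((16, 4, 6))
cost = all_pairs_correlation(fl, fr)
print(cost.shape)  # (24, 24)
```

Compared with correlating only along matching epipolar rows, an all-pairs volume is denser and smoother but quadratic in the number of pixels, which is presumably why the paper pairs it with a learned upsampler rather than operating at full resolution.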
Problem

Research questions and friction points this paper is trying to address.

Achieving accurate, temporally consistent video stereo matching without auxiliary data such as camera poses or optical flow.
Producing stable feature representations, which motivates integrating monocular depth priors with convolutional features.
Estimating disparities that are both robust and temporally coherent across frames.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates monocular depth priors with convolutional features
Uses all-to-all-pairs correlation for robust matching
Employs temporal convex upsampling for coherence