StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision

📅 2026-03-31
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing stereo vision backbones suffer severe degradation of geometric information during feature extraction, largely because they are pretrained without explicit camera-pose supervision. This work proposes StereoVGGT, which for the first time adapts a frozen pre-trained Visual Geometry Grounded Transformer (VGGT) to stereo matching without any additional training. By incorporating binocular geometric constraints, StereoVGGT reconstructs feature representations so as to preserve and exploit the model's intrinsic camera-calibration and 3D geometric priors. The method achieves state-of-the-art performance on the KITTI benchmark, surpassing all previously published approaches and securing the top rank.
📝 Abstract
Driven by the advancement of 3D devices, stereo vision tasks, including stereo matching and stereo conversion, have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Given that such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints constitutes a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) is a foundation model pretrained with extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that its direct application to stereo vision yields suboptimal performance: VGGT suffers an especially pronounced degradation of geometric details during feature extraction. This characteristic conflicts with the requirements of binocular stereo vision, constraining its efficacy on related tasks. To bridge this gap, we propose StereoVGGT, a feature backbone specifically tailored for stereo vision. By leveraging the frozen VGGT and introducing a training-free feature adjustment pipeline, we mitigate the geometric degradation and harness the latent camera calibration knowledge embedded within the model. A StereoVGGT-based stereo matching network achieves the $1^{st}$ rank among all published methods on the KITTI benchmark, validating StereoVGGT as a highly effective backbone for stereo vision.
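The abstract does not detail the training-free feature adjustment pipeline, but the downstream task it feeds, stereo matching from backbone features, can be sketched with a standard correlation cost volume. This is a generic illustration, not the paper's method; `cost_volume_disparity`, the feature shapes, and the winner-take-all readout are all illustrative assumptions.

```python
import numpy as np

def cost_volume_disparity(feat_l, feat_r, max_disp):
    """Winner-take-all stereo matching over a correlation cost volume.

    feat_l, feat_r: (C, H, W) feature maps for the left/right views,
    e.g. taken from a frozen backbone. For each candidate disparity d,
    the right feature map is shifted so that right-image column x - d
    aligns with left-image column x, and per-pixel correlation is scored.
    Returns an (H, W) integer disparity map.
    """
    C, H, W = feat_l.shape
    cost = np.full((max_disp, H, W), -np.inf)  # invalid (out-of-image) entries stay -inf
    for d in range(max_disp):
        # correlate left features at column x with right features at column x - d
        corr = (feat_l[:, :, d:] * feat_r[:, :, :W - d if d else None]).sum(axis=0)
        cost[d, :, d:] = corr
    return cost.argmax(axis=0)  # pick the best-scoring disparity per pixel
```

In practice the cost volume would be built on learned features and refined by a matching network rather than read out by a hard argmax, but the volume itself is the common interface between a feature backbone and a stereo matcher.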
Problem

Research questions and friction points this paper is trying to address.

stereo vision
camera pose
geometric degradation
feature extraction
visual foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
stereo vision
Visual Geometry Transformer
camera pose priors
feature adjustment