🤖 AI Summary
This work addresses a bottleneck in using monocular depth estimation (MDE) for relative pose estimation: the scale and shift ambiguities inherent in MDE outputs. We propose three solvers that explicitly model and correct independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated cameras. A hybrid estimation pipeline combines these solvers with classic point-based solvers and epipolar constraints. The affine correction benefits not only relative depth priors but also so-called “metric” depth outputs. Experiments on multiple datasets show large improvements over classic keypoint-based and PnP-based baselines in both calibrated and uncalibrated setups. The method works with different feature matchers and MDE models, and its performance improves consistently as matching accuracy and depth prediction quality improve.
📝 Abstract
Monocular depth estimation (MDE) models have advanced significantly in recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively underexplored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity of monocular depth priors make it challenging in practice to improve upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling benefits not only the relative depth priors but also, surprisingly, the "metric" ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent advances in both modules. Code is available at https://github.com/MarkYu98/madpose.
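To make the scale-and-shift ambiguity concrete, the following is a minimal sketch and not the paper's solvers. It assumes a reference depth is available (which the pose-estimation setting does not provide) and uses synthetic data; the function name `fit_scale_shift` and all values are hypothetical. It shows that a scale-only correction leaves a residual whenever the predicted depth also carries a shift, while an affine fit removes it.

```python
import numpy as np

def fit_scale_shift(pred, ref):
    """Least-squares (a, b) minimising ||a * pred + b - ref||^2.

    Illustrative helper only; not part of the madpose code base.
    """
    A = np.stack([pred, np.ones_like(pred)], axis=1)  # columns: [pred, 1]
    sol = np.linalg.lstsq(A, ref, rcond=None)[0]
    return float(sol[0]), float(sol[1])

# Synthetic reference depths and an affine-distorted "prediction".
rng = np.random.default_rng(0)
ref = rng.uniform(1.0, 10.0, size=200)
pred = (ref - 0.5) / 2.0  # i.e. ref = 2.0 * pred + 0.5

# Affine (scale and shift) correction.
a, b = fit_scale_shift(pred, ref)
affine_err = float(np.abs(a * pred + b - ref).max())

# Scale-only correction ignores the shift and cannot fit exactly.
s = float(pred @ ref / (pred @ pred))
scale_only_err = float(np.abs(s * pred - ref).max())

print(f"scale={a:.3f}, shift={b:.3f}")
print(f"affine max error:     {affine_err:.2e}")
print(f"scale-only max error: {scale_only_err:.2e}")
```

In the setting the abstract describes, no reference depth exists, so the affine parameters have to be estimated together with the relative pose from correspondences; the sketch only illustrates why modeling a shift in addition to a scale can matter.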