Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth

📅 2026-05-19
🤖 AI Summary
This work addresses a limitation of how monocular depth estimation methods are currently evaluated: standard benchmarks rely on dense ground-truth depth, which is expensive to acquire and restricts the scenes that can be covered, and their metrics poorly reflect performance in downstream geometric tasks. To overcome this, we propose Depth2Pose, a task-driven evaluation paradigm that uses relative camera pose estimation as a proxy task and requires only camera pose annotations to assess depth quality. Our framework combines monocular depth prediction, feature matching, and a depth-aware geometric solver into a pose-based evaluation pipeline, with reference poses obtained from Structure-from-Motion (SfM). We further introduce the D2P dataset, comprising challenging out-of-distribution scenes that expose the limited generalization of existing approaches, and release an extensible evaluation framework to facilitate future research.
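To make the idea of a depth-aware solver concrete, the following is a minimal sketch, not the solver used in Depth2Pose. It assumes matched keypoints with a predicted depth at each keypoint, lifts them to 3D, and recovers the relative pose with a closed-form similarity alignment (Umeyama), which also absorbs an unknown relative scale between the two depth maps. The function names and the absence of any robust estimation (e.g. RANSAC) are simplifying assumptions for illustration.

```python
import numpy as np

def backproject(kpts, depth, K):
    """Lift pixel keypoints (N, 2) to 3D camera-frame points using per-keypoint depth (N,)."""
    homog = np.hstack([kpts, np.ones((kpts.shape[0], 1))])
    rays = homog @ np.linalg.inv(K).T  # viewing rays with z = 1
    return rays * depth[:, None]

def align_similarity(X1, X2):
    """Umeyama alignment: find s, R, t minimizing sum ||s * R @ x1 + t - x2||^2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    A, B = X1 - mu1, X2 - mu2
    cov = B.T @ A / len(X1)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    var1 = (A ** 2).sum() / len(X1)
    s = np.trace(np.diag(S) @ D) / var1
    t = mu2 - s * R @ mu1
    return s, R, t

def relative_pose_from_depth(kpts1, depth1, kpts2, depth2, K1, K2):
    """Hypothetical helper: relative pose from matches plus predicted depths (no outlier handling)."""
    P1 = backproject(kpts1, depth1, K1)
    P2 = backproject(kpts2, depth2, K2)
    return align_similarity(P1, P2)
```

With noise-free matches and depths that are correct up to a global scale per image, this recovers the exact rotation and the translation in the units of the second depth map. Real correspondences contain outliers, so a practical pipeline would wrap such a solver in a robust estimator.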
📝 Abstract
Monocular depth estimation has improved significantly in recent years, driven by increasingly powerful models and large-scale training data. Predicted depth is increasingly used as an input signal for downstream tasks such as Structure-from-Motion (SfM), visual localization, and SLAM. However, monocular depth estimators (MDEs) are still primarily evaluated in terms of depth accuracy. Standard metrics aggregate errors globally and may not reflect the usefulness of depth for downstream geometric tasks. We therefore propose Depth2Pose, a framework for evaluating MDEs in the context of downstream tasks. By combining depth predictions with feature correspondences in depth-aware geometric solvers, we use relative camera pose estimation accuracy as a task-driven proxy for depth quality. Traditional benchmarks require dense ground truth in the form of per-pixel depth, which is expensive to obtain. In contrast, our formulation requires only camera poses, which can be estimated efficiently, e.g., using Structure-from-Motion pipelines. As a result, our framework can be applied to scenes where ground-truth depth is difficult to obtain, for example due to large scene scale or heavy occlusions (e.g., vegetated environments). Leveraging this, we introduce the D2P dataset, which contains challenging scenes outside the distribution of commonly used training data. We show that methods performing well under standard depth error metrics on existing benchmarks also perform well under our pose-based metric when evaluated on the same datasets, but do not necessarily generalize to our more challenging dataset. Finally, we provide a simple and extensible evaluation framework. The dataset and code are available at kocurvik.github.io/depth2pose.
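As an illustration of how relative pose accuracy can be scored, the sketch below computes angular rotation and translation errors against a reference pose and summarizes them with an area-under-the-curve (AUC) score. These are common choices in the relative-pose literature; the abstract does not state which exact metric Depth2Pose reports, so the function names and the AUC formulation here are assumptions.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Angle of the residual rotation R_est^T R_gt, in degrees."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_deg(t_est, t_gt):
    """Angle between translation directions, in degrees (insensitive to scale)."""
    cos = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_auc(errors, threshold_deg):
    """Area under the recall-vs-error curve up to threshold_deg, normalized to [0, 1]."""
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    errors = np.concatenate([[0.0], errors])
    recall = np.concatenate([[0.0], recall])
    last = np.searchsorted(errors, threshold_deg)
    # Cut the curve at the threshold and hold the last recall value flat up to it.
    x = np.concatenate([errors[:last], [threshold_deg]])
    y = np.concatenate([recall[:last], [recall[last - 1]]])
    area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
    return area / threshold_deg
```

A per-pair pose error is often taken as the maximum of the rotation and translation errors, and `pose_auc` is then evaluated over all image pairs at a few thresholds such as 5, 10, and 20 degrees.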
Problem

Research questions and friction points this paper is trying to address.

monocular depth estimation
downstream tasks
evaluation benchmark
ground-truth depth
camera pose
Innovation

Methods, ideas, or system contributions that make the work stand out.

monocular depth estimation
pose-based evaluation
depth-aware geometric solvers
Structure-from-Motion
task-driven benchmark
Viktor Kocur
Assistant Professor, Comenius University
computer vision, 3D vision, deep learning
Sithu Aung
Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague
Gabrielle Flood
Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague
Yaqing Ding
Czech Technical University in Prague
Computer Vision
Lukas Bujnak
Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava
Torsten Sattler
Senior Researcher, Czech Technical University in Prague
Computer Vision, Robotics, Mixed Reality, Visual Localization, Applied Machine Learning
Zuzana Kukelova
Assistant Professor, Czech Technical University in Prague
Computer vision, Minimal problems, Algebraic geometry