VIMD: Monocular Visual-Inertial Motion and Depth Estimation

📅 2025-09-23
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
To address the low accuracy of dense metric depth estimation and the severe scale drift of monocular visual-inertial SLAM, this paper proposes VIMD, a framework that leverages MSCKF-based visual-inertial odometry to provide sparse (10–20 per frame), metric-scale 3D points. These points guide a deep-learning-based, pixel-wise iterative scale refinement mechanism, eliminating the reliance on global affine scale fitting. The modular architecture supports integration with diverse depth estimation backbones. Evaluated on the TartanAir, VOID, and AR Table benchmarks, VIMD achieves state-of-the-art accuracy and strong robustness, exhibits zero-shot generalization to unseen environments, and maintains high computational efficiency, making it suitable for real-time 3D perception in resource-constrained robotic and extended reality (XR) systems.
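
To make the per-pixel scaling idea concrete, here is a minimal NumPy sketch that converts a relative depth map into metric depth using only a handful of metric anchor points. It is a hand-crafted, classical stand-in for the learned iterative refinement described above, not the paper's implementation: the function name, the Gaussian-weighted interpolation, and the `sigma` parameter are illustrative assumptions.

```python
import numpy as np

def pixelwise_scale_map(d_rel, anchors, sigma=40.0):
    """Turn a relative depth map into metric depth via a per-pixel scale
    field interpolated from sparse metric anchors (u, v, z_metric).

    Gaussian-weighted interpolation is a hand-crafted stand-in for the
    learned iterative refinement described in the summary above."""
    h, w = d_rel.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    num = np.zeros((h, w))
    den = np.zeros((h, w))
    for u, v, z in anchors:  # typically only 10-20 VIO points per frame
        s = z / max(float(d_rel[v, u]), 1e-6)  # local metric/relative ratio
        wgt = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
        num += wgt * s
        den += wgt
    scale = num / np.maximum(den, 1e-9)  # every pixel gets its own scale
    return scale * d_rel                 # dense metric depth estimate
```

Because each pixel receives its own scale, spatially varying errors in the backbone's relative depth can be corrected locally, which a single global affine fit cannot do.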

📝 Abstract
Accurate and efficient dense metric depth estimation is crucial for 3D visual perception in robotics and XR. In this paper, we develop a monocular visual-inertial motion and depth (VIMD) learning framework that estimates dense metric depth by leveraging accurate and efficient MSCKF-based monocular visual-inertial motion tracking. At the core of the proposed VIMD is the exploitation of multi-view information to iteratively refine per-pixel scale, instead of globally fitting an invariant affine model as in prior work. The VIMD framework is highly modular, making it compatible with a variety of existing depth estimation backbones. We conduct extensive evaluations on the TartanAir and VOID datasets and demonstrate its zero-shot generalization capabilities on the AR Table dataset. Our results show that VIMD achieves exceptional accuracy and robustness even with extremely sparse inputs, as few as 10–20 metric depth points per image. This makes the proposed VIMD a practical solution for deployment in resource-constrained settings, while its robust performance and strong generalization capabilities offer significant potential across a wide range of scenarios.
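
For contrast, the "invariant affine model" baseline that the abstract argues against amounts to a single least-squares fit of one scale and one shift over all sparse points, applied uniformly to the whole image. A minimal sketch follows; the function name and interface are assumptions, not from the paper.

```python
import numpy as np

def fit_global_affine(d_rel, anchors):
    """Fit one (scale, shift) pair mapping relative depth to metric depth
    over all sparse anchors (u, v, z_metric), then apply it image-wide."""
    u = np.array([a[0] for a in anchors])
    v = np.array([a[1] for a in anchors])
    z = np.array([a[2] for a in anchors], dtype=float)
    x = d_rel[v, u]                            # relative depth at anchors
    A = np.stack([x, np.ones_like(x)], axis=1) # design matrix [d_rel, 1]
    (scale, shift), *_ = np.linalg.lstsq(A, z, rcond=None)
    return scale * d_rel + shift               # one affine model per image
```
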
Problem

Research questions and friction points this paper is trying to address.

Estimating dense metric depth from monocular visual-inertial data
Refining per-pixel scale iteratively using multi-view information (see the geometric sketch after this list)
Achieving accurate depth estimation with sparse metric depth points
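
As referenced in the second item above, the multi-view cue can be illustrated with basic two-view geometry: given camera intrinsics and a VIO-estimated relative pose, a pixel's current metric depth predicts where it lands in another view, and the depth ratio there signals whether the local scale is consistent. The function below is an illustrative sketch under these assumptions, not the paper's refinement network.

```python
import numpy as np

def scale_residual(depth_a, depth_b, K, R_ab, t_ab, u, v):
    """Back-project pixel (u, v) of frame A with its current metric depth,
    transform it into frame B with the VIO-estimated pose (R_ab, t_ab),
    and compare the predicted depth against frame B's depth map.

    A ratio near 1.0 means the local scale at (u, v) is multi-view
    consistent; deviations are the signal an iterative refiner can use."""
    p_a = depth_a[v, u] * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    p_b = R_ab @ p_a + t_ab                  # point expressed in frame B
    uv = (K @ p_b)[:2] / p_b[2]              # reproject into B's image
    ub, vb = int(round(uv[0])), int(round(uv[1]))
    h, w = depth_b.shape
    if p_b[2] <= 0 or not (0 <= ub < w and 0 <= vb < h):
        return None                          # behind camera or out of view
    return float(depth_b[vb, ub]) / float(p_b[2])
```
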
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monocular visual-inertial framework for depth estimation
Iteratively refines per-pixel scale using multi-view information
Highly modular and compatible with various backbones (interface sketch after this list)
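
The modularity claim in the last item can be pictured as a thin, backbone-agnostic interface. The sketch below is hypothetical: the `DepthBackbone` protocol and `predict_relative_depth` method are invented names, and a deliberately crude median-scale step stands in for the paper's per-pixel refinement.

```python
from typing import Protocol
import numpy as np

class DepthBackbone(Protocol):
    """Hypothetical plug-in interface: any monocular network that outputs
    an up-to-scale (relative) depth map fits this shape."""
    def predict_relative_depth(self, image: np.ndarray) -> np.ndarray: ...

def estimate_metric_depth(backbone: DepthBackbone,
                          image: np.ndarray,
                          anchors: list[tuple[int, int, float]]) -> np.ndarray:
    """Backbone-agnostic wrapper: the backbone supplies relative depth,
    sparse VIO anchors (u, v, z_metric) supply metric scale. The median
    ratio below is a crude placeholder for per-pixel refinement."""
    d_rel = backbone.predict_relative_depth(image)
    s = np.median([z / max(float(d_rel[v, u]), 1e-6) for u, v, z in anchors])
    return s * d_rel
```
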
Saimouli Katragadda
University of Delaware
Robotics · Computer Vision · VINS
Guoquan Huang
Robot Perception and Navigation Group (RPNG), University of Delaware, Newark, DE 19716