🤖 AI Summary
To address the metric scale ambiguity and lower accuracy of monocular SLAM relative to RGB-D SLAM, this paper proposes DropD-SLAM, a real-time monocular SLAM system that requires no depth sensor. The method integrates three pretrained deep learning components: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed via dilated instance masks, while the predicted metric depth assigns scale to the remaining static keypoints. The back end adopts an unmodified RGB-D SLAM framework, balancing accuracy and efficiency. Evaluated on the TUM RGB-D benchmark, the system achieves a mean absolute trajectory error (ATE) of 7.4 cm on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D SLAM methods. It runs at 22 FPS on a single GPU, demonstrating both high accuracy and real-time performance.
📝 Abstract
We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real-time sources of metric scale, marking a step toward simpler and more cost-effective SLAM systems.
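The front-end step described above (drop keypoints on dilated dynamic-object masks, then lift the static survivors to metric 3D with predicted depth) can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names, the square structuring element, and the 10-pixel dilation radius are assumptions, and a real system would use an optimized dilation (e.g. OpenCV) rather than the shifted-copy trick shown here.

```python
import numpy as np

def dilate_mask(mask, r):
    """Binary dilation with a (2r+1)x(2r+1) square element.

    Sketch only: uses shifted copies via np.roll, which wraps at
    image borders (harmless for interior masks, wrong at the edges).
    """
    out = mask.copy()
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def backproject_static_keypoints(keypoints, depth, dyn_mask, K, dilate_px=10):
    """Suppress keypoints on the (dilated) dynamic mask, then backproject
    the rest into metric 3D using the predicted depth map.

    keypoints : iterable of (u, v) pixel coordinates
    depth     : HxW metric depth map from the monocular estimator
    dyn_mask  : HxW bool mask of dynamic-object instances
    K         : 3x3 camera intrinsics
    Returns (kept 2D keypoints, Nx3 metric 3D points).
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    mask = dilate_mask(dyn_mask.astype(bool), dilate_px)
    kept, pts3d = [], []
    for (u, v) in keypoints:
        ui, vi = int(round(u)), int(round(v))
        if mask[vi, ui]:
            continue  # lies on a dynamic object: drop it
        z = depth[vi, ui]
        if z <= 0:
            continue  # invalid depth prediction
        # pinhole backprojection: pixel + depth -> metric 3D point
        pts3d.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
        kept.append((u, v))
    return np.asarray(kept), np.asarray(pts3d)
```

The resulting metrically scaled 3D features can then be handed to an unmodified RGB-D SLAM back end exactly as if they had come from a depth sensor.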