🤖 AI Summary
Neural implicit (NeRF) and explicit (3D Gaussian Splatting, 3DGS) representations in RGB-D SLAM have limited robustness because they lack sufficient scene understanding. To address this, the paper proposes DINO-SLAM, which combines DINO visual features with a hierarchical Scene Structure Encoder (SSE) to build a representation enriched in both semantics and geometry. The SSE turns DINO features into Enhanced DINO (EDINO) features that capture hierarchical scene elements and their structural relationships, and these features are integrated into two SLAM paradigms, one NeRF-based and one 3DGS-based. Evaluated on the Replica, ScanNet, and TUM datasets, DINO-SLAM improves pose estimation accuracy and map completeness and outperforms state-of-the-art SLAM methods.
📝 Abstract
This paper presents DINO-SLAM, a DINO-informed design strategy that enhances neural implicit (Neural Radiance Field, NeRF) and explicit (3D Gaussian Splatting, 3DGS) representations in SLAM systems through more comprehensive scene representations. To this end, we rely on a Scene Structure Encoder (SSE) that enriches DINO features into Enhanced DINO (EDINO) features, capturing hierarchical scene elements and their structural relationships. Building on this encoder, we propose two foundational paradigms for NeRF and 3DGS SLAM systems that integrate EDINO features. Our DINO-informed pipelines achieve superior performance on the Replica, ScanNet, and TUM datasets compared to state-of-the-art methods.