SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings

📅 2026-01-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses scale drift in monocular SLAM over long sequences, which arises because locally optimized windows lack global constraints tying them together. The authors propose SCE-SLAM, an end-to-end system built on scene coordinate embeddings: learned patch-level representations that encode 3D geometric relationships under a canonical scale reference. Geometry-guided aggregation uses geometry-modulated attention over spatially nearby historical observations to propagate scale information across windows, and scene coordinate bundle adjustment anchors current estimates to the reference scale through 3D coordinate constraints decoded from the embeddings. Evaluated on KITTI, Waymo, and vKITTI, the method reduces absolute trajectory error by 8.36 m on KITTI relative to the best prior approach while running at 36 FPS.
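To make the link between scale drift and absolute trajectory error (ATE) concrete, here is a minimal sketch. It is not the paper's evaluation code: it assumes the estimated and ground-truth trajectories are already aligned in rotation and translation and fits only a single global scale factor. The helper names `best_scale` and `ate_rmse` and the toy trajectories are hypothetical.

```python
import math

def best_scale(est, gt):
    """Least-squares scale s minimizing sum ||s * est_i - gt_i||^2."""
    num = sum(e * g for pe, pg in zip(est, gt) for e, g in zip(pe, pg))
    den = sum(e * e for pe in est for e in pe)
    return num / den

def ate_rmse(est, gt, scale=1.0):
    """Root-mean-square position error after scaling the estimate."""
    sq_errors = [
        sum((scale * e - g) ** 2 for e, g in zip(pe, pg))
        for pe, pg in zip(est, gt)
    ]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

# Toy ground truth: a straight line with 1 m steps (assumed data).
gt = [(float(i), 0.0, 0.0) for i in range(11)]

# Case 1: a constant scale error. One global scale factor repairs it.
uniform = [(0.5 * x, y, z) for x, y, z in gt]

# Case 2: scale drift. The estimated step length grows over time,
# so no single scale factor fits the whole trajectory.
drifting = []
x = 0.0
for i in range(11):
    drifting.append((x, 0.0, 0.0))
    x += 0.5 * (1.0 + 0.1 * i)

ate_uniform = ate_rmse(uniform, gt, best_scale(uniform, gt))
ate_drifting = ate_rmse(drifting, gt, best_scale(drifting, gt))
print(f"ATE, constant scale error: {ate_uniform:.3f} m")
print(f"ATE, drifting scale:       {ate_drifting:.3f} m")
```

In this toy setup the constant-scale trajectory aligns to zero error, while the drifting one keeps a residual error after alignment. That residual is the kind of error a globally consistent scale is meant to remove.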

📝 Abstract

Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.
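The abstract does not give a formula for geometry-modulated attention, so the following is only one plausible reading, sketched under stated assumptions: standard scaled dot-product attention whose logits are penalized by the squared 3D distance between the current patch and each historical patch. The Gaussian-style bias, the `sigma` parameter, and the function name `geometry_modulated_attention` are all hypothetical and may differ from the authors' design.

```python
import math

def geometry_modulated_attention(queries, keys, values, q_xyz, k_xyz, sigma=1.0):
    """Attend from current patches (queries) to historical patches (keys).

    Assumed logit: (q . k) / sqrt(d) - ||p_q - p_k||^2 / (2 * sigma^2),
    so historical patches that are close in 3D receive more weight.
    """
    d = len(queries[0])
    outputs = []
    for q, pq in zip(queries, q_xyz):
        logits = []
        for k, pk in zip(keys, k_xyz):
            similarity = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            dist_sq = sum((a - b) ** 2 for a, b in zip(pq, pk))
            logits.append(similarity - dist_sq / (2.0 * sigma ** 2))
        # Numerically stable softmax over the historical patches.
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the historical value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Two historical patches with identical appearance features but
# different 3D positions; the query patch sits next to the first one.
out_near = geometry_modulated_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[1.0], [0.0]],
    q_xyz=[[0.0, 0.0, 0.0]],
    k_xyz=[[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]],
)

# Same features, but both historical patches are equally distant.
out_equal = geometry_modulated_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[1.0], [0.0]],
    q_xyz=[[0.0, 0.0, 0.0]],
    k_xyz=[[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]],
)
```

With identical appearance features, the output is dominated by the spatially nearby patch in the first call and is an even blend in the second. This illustrates how 3D proximity, rather than appearance alone, could decide which historical observations contribute scale information.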

Problem

Research questions and friction points this paper is trying to address.

scale drift

monocular SLAM

scale consistency

3D reconstruction

visual SLAM

Innovation

Methods, ideas, or system contributions that make the work stand out.

scene coordinate embeddings

scale consistency

monocular SLAM