🤖 AI Summary
This work addresses the challenge of unifying full SLAM functionality—front-end tracking, incremental mapping, and back-end global optimization—within a single end-to-end learnable architecture. We propose the first neural SLAM framework that integrates all of these components into a unified Transformer model. By serializing monocular video streams as spatiotemporal sequences, the model jointly and iteratively optimizes camera poses and dense depth maps, enabling geometrically consistent, tightly coupled reconstruction. Key contributions include: (i) the first holistic integration of the complete SLAM pipeline within a single Transformer, eliminating the conventional modular design; and (ii) novel mechanisms for incremental feature updating and cross-frame joint pose-depth optimization. Evaluated on multiple standard benchmarks, our method matches or surpasses state-of-the-art dense SLAM approaches in accuracy and robustness, with particularly notable improvements in dynamic scenes and long-duration sequences.
📝 Abstract
We present SLAM-Former, a novel neural approach that integrates full SLAM capabilities into a single transformer. Like traditional SLAM systems, SLAM-Former comprises a frontend and a backend that operate in tandem. The frontend processes sequential monocular images in real time for incremental mapping and tracking, while the backend performs global refinement to ensure a geometrically consistent result. This alternating execution allows the frontend and backend to reinforce one another, enhancing overall system performance. Comprehensive experimental results demonstrate that SLAM-Former achieves superior or highly competitive performance compared to state-of-the-art dense SLAM methods.