🤖 AI Summary
Visual SLAM suffers from degraded accuracy, scale ambiguity, and global inconsistency in challenging environments such as low-texture and low-illumination scenes. To address these issues, this paper proposes a tightly coupled multi-sensor framework that jointly fuses feed-forward neural-network-based pointmap regression, IMU measurements, and GNSS observations. Crucially, Sim(3) visual-alignment constraints are integrated into an SE(3) factor graph, enabling hierarchical optimization over both sliding windows and global loop closures. This work is the first to co-optimize deep geometric priors with heterogeneous sensor measurements under a unified Hessian-form constraint formulation, significantly improving scale consistency and global mapping accuracy. Evaluated on public and self-collected datasets, the system demonstrates superior accuracy, robustness, and consistency compared to state-of-the-art vision-centric multi-sensor SLAM approaches. The source code will be made publicly available.
📝 Abstract
Visual SLAM is a cornerstone technique in robotics, autonomous driving, and extended reality (XR), yet classical systems often struggle with low-texture environments, scale ambiguity, and degraded performance under challenging visual conditions. Recent advancements in feed-forward neural-network-based pointmap regression have demonstrated the potential to recover high-fidelity 3D scene geometry directly from images, leveraging learned spatial priors to overcome limitations of traditional multi-view geometry methods. However, the widely validated advantages of probabilistic multi-sensor information fusion are often discarded in these pipelines. In this work, we propose MASt3R-Fusion, a multi-sensor-assisted visual SLAM framework that tightly integrates feed-forward pointmap regression with complementary sensor information, including inertial measurements and GNSS data. The system introduces Sim(3)-based visual-alignment constraints (in the Hessian form) into a universal metric-scale SE(3) factor graph for effective information fusion. A hierarchical factor graph design is developed, which allows both real-time sliding-window optimization and global optimization with aggressive loop closures, enabling real-time pose tracking, metric-scale structure perception and globally consistent mapping. We evaluate our approach on both public benchmarks and self-collected datasets, demonstrating substantial improvements in accuracy and robustness over existing visual-centered multi-sensor SLAM systems. The code will be released open-source to support reproducibility and further research (https://github.com/GREAT-WHU/MASt3R-Fusion).
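To make the Sim(3) visual-alignment idea concrete: a Sim(3) transform maps points by p' = s·R·p + t, where the scale s is exactly what monocular pointmap regression leaves ambiguous and what IMU/GNSS fusion must resolve. The sketch below uses the classical closed-form Umeyama method to estimate such a transform between two point clouds with numpy; it is a self-contained illustration of Sim(3) alignment only, not the paper's Hessian-form factor or its graph optimization, and all names are illustrative.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares Sim(3) estimate (scale s, rotation R, translation t)
    such that dst ~= s * R @ src + t, for (N, 3) corresponding points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                     # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (xs ** 2).sum() / len(src)           # source variance
    s = np.trace(np.diag(D) @ S) / var_src         # optimal scale factor
    t = mu_d - s * R @ mu_s
    return s, R, t

# Synthetic check: recover a known scale/rotation/translation exactly.
rng = np.random.default_rng(0)
P = rng.standard_normal((100, 3))                  # "source" point cloud
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, -2.0, 0.5])
Q = 2.5 * P @ R_true.T + t_true                    # scaled + rotated + shifted
s, R, t = umeyama_sim3(P, Q)
print(round(s, 3))                                 # -> 2.5
```

In a factor-graph setting such as the one described above, the residual of this alignment (with its Hessian) would enter as a constraint between keyframe poses, while the scale degree of freedom is pinned down by the metric sensors.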