🤖 AI Summary
To address the challenges of unobservable scale, unstable initialization, and unreliable loop closure in multi-camera SLAM—stemming from arbitrary camera configurations—this paper proposes the first end-to-end visual odometry framework designed for generic multi-camera setups. Methodologically, it introduces a learning-driven framework that jointly models multi-stream feature extraction and inter-camera rigid-motion constraints, enabling online scale initialization and refinement. The approach integrates learned feature tracking, multi-camera rigid-body motion priors, multi-source feature-map optimization, and multi-view loop closure detection. Evaluated on KITTI-360 and the newly introduced MultiCamData benchmark, the method significantly outperforms existing stereo and multi-camera SLAM systems in pose accuracy, robustness to wide-field-of-view and texture-deprived scenes, and configurational flexibility, requiring no predefined camera geometry. Code and an interactive online demo are publicly available.
📝 Abstract
Making multi-camera visual SLAM systems easier to set up and more robust to the environment has long been a focus of vision-based robotics. Existing monocular and binocular SLAM systems have a narrow FoV and are fragile in textureless environments, suffering degraded accuracy and limited robustness. Multi-camera SLAM systems are therefore gaining attention, as their wide FoV provides redundancy against texture degeneration. However, current multi-camera SLAM systems face heavy data-processing loads and require elaborately designed camera configurations, leading to estimation failures for arbitrarily arranged multi-camera systems. To address these problems, we propose a generic visual odometry framework for arbitrarily arranged multi-camera systems, which achieves metric-scale state estimation with high flexibility in camera arrangement. Specifically, we first design a learning-based feature extraction and tracking framework to offload the processing of multiple video streams from the CPU. Then we use the rigid constraints between cameras to estimate metric-scale poses for robust SLAM initialization. Finally, we fuse the features of the multiple cameras in the SLAM back-end to achieve robust pose estimation and online scale optimization. Additionally, multi-camera features improve loop detection for pose graph optimization. Experiments on the KITTI-360 and MultiCamData datasets validate the robustness of our method under arbitrarily placed cameras. Compared with other stereo and multi-camera visual SLAM systems, our method achieves higher pose estimation accuracy with better generalization ability. Our code and online demos are available at https://github.com/JunhaoWang615/MCVO
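The abstract's key idea of recovering metric scale from inter-camera rigid constraints can be illustrated with a minimal sketch (not the paper's actual implementation): if two cameras on the rig have a known metric baseline, the ratio between that baseline and its up-to-scale estimate from visual odometry recovers the global scale. The function name and averaging scheme below are illustrative assumptions.

```python
import numpy as np

def metric_scale_from_rig(t_est_pairs, baseline_metric):
    """Recover the unknown metric scale of an up-to-scale VO estimate.

    Illustrative sketch, not the paper's implementation: the ratio of the
    known metric inter-camera baseline to its up-to-scale estimate gives
    the global scale factor.

    t_est_pairs     : list of estimated (up-to-scale) inter-camera
                      translation vectors, one per frame.
    baseline_metric : known metric distance between the two cameras.
    """
    norms = np.array([np.linalg.norm(t) for t in t_est_pairs])
    # Average over frames for robustness to per-frame estimation noise.
    return baseline_metric / norms.mean()

# Toy usage: estimates are 2x too large, so the true scale is 0.5.
est = [np.array([0.0, 0.0, 1.00]),
       np.array([0.0, 0.0, 1.02]),
       np.array([0.0, 0.0, 0.98])]
scale = metric_scale_from_rig(est, baseline_metric=0.5)  # ≈ 0.5
```

In a full system such as the one described above, this scale would only serve as an initialization, with the back-end refining it online together with the poses.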