🤖 AI Summary
Existing visual SLAM systems fail in turbid underwater environments due to severe light attenuation, backscatter, and low contrast, and further lack support for multi-camera configurations. To address these challenges, this paper proposes a multimodal tightly coupled SLAM framework tailored for work-class ROVs, integrating multi-view cameras, IMU, and forward-looking sonar. Methodologically, it introduces: (i) geometric visual-inertial odometry (VIO) tightly coupled with sonar registration via joint optimization; (ii) cross-modal calibration unifying optical, inertial, and sonar coordinate frames; (iii) deep learning-driven robust feature extraction resilient to underwater degradation; and (iv) real-time semantic segmentation-guided 3D reconstruction. Evaluated in the Trondheim Fjord, the system achieves >15 Hz real-time pose estimation and centimeter-level reconstruction accuracy, substantially outperforming monocular and stereo baselines. It is the first underwater SLAM framework to support arbitrary multi-camera topologies and semantic-enhanced mapping.
📝 Abstract
Autonomous Underwater Vehicles (AUVs) and Remotely Operated Vehicles (ROVs) demand robust spatial perception capabilities, including Simultaneous Localization and Mapping (SLAM), to support both remote and autonomous tasks. Vision-based systems have been integral to these advancements, capturing rich color and texture at low cost while enabling semantic scene understanding. However, underwater conditions -- such as light attenuation, backscatter, and low contrast -- often degrade image quality to the point where traditional vision-based SLAM pipelines fail. Moreover, these pipelines typically rely on monocular or stereo inputs, limiting their scalability to the multi-camera configurations common on many vehicles. To address these issues, we propose to leverage multi-modal sensing that fuses data from multiple sensors -- including cameras, inertial measurement units (IMUs), and acoustic devices -- to enhance situational awareness and enable robust, real-time SLAM. We explore both geometric and learning-based techniques along with semantic analysis, and conduct experiments on data collected from a work-class ROV during several field deployments in the Trondheim Fjord. Through our experimental results, we demonstrate the feasibility of reliable real-time state estimation and high-quality 3D reconstruction in visually challenging underwater conditions. We also discuss system constraints and identify open research questions, such as sensor calibration and the limitations of learning-based methods, that merit further exploration to advance large-scale underwater operations.