AI Summary
Turbid underwater environments cause severe visual degradation, while sonar suffers from low resolution and inherent blurriness. Existing cross-modal reconstruction methods rely on flawed geometric assumptions, leading to artifacts and poor generalization to complex scenes. To address this, we propose the first end-to-end cross-modal fusion framework that embeds a differentiable plane-sweep algorithm into a deep learning architecture, jointly optimizing over multi-view images and synchronized sonar data to produce robust, dense depth estimates without heuristic geometric priors. Extensive experiments on both synthetic and real-world turbid underwater scenes demonstrate significant improvements in depth accuracy and completeness over state-of-the-art methods. We also introduce the first publicly available synchronized binocular vision–sonar dataset, establishing a new benchmark for underwater 3D reconstruction research.
Abstract
Accurate 3D reconstruction in visually degraded underwater environments remains a formidable challenge. Single-modality approaches are insufficient: vision-based methods fail under poor visibility and geometric constraints, while sonar is limited by inherent elevation ambiguity and low resolution. Consequently, prior fusion techniques rely on heuristics and flawed geometric assumptions, leading to significant artifacts and an inability to model complex scenes. In this paper, we introduce SonarSweep, a novel end-to-end deep learning framework that overcomes these limitations by adapting the principled plane sweep algorithm for cross-modal fusion between sonar and visual data. Extensive experiments in both high-fidelity simulation and real-world environments demonstrate that SonarSweep consistently generates dense, accurate depth maps, significantly outperforming state-of-the-art methods across challenging conditions, particularly under high turbidity. To foster further research, we will publicly release our code and a novel dataset featuring synchronized stereo-camera and sonar data, the first of its kind.
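To make the core idea concrete, the sketch below illustrates a classical differentiable plane sweep on a rectified stereo pair, which is the kind of operation the framework embeds in a network. This is a toy, vision-only version: the helper name `plane_sweep_depth` is hypothetical, simple horizontal shifts stand in for full homography warps, and the sonar cost term described in the paper is omitted. Because every step (warp, cost, soft-argmin) is differentiable, the same construction can be trained end-to-end.

```python
import numpy as np

def plane_sweep_depth(left, right, depths, focal, baseline, beta=10.0):
    """Toy differentiable plane sweep for a rectified stereo pair.

    For each hypothesized depth plane z, the right image is warped toward
    the left view by the disparity that z implies (disp = focal * baseline / z),
    a photometric cost is computed, and a soft-argmin over the resulting cost
    volume yields a depth map that is differentiable w.r.t. the inputs.
    """
    H, W = left.shape
    cost_volume = np.empty((len(depths), H, W))
    for i, z in enumerate(depths):
        disp = focal * baseline / z                 # disparity implied by plane z
        warped = np.roll(right, int(round(disp)), axis=1)  # crude horizontal warp
        cost_volume[i] = np.abs(left - warped)      # photometric matching cost
    # Soft-argmin: low cost -> high weight; a soft (differentiable) substitute
    # for picking the single best plane per pixel.
    w = np.exp(-beta * cost_volume)
    w /= w.sum(axis=0, keepdims=True)
    return (w * np.asarray(depths)[:, None, None]).sum(axis=0)
```

In a learned variant, the photometric cost is replaced by a cost over deep features (and, in a cross-modal setting, augmented with a sonar consistency term), while the soft-argmin keeps the depth estimate trainable by gradient descent.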