🤖 AI Summary
Underwater scene understanding suffers from the absence of large-scale, multi-task instruction-tuning datasets and from severe image degradation, both of which critically limit model performance. Method: We propose a physics-guided, plug-and-play Visual Feature Enhancement (VFE) module that incorporates underwater imaging physical models and integrates into the LLaVA-1.5 and Qwen2.5-VL architectures. Additionally, we introduce NautData, the first large-scale multimodal underwater instruction-tuning dataset, supporting eight diverse tasks, and perform instruction tuning on multi-granularity image-text pairs. Contribution/Results: VFE significantly improves the baseline models' robustness on degraded underwater imagery. The NAUTILUS model, trained on NautData, achieves state-of-the-art performance across multiple underwater understanding tasks, demonstrating the effectiveness of combining physics-informed feature enhancement with multi-task instruction tuning.
📝 Abstract
Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perception at multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into the renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on NautData and public underwater datasets demonstrate the effectiveness of the VFE module, which consistently improves the performance of both baselines on the majority of supported tasks, establishing the superiority of NAUTILUS in underwater scene understanding. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.
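To make the "physical priors derived from underwater imaging models" concrete, the sketch below illustrates the classic underwater image formation model that such physics-guided enhancement typically builds on: the observed image is the clear scene attenuated by a transmission map plus a backscattered ambient light term. This is only an assumed illustration of the prior, not the paper's VFE module, which is a learned feature-level component; the function names and parameter values here are hypothetical.

```python
import numpy as np

# Assumed underwater image formation model (simplified):
#   I(x) = J(x) * t(x) + B * (1 - t(x))
# where I is the observed degraded image, J the clear scene radiance,
# t the per-pixel transmission, and B the ambient background light.

def degrade(J, t, B):
    """Apply the underwater imaging model to a clear image J."""
    return J * t + B * (1.0 - t)

def restore(I, t, B, eps=1e-6):
    """Invert the model given (estimated) t and B: J = (I - B*(1-t)) / t."""
    return (I - B * (1.0 - t)) / np.maximum(t, eps)

rng = np.random.default_rng(0)
J = rng.uniform(0.0, 1.0, size=(4, 4, 3))  # clear scene radiance
t = np.full((4, 4, 1), 0.6)                # transmission map (broadcast over RGB)
B = np.array([0.10, 0.40, 0.50])           # bluish-green ambient veil
I = degrade(J, t, B)
J_hat = restore(I, t, B)
assert np.allclose(J, J_hat, atol=1e-6)    # exact inversion when t, B are known
```

In practice t and B are unknown and must be estimated (or, as in feature-level approaches like VFE, the inversion is absorbed into learned network components), so real restoration is approximate rather than the exact inverse shown here.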