🤖 AI Summary
To address the challenges of scarce annotations and difficult cross-modal alignment in underwater scene semantic understanding, this paper introduces the first vision-language foundation model tailored for marine domains. Methodologically, we propose an unsupervised image–text alignment framework; design a learnable prompt-guided progressive visual encoder and a vision-enhanced language encoder; and integrate contrastive pretraining, prompt-driven ViT feature aggregation, and vision-guided language modeling. We further construct a large-scale heterogeneous underwater image–text dataset comprising 2 million pairs. Evaluated on zero-shot segmentation, classification, detection, and marine organism counting, our model significantly outperforms state-of-the-art methods, enhancing both robustness and interpretability. Additionally, we establish the first comprehensive underwater multimodal benchmark for joint evaluation across diverse tasks.
📝 Abstract
The preservation of aquatic biodiversity is critical to mitigating the effects of climate change, and aquatic scene understanding plays a pivotal role in supporting marine scientists' decision-making. In this paper, we introduce AquaticCLIP, a novel contrastive language-image pre-training model tailored for aquatic scene understanding. AquaticCLIP presents a new unsupervised learning framework that aligns images and texts in aquatic environments, enabling tasks such as segmentation, classification, detection, and object counting. By leveraging our large-scale underwater image-text paired dataset without the need for ground-truth annotations, our model extends existing vision-language models to the aquatic domain. For this purpose, we construct a dataset of 2 million underwater image-text pairs from heterogeneous resources, including YouTube, Netflix, NatGeo, etc. To pre-train AquaticCLIP, we propose a prompt-guided vision encoder that progressively aggregates patch features via learnable prompts, while a vision-guided mechanism enhances the language encoder by incorporating visual context. The model is optimized through a contrastive pre-training loss that aligns the visual and textual modalities. AquaticCLIP achieves notable performance improvements in zero-shot settings across multiple underwater computer vision tasks, outperforming existing methods in both robustness and interpretability. Our model sets a new benchmark for vision-language applications in underwater environments. The code and dataset for AquaticCLIP are publicly available on GitHub at xxx.
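The two core ingredients described above, prompt-guided aggregation of patch features and a contrastive image-text alignment loss, can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the prompt-attention pooling and the symmetric InfoNCE loss below are standard CLIP-style constructions, and all shapes, function names, and the temperature value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere before computing similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(z, axis=-1):
    # Numerically stable log-softmax.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def prompt_guided_pool(patch_feats, prompts):
    """Aggregate ViT patch features using learnable prompts (illustrative).

    Each prompt attends over the patches via dot-product attention;
    the prompt outputs are averaged into one image embedding.
    """
    attn = np.exp(log_softmax(patch_feats @ prompts.T, axis=0))  # softmax over patches
    prompt_out = attn.T @ patch_feats        # (num_prompts, dim)
    return prompt_out.mean(axis=0)           # single image embedding

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    i2t = -np.diag(log_softmax(sim, axis=1)).mean()  # image -> text
    t2i = -np.diag(log_softmax(sim, axis=0)).mean()  # text -> image
    return 0.5 * (i2t + t2i)

# Toy batch: 4 images (16 patches each, dim 8) paired with 4 text embeddings.
prompts = rng.normal(size=(3, 8))            # hypothetical learnable prompts
imgs = np.stack([prompt_guided_pool(rng.normal(size=(16, 8)), prompts)
                 for _ in range(4)])
txts = rng.normal(size=(4, 8))
loss = contrastive_loss(imgs, txts)
```

In the actual model these operations would run inside trainable encoders; the sketch only shows how pooled image embeddings and text embeddings are pulled together for matched pairs and pushed apart for mismatched ones within a batch.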