🤖 AI Summary
Current waterway perception models are limited to instance-level detection and segmentation and lack global semantic understanding, which hinders large-scale monitoring and structured log generation. To address this, we introduce WaterCaption, the first image captioning dataset specifically designed for waterway scenes, and propose Da Yu, a novel multimodal large language model tailored for waterway surveillance. Da Yu is the first framework to support fine-grained, multi-region, long-form textual descriptions of waterway imagery. We further design a lightweight Nano Transformer Adaptor (NTA) that enables efficient vision-language alignment while jointly modeling global and local features, facilitating edge deployment. Experiments demonstrate that Da Yu surpasses state-of-the-art methods on WaterCaption and multiple general-purpose image captioning benchmarks, achieving an optimal trade-off between performance and computational efficiency and significantly enhancing the situational awareness of unmanned surface vehicles in complex waterway environments.
📝 Abstract
Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models focus on instance-level paradigms such as object detection and segmentation. Due to the complexity of waterway environments, however, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. Building on advances in vision-language models (VLMs), we leverage image captioning and introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region, long-text descriptions, opening a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it comprises 20.2k image-text pairs with a vocabulary size of 1.8 million. Additionally, we propose Da Yu, an edge-deployable multimodal large language model for USVs, built around a novel vision-to-language projector, the Nano Transformer Adaptor (NTA). NTA balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.
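The abstract does not spell out the NTA architecture, but a plausible reading is a small attention module that compresses the vision encoder's patch features into a short token sequence carrying both a global summary and local detail before handing it to the language model. The PyTorch sketch below illustrates that idea only; every name, dimension, and design choice here (a mean-pooled global token, learned local queries with cross-attention, a single small transformer layer, a linear projection into the LLM embedding space) is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ProjectorSketch(nn.Module):
    """Hypothetical vision-to-language projector (not the paper's NTA).

    Compresses a grid of vision-encoder patch features into a small set
    of tokens that preserve a global summary plus local detail, then
    projects them into the LLM's embedding space. All dimensions and
    module choices are illustrative assumptions.
    """

    def __init__(self, vis_dim=1024, llm_dim=2048, num_local=64, heads=8):
        super().__init__()
        # Learned queries that cross-attend to all patches for local detail.
        self.local_queries = nn.Parameter(torch.randn(num_local, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        # A single small transformer layer mixes the global and local tokens.
        self.mixer = nn.TransformerEncoderLayer(
            d_model=vis_dim, nhead=heads, dim_feedforward=4 * vis_dim,
            batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats):  # (B, N_patches, vis_dim)
        b = patch_feats.size(0)
        # Global summary token: mean-pooled patch features.
        global_tok = patch_feats.mean(dim=1, keepdim=True)            # (B, 1, D)
        queries = self.local_queries.unsqueeze(0).expand(b, -1, -1)   # (B, L, D)
        local_toks, _ = self.cross_attn(queries, patch_feats, patch_feats)
        tokens = torch.cat([global_tok, local_toks], dim=1)           # (B, 1+L, D)
        tokens = self.mixer(tokens)
        return self.proj(tokens)                                      # (B, 1+L, llm_dim)
```

For instance, a ViT-style encoder producing a 24x24 patch grid (576 features) would be compressed to 65 projected tokens under this sketch, the kind of reduction that keeps the LLM's visual context short enough for edge deployment.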