🤖 AI Summary
Current waterway perception models are limited to instance-level detection and segmentation and lack global semantic understanding, which hinders large-scale monitoring and structured log generation. To address this, we introduce WaterCaption, the first image captioning dataset specifically designed for waterway scenes, and propose Da Yu, a novel multimodal large language model tailored for waterway surveillance. Da Yu is the first framework to support fine-grained, multi-region, long-form textual descriptions of waterway imagery. We further design a lightweight Nano Transformer Adaptor (NTA) that enables efficient vision-language alignment while jointly modeling global and local features, facilitating edge deployment. Experiments demonstrate that Da Yu surpasses state-of-the-art methods on WaterCaption and multiple general-purpose image captioning benchmarks, achieving an optimal trade-off between performance and computational efficiency and significantly enhancing the situational awareness of unmanned surface vehicles in complex waterway environments.
📝 Abstract
Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models focus on instance-level paradigms such as object detection and segmentation. Due to the complexity of waterway environments, however, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. Building on advances in vision-language models (VLMs), we leverage image captioning and introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region, long-text descriptions, opening a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it comprises 20.2k image-text pairs with a vocabulary size of 1.8 million. Additionally, we propose Da Yu, an edge-deployable multimodal large language model for USVs, built around a novel vision-to-language projector, the Nano Transformer Adaptor (NTA). NTA balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.
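The abstract does not spell out the NTA architecture, but a plausible reading is a small attention module that compresses the vision encoder's patch features into a short token sequence carrying both a global summary and local detail before handing it to the language model. The PyTorch sketch below illustrates that idea only; every name, dimension, and design choice here (a mean-pooled global token, learned local queries with cross-attention, a single small transformer layer, a linear projection into the LLM embedding space) is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ProjectorSketch(nn.Module):
    """Hypothetical vision-to-language projector (not the paper's NTA).

    Compresses a grid of vision-encoder patch features into a small set
    of tokens that preserve a global summary plus local detail, then
    projects them into the LLM's embedding space. All dimensions and
    module choices are illustrative assumptions.
    """

    def __init__(self, vis_dim=1024, llm_dim=2048, num_local=64, heads=8):
        super().__init__()
        # Learned queries that cross-attend to all patches for local detail.
        self.local_queries = nn.Parameter(torch.randn(num_local, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        # A single small transformer layer mixes the global and local tokens.
        self.mixer = nn.TransformerEncoderLayer(
            d_model=vis_dim, nhead=heads, dim_feedforward=4 * vis_dim,
            batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats):  # (B, N_patches, vis_dim)
        b = patch_feats.size(0)
        # Global summary token: mean-pooled patch features.
        global_tok = patch_feats.mean(dim=1, keepdim=True)            # (B, 1, D)
        queries = self.local_queries.unsqueeze(0).expand(b, -1, -1)   # (B, L, D)
        local_toks, _ = self.cross_attn(queries, patch_feats, patch_feats)
        tokens = torch.cat([global_tok, local_toks], dim=1)           # (B, 1+L, D)
        tokens = self.mixer(tokens)
        return self.proj(tokens)                                      # (B, 1+L, llm_dim)
```

For instance, a ViT-style encoder producing a 24x24 patch grid (576 features) would be compressed to 65 projected tokens under this sketch, the kind of reduction that keeps the LLM's visual context short enough for edge deployment.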