Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding

📅 2025-06-23
🤖 AI Summary
Current waterway perception models are limited to instance-level detection and segmentation, lacking global semantic understanding—hindering large-scale monitoring and structured log generation. To address this, we introduce WaterCaption, the first image captioning dataset specifically designed for waterway scenes, and propose Da Yu, a novel multimodal large language model tailored for waterway surveillance. Da Yu is the first framework to support fine-grained, multi-region, and long-form textual descriptions of waterway imagery. We further design a lightweight Nano Transformer Adaptor (NTA) to enable efficient vision–language alignment while jointly modeling global and local features, facilitating edge deployment. Experiments demonstrate that Da Yu surpasses state-of-the-art methods on WaterCaption and multiple general-purpose image captioning benchmarks, achieving an optimal trade-off between performance and computational efficiency. This significantly enhances the situational awareness capability of unmanned surface vehicles in complex waterway environments.

📝 Abstract
Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. Building on advances in vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region, long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it comprises 20.2k image-text pairs with a vocabulary of 1.8 million words. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, featuring a novel vision-to-language projector called the Nano Transformer Adaptor (NTA). NTA balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.
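The abstract describes the NTA as a vision-to-language projector that compresses visual features into the LLM's token space while modeling both global and local structure. The paper page does not specify the architecture; as a rough illustration of the general query-based cross-attention projector pattern used in this family of models (e.g., Q-Former-style adaptors), here is a minimal NumPy sketch. All names, shapes, and the single-head design are illustrative assumptions, not the actual NTA.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_projector(vision_tokens, queries, Wq, Wk, Wv, Wo):
    """Compress N vision tokens into K language-space tokens (K << N).

    vision_tokens: (N, d_v)  output of the vision encoder
    queries:       (K, d)    learnable query tokens
    Wq, Wk, Wv:    projections into a shared attention space of width d
    Wo:            (d, d_llm) projection into the LLM embedding space
    """
    Q = queries @ Wq                                  # (K, d)
    Km = vision_tokens @ Wk                           # (N, d)
    V = vision_tokens @ Wv                            # (N, d)
    attn = softmax(Q @ Km.T / np.sqrt(Q.shape[-1]))   # (K, N) attention weights
    return (attn @ V) @ Wo                            # (K, d_llm) tokens for the LLM

# Illustrative sizes: 576 vision patches compressed to 64 LLM-space tokens.
rng = np.random.default_rng(0)
N, K, d_v, d, d_llm = 576, 64, 1024, 256, 2048
out = cross_attention_projector(
    rng.normal(size=(N, d_v)),
    rng.normal(size=(K, d)),
    rng.normal(size=(d, d)) * 0.02,
    rng.normal(size=(d_v, d)) * 0.02,
    rng.normal(size=(d_v, d)) * 0.02,
    rng.normal(size=(d, d_llm)) * 0.02,
)
print(out.shape)  # (64, 2048)
```

Compressing the token count before the LLM is what makes such adaptors attractive for edge deployment: the language model's attention cost scales with the number of visual tokens it must ingest.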
Problem

Research questions and friction points this paper is trying to address.

Enabling USVs to achieve global semantic understanding of waterways
Addressing lack of captioning datasets for waterway environments
Balancing computational efficiency with visual feature modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces WaterCaption dataset for waterway captioning
Proposes Nano Transformer Adaptor for vision-language projection
Develops edge-deployable multi-modal model Da Yu
Runwei Guan
Hong Kong University of Science and Technology (Guangzhou) / Founder of FertiTech AI
Multi-Modal Learning · Unmanned Surface Vessel · Radar Perception · AI Medicine
Ningwei Ouyang
Xi’an Jiaotong-Liverpool University
Tianhao Xu
Hong Kong University of Science and Technology (Guangzhou)
Shaofeng Liang
China University of Petroleum (East China)
Wei Dai
Xi’an Jiaotong-Liverpool University
Yafeng Sun
University of Science and Technology of China
Shang Gao
Hong Kong University of Science and Technology (Guangzhou)
Songning Lai
HKUST(GZ)
Machine Learning · Deep Learning · Multimodal · XAI
Shanliang Yao
Yancheng Institute of Technology
Autonomous Driving · Intelligent Vehicles · Radar-Camera Fusion · Maritime Perception
Xuming Hu
Assistant Professor, HKUST(GZ) / HKUST
Natural Language Processing · Large Language Model
Ryan Wen Liu
Wuhan University of Technology
Yutao Yue
Hong Kong University of Science and Technology (Guangzhou)
Hui Xiong
Senior Scientist, Candela Corporation
Ultrafast Dynamics · Atomic Molecular Physics · Free Electron Laser