UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding

📅 2025-10-20
🤖 AI Summary
Underwater images suffer from severe light attenuation, color distortion, and particle scattering, and interpreting them accurately requires specialized knowledge of marine ecosystems and organism taxonomy. Existing vision-language models (VLMs) have not been systematically evaluated in this setting. To bridge this gap, we introduce UWBench, a comprehensive benchmark for underwater vision-language understanding. It comprises 15,003 high-resolution underwater images, 15,281 human-verified object referring expressions, and 124,983 question-answer pairs, and supports three tasks: detailed image captioning, visual grounding, and visual question answering. Evaluation spans object-level referring comprehension and knowledge-based question answering. Experiments on state-of-the-art VLMs reveal substantial performance gaps in underwater understanding. The benchmark provides a standardized evaluation platform for marine AI, supporting ecological monitoring and autonomous underwater exploration.

📝 Abstract
Large vision-language models (VLMs) have achieved remarkable success in natural scene understanding, yet their application to underwater environments remains largely unexplored. Underwater imagery presents unique challenges including severe light attenuation, color distortion, and suspended particle scattering, while requiring specialized knowledge of marine ecosystems and organism taxonomy. To bridge this gap, we introduce UWBench, a comprehensive benchmark specifically designed for underwater vision-language understanding. UWBench comprises 15,003 high-resolution underwater images captured across diverse aquatic environments, encompassing oceans, coral reefs, and deep-sea habitats. Each image is enriched with human-verified annotations including 15,281 object referring expressions that precisely describe marine organisms and underwater structures, and 124,983 question-answer pairs covering diverse reasoning capabilities from object recognition to ecological relationship understanding. The dataset captures rich variations in visibility, lighting conditions, and water turbidity, providing a realistic testbed for model evaluation. Based on UWBench, we establish three comprehensive benchmarks: detailed image captioning for generating ecologically informed scene descriptions, visual grounding for precise localization of marine organisms, and visual question answering for multimodal reasoning about underwater environments. Extensive experiments on state-of-the-art VLMs demonstrate that underwater understanding remains challenging, with substantial room for improvement. Our benchmark provides essential resources for advancing vision-language research in underwater contexts and supporting applications in marine science, ecological monitoring, and autonomous underwater exploration. Our code and benchmark will be available.
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' capabilities in underwater vision-language understanding
Addressing underwater-specific challenges such as light attenuation, color distortion, and particle scattering
Establishing benchmarks for marine ecological reasoning and organism localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces UWBench, a benchmark for underwater vision-language tasks
Contains 15,003 images with human-verified annotations (15,281 referring expressions and 124,983 question-answer pairs)
Establishes three benchmarks: image captioning, visual grounding, and visual question answering
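
The abstract does not say how the visual grounding benchmark is scored. A common convention for referring-expression grounding is accuracy at an intersection-over-union (IoU) threshold of 0.5; the sketch below illustrates that convention only and is not taken from UWBench. The box layout `(x_min, y_min, x_max, y_max)`, the function names, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the reference box meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)
```

With one predicted box per referring expression, `grounding_accuracy(preds, targets)` gives a single score per model; raising `threshold` makes the localization requirement stricter.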
Da Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom, China and also with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China
Chenggang Rong
Institute of Artificial Intelligence (TeleAI), China Telecom, China and also with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China
Bingyu Li
Institute of Artificial Intelligence (TeleAI), China Telecom, China
Feiyu Wang
Fudan University
computer vision
Zhiyuan Zhao
Institute of Artificial Intelligence (TeleAI), China Telecom, China
Junyu Gao
Institute of Artificial Intelligence (TeleAI), China Telecom, China and also with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom, China