AI Summary
This study investigates whether vision-language models (VLMs) can effectively perform domain-specific question answering in marine science. Method: We introduce MarineEval, the first large-scale, expert-validated marine-domain benchmark, comprising 2,000 image-question-answer triplets that cover seven task categories and twenty fine-grained capability dimensions. We propose a marine-science-oriented VLM evaluation framework featuring a domain-informed, multi-dimensional capability annotation schema, expert-driven data curation, and systematic error attribution analysis. Contribution/Results: A comprehensive evaluation of 17 state-of-the-art VLMs reveals that their average accuracy on marine QA tasks falls below 40% of human expert performance, exposing critical deficiencies in domain-specific semantic understanding, cross-modal reasoning, and the integration of scientific commonsense knowledge. This work establishes a new paradigm for domain-specialized VLM evaluation and identifies key bottlenecks and actionable directions for advancing domain intelligence.
Abstract
We have witnessed promising progress led by large language models (LLMs) and, further, vision-language models (VLMs) in handling various queries as general-purpose assistants. VLMs, as a bridge between the visual world and language corpora, receive both visual content and text-only user instructions and generate corresponding responses. Although VLMs have achieved great success in various fields, in this work we ask whether existing VLMs can act as domain experts, accurately answering marine questions that require significant domain expertise and address specialized domain challenges and requirements. To comprehensively evaluate the effectiveness and explore the boundaries of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark, called MarineEval, with 2,000 image-based question-answering pairs. During dataset construction, we ensure the diversity and coverage of the data across 7 task dimensions and 20 capability dimensions. Domain requirements are explicitly integrated into the data construction process and further verified by marine domain experts. We comprehensively benchmark 17 existing VLMs on MarineEval and investigate their limitations in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer these domain-specific questions, and there remains substantial room for performance improvement. We hope our new benchmark and observations will facilitate future research. Project Page: http://marineeval.hkustvgd.com/
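The abstract describes a benchmark of image-question-answer triplets on which model accuracy is measured. As a minimal sketch of how such a benchmark might be scored, the snippet below evaluates a model callable over a list of triplets with exact-match accuracy; the names (`BenchmarkItem`, `evaluate`) and the scoring rule are illustrative assumptions, not the paper's actual protocol.

```python
# Hypothetical sketch: scoring a VLM on image-QA triplets by exact match.
# `BenchmarkItem` and `evaluate` are illustrative names, not from MarineEval.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BenchmarkItem:
    image_path: str   # path to the marine image
    question: str     # text-only question/instruction
    answer: str       # expert-validated ground-truth answer

def evaluate(model: Callable[[str, str], str], items: List[BenchmarkItem]) -> float:
    """Return the fraction of items where the model's answer exactly matches
    the ground truth (case- and whitespace-insensitive)."""
    correct = sum(
        model(item.image_path, item.question).strip().lower()
        == item.answer.strip().lower()
        for item in items
    )
    return correct / len(items) if items else 0.0

# Toy usage with a stub "model" that always answers "coral":
items = [
    BenchmarkItem("reef.jpg", "What organism dominates this image?", "coral"),
    BenchmarkItem("kelp.jpg", "What organism dominates this image?", "kelp"),
]
stub = lambda image, question: "coral"
print(evaluate(stub, items))  # 0.5
```

A real harness would also need per-dimension breakdowns (the 7 task and 20 capability dimensions mentioned above) and a more forgiving matcher for open-ended answers, but the core loop has this shape.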