OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

📅 2026-04-25
🤖 AI Summary
Marine data are often fragmented, multimodal, and highly noisy, and they lack semantic alignment, which significantly hinders the application of artificial intelligence in ocean science. To address these challenges, this work presents the first unified, semantically aligned, and scientifically grounded marine multimodal corpus, integrating sonar data, underwater imagery, charts, and textual descriptions. The authors propose a novel instruction data synthesis method guided by a domain-specific ocean concept knowledge graph. By combining multi-source heterogeneous data fusion, hierarchical knowledge graph guidance, multi-stage quality control, and multimodal alignment with instruction fine-tuning, the approach substantially improves model performance on marine-related tasks. The project also releases a high-quality corpus and a human-annotated evaluation benchmark to advance marine artificial intelligence research.
📝 Abstract
The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated evaluation benchmark for rigorous assessment. We establish a multi-stage quality control process to ensure scientific validity and alignment across modalities. Experimental validation demonstrates significant performance improvements for models trained on our data. All datasets are publicly released to advance the field of marine artificial intelligence and empower domain-specific MLLMs.
Problem

Research questions and friction points this paper is trying to address.

ocean data
multimodal
data bottleneck
semantic alignment
foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal foundation models
ocean AI
knowledge graph-guided synthesis
cross-modal alignment
marine data integration
Yida Xue
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China.
Ningyu Zhang
Ph.D. Student, Vanderbilt University
artificial intelligence, learning analytics, learning environments
Tingwei Wu
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China.
Zhe Ma
imec
task scheduling, computer architecture, soft error
Daxiong Ji
School of Software Technology, Zhejiang University, Ningbo 315048, China.
Zhao Wang
School of Software Technology, Zhejiang University, Ningbo 315048, China.
Guozhou Zheng
State Key Laboratory of Ocean Sensing, Hangzhou 311200, China.
Huajun Chen
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China.