Automated data curation for self-supervised learning in underwater acoustic analysis

📅 2025-05-26
🤖 AI Summary
Marine acoustic surveillance produces massive volumes of unlabeled passive acoustic monitoring (PAM) data that are hard to use efficiently. To address this, the paper proposes the first fully automated self-supervised data curation pipeline that integrates Automatic Identification System (AIS) vessel trajectories with recordings from multiple hydrophones. The pipeline jointly uses AIS spatiotemporal metadata and underwater acoustic signals, and introduces an unsupervised audio sampling strategy based on hierarchical k-means that samples PAM data automatically while balancing classes and increasing semantic diversity. The resulting dataset requires no manual annotation and directly supports end-to-end self-supervised pretraining. Reported experiments show a 23% increase in whale detection F1-score and 91.4% accuracy in vessel noise classification, improving robustness and scalability for marine mammal monitoring and anthropogenic noise assessment.

📝 Abstract
The sustainability of the ocean ecosystem is threatened by increased levels of sound pollution, making monitoring crucial for understanding its variability and impact. Passive acoustic monitoring (PAM) systems collect large volumes of underwater sound recordings, far more than can be analyzed manually, which creates the need for automation. Although machine learning offers a potential solution, most underwater acoustic recordings are unlabeled. Self-supervised learning models have demonstrated success in learning from large-scale unlabeled data in domains such as computer vision, natural language processing, and audio. However, these models require large, diverse, and balanced training datasets in order to generalize well. To address this, a fully automated self-supervised data curation pipeline is proposed to create a diverse and balanced dataset from raw PAM data. It integrates Automatic Identification System (AIS) data with recordings from various hydrophones in U.S. waters. Using hierarchical k-means clustering, the raw audio data is sampled and then combined with AIS samples to create a balanced and diverse dataset. The resulting curated dataset enables the development of self-supervised learning models, facilitating tasks such as monitoring marine mammals and assessing sound pollution.
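
The abstract does not say how AIS data are matched to hydrophone recordings. As a minimal sketch, assuming the match is made by time overlap plus a distance radius around the hydrophone, the step might look like the following. The names `AisPing`, `Clip`, `vessels_near`, the 5 km radius, and all coordinates are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class AisPing:
    mmsi: int    # vessel identifier broadcast over AIS
    t: float     # Unix timestamp in seconds
    lat: float   # degrees
    lon: float   # degrees


@dataclass(frozen=True)
class Clip:
    clip_id: str  # hypothetical identifier for one audio segment
    start: float  # Unix timestamp in seconds
    end: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def vessels_near(clip, pings, hydro_lat, hydro_lon, radius_km):
    """Return the MMSIs of vessels that reported a position within
    radius_km of the hydrophone while the clip was being recorded."""
    return {
        p.mmsi
        for p in pings
        if clip.start <= p.t <= clip.end
        and haversine_km(hydro_lat, hydro_lon, p.lat, p.lon) <= radius_km
    }


# Illustrative data only (assumed hydrophone position and pings).
hydro = (36.0, -122.0)
pings = [
    AisPing(111, 50.0, 36.01, -122.0),   # close by, inside the clip window
    AisPing(222, 50.0, 37.0, -122.0),    # inside the window but far away
    AisPing(333, 500.0, 36.01, -122.0),  # close by but after the clip ended
]
clip = Clip("clip-0", 0.0, 100.0)
nearby = vessels_near(clip, pings, *hydro, radius_km=5.0)
```

Clips with a non-empty result could be treated as candidate vessel-noise samples and clips with an empty result as candidates for other sound sources, although the paper may use a different rule.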
Problem

Research questions and friction points this paper is trying to address.

Automating analysis of large underwater sound datasets
Addressing lack of labeled underwater acoustic recordings
Creating balanced datasets for self-supervised learning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated self-supervised data curation pipeline
Integrates AIS data with hydrophone recordings
Uses hierarchical k-means clustering for sampling
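
The hierarchical k-means sampling listed above can be illustrated with a small pure-Python sketch. It assumes each audio clip has already been mapped to an embedding vector, clusters the embeddings into fine clusters, clusters those centroids again into coarse clusters, and then draws clips by picking a coarse cluster, a fine cluster, and a clip uniformly at each level. The two-level depth, the cluster counts, and the function names are assumptions for illustration, not the paper's configuration.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignment per point)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, [nearest(p, centroids) for p in points]


def hierarchical_sample(points, k_fine, k_coarse, n_samples, seed=0):
    """Draw point indices uniformly over coarse clusters, then fine
    clusters, then points, without replacement."""
    rng = random.Random(seed)
    fine_centroids, fine_assign = kmeans(points, k_fine, seed=seed)
    _, coarse_assign = kmeans(fine_centroids, k_coarse, seed=seed)

    # tree[coarse cluster][fine cluster] -> indices of member points
    tree = defaultdict(lambda: defaultdict(list))
    for idx, f in enumerate(fine_assign):
        tree[coarse_assign[f]][f].append(idx)

    chosen = []
    for _ in range(min(n_samples, len(points))):
        coarse = rng.choice(sorted(tree))
        fine = rng.choice(sorted(tree[coarse]))
        bucket = tree[coarse][fine]
        chosen.append(bucket.pop(rng.randrange(len(bucket))))
        if not bucket:            # drop exhausted nodes so they are
            del tree[coarse][fine]  # never selected again
        if not tree[coarse]:
            del tree[coarse]
    return chosen


# Illustrative imbalanced data: 90 points in one blob, 10 in another.
data_rng = random.Random(1)
points = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(90)]
points += [[data_rng.gauss(10, 1), data_rng.gauss(10, 1)] for _ in range(10)]
picked = hierarchical_sample(points, k_fine=4, k_coarse=2, n_samples=10)
```

Because each draw starts from a uniformly chosen cluster rather than a uniformly chosen clip, rare sound types that form their own cluster tend to be selected more often than under plain random sampling, which is the balancing effect the pipeline aims for.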