🤖 AI Summary
This study addresses the subjectivity and low reproducibility of conventional three-dimensional biogeochemical regionalization of the North Atlantic. It proposes an integrated UMAP-DBSCAN-NEMI machine learning framework. Using multi-source in situ measurements of temperature, salinity, dissolved oxygen, and nutrients (nitrate, phosphate, and silicate), the framework applies UMAP for dimensionality reduction, DBSCAN for clustering, and NEMI to aggregate an ensemble of 100 clustering runs and quantify uncertainty. Candidate clustering methods and their hyperparameters are validated with external, internal, and relative metrics. The method yields 321 physically interpretable 3D biogeochemical provinces that agree strongly with classical water mass definitions and give a more detailed regionalization than the Longhurst provinces. Ensemble overlap reaches 88.81 ± 1.8%, and the mean grid-cell-wise uncertainty is 15.49 ± 20%. This work establishes an objective, reproducible regionalization benchmark for mechanistic ocean process analysis and downstream applications such as marine protected area design.
📝 Abstract
Defining ocean regions and water masses helps to understand marine processes and can serve downstream tasks such as defining marine protected areas. However, such definitions are often the result of subjective decisions, which can produce misleading, unreproducible results. Here, the aim was to objectively define regions of the North Atlantic. For this, a data-driven, systematic machine learning approach was applied to generate and validate ocean clusters, employing external, internal, and relative validation techniques. About 300 million measured values of salinity, temperature, and oxygen, nitrate, phosphate, and silicate concentration served as input for various clustering methods: KMeans, agglomerative Ward, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Uniform Manifold Approximation and Projection (UMAP) emphasised (dis-)similarities in the data while reducing dimensionality. A systematic validation of the considered clustering methods and their hyperparameters showed that UMAP-DBSCAN best represented the data. To address stochastic variability, 100 UMAP-DBSCAN clustering runs were conducted and aggregated using Native Emergent Manifold Interrogation (NEMI), producing a final set of 321 clusters. Reproducibility was evaluated by calculating the ensemble overlap (88.81 ± 1.8%) and the mean grid-cell-wise uncertainty estimated by NEMI (15.49 ± 20%). The presented clustering results agreed very well with common water mass definitions. This study revealed a more detailed regionalization than previous concepts such as the Longhurst provinces. The applied method is objective, efficient, and reproducible, and it will support future research focusing on biogeochemical differences and changes in oceanic regions.
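To make the ensemble step more concrete, the following is a minimal sketch of how repeated clustering runs could be aligned, aggregated by majority vote, and scored for overlap and per-cell uncertainty. It is an illustration under stated assumptions, not the published NEMI implementation: the function names (`relabel_to_reference`, `aggregate`, `ensemble_overlap`), the choice of the first run as the reference, and the exact definitions of overlap and uncertainty are all hypothetical simplifications.

```python
from collections import Counter


def relabel_to_reference(reference, member):
    """Rename each cluster id in `member` to the reference id it overlaps most.

    Assumption: labels are aligned by maximum shared grid cells with the
    reference run; other matching schemes are possible.
    """
    overlap = {}
    for ref_id, mem_id in zip(reference, member):
        overlap.setdefault(mem_id, Counter())[ref_id] += 1
    mapping = {m: counts.most_common(1)[0][0] for m, counts in overlap.items()}
    return [mapping[m] for m in member]


def aggregate(ensemble):
    """Majority-vote label per grid cell, plus the share of dissenting runs (%)."""
    reference = ensemble[0]  # assumption: the first run serves as the reference
    aligned = [list(reference)]
    aligned += [relabel_to_reference(reference, m) for m in ensemble[1:]]
    final, uncertainty = [], []
    for votes in zip(*aligned):
        label, count = Counter(votes).most_common(1)[0]
        final.append(label)
        uncertainty.append(100.0 * (1 - count / len(aligned)))
    return final, uncertainty


def ensemble_overlap(ensemble):
    """Percentage of grid cells on which each run agrees with the reference run."""
    reference = ensemble[0]
    scores = []
    for member in ensemble[1:]:
        aligned = relabel_to_reference(reference, member)
        agree = sum(a == b for a, b in zip(reference, aligned))
        scores.append(100.0 * agree / len(reference))
    return scores


# Toy ensemble: three runs over six grid cells, with arbitrary cluster ids.
runs = [
    [0, 0, 0, 1, 1, 1],
    [5, 5, 5, 7, 7, 7],  # same partition as the first run, different ids
    [2, 2, 3, 3, 3, 3],  # one grid cell assigned differently
]
final_labels, cell_uncertainty = aggregate(runs)
overlap_scores = ensemble_overlap(runs)
```

In this toy case the third grid cell is the only one where the runs disagree, so it alone receives a non-zero uncertainty. Averaging `overlap_scores` and `cell_uncertainty` would give summary statistics analogous in spirit to the ensemble overlap and mean grid-cell-wise uncertainty reported above, although the study's exact definitions may differ.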