🤖 AI Summary
This study addresses a longstanding problem in clustering large online mass spectrometry datasets: existing methods struggle to achieve scalability, metric flexibility, and algorithmic stability at the same time. It introduces Flexible Adaptive Stable Clustering (FASC), a dynamical systems–based framework that decouples the similarity kernel from the optimization logic. FASC combines a Density-Augmented Similarity Selection rule with geometric constraints to ensure deterministic, order-independent convergence, avoiding the stochastic drift of conventional heuristic approaches. With empirically linear runtime scaling, O(N), the algorithm attains over 99.5% cluster purity and an Adjusted Rand Index of 0.99 on benchmark datasets, and it identifies rare industrial tracers present at abundances below 0.2% in a dataset of 25 million atmospheric aerosol mass spectra.
📝 Abstract
Modern online mass spectrometry generates multi-terabyte data streams critical for understanding Earth's environmental systems. However, extracting actionable chemical insights from these repositories is impeded by a computational bottleneck: existing clustering methods force a compromise among scalability, metric flexibility, and algorithmic stability. Here, we introduce Flexible Adaptive Stable Clustering (FASC), a dynamical systems framework that resolves these constraints by architecturally decoupling the similarity kernel from rigorous optimization logic. Unlike legacy heuristics that suffer from stochastic drift and algorithmic blending, FASC employs a Density-Augmented Similarity Selection rule and geometric constraints to guarantee deterministic, order-independent convergence. After validating FASC on canonical machine-learning ground truths (achieving >99.5% cluster purity and 0.99 Adjusted Rand Index), we deployed the framework on 25 million mass spectra of atmospheric aerosols. Demonstrating strictly linear empirical runtime scaling (O(N)), FASC autonomously mapped atmospheric aging pathways of secondary inorganic aerosols while isolating ultra-rare industrial tracers (<0.2% abundance), providing a scalable infrastructure for mining environmental big data.
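The two validation metrics quoted above, cluster purity and the Adjusted Rand Index (ARI), can be made concrete with a minimal sketch. This uses the standard textbook definitions and is not the authors' evaluation code; the function names `purity` and `adjusted_rand_index` and the toy labels in the usage note are illustrative assumptions.

```python
from collections import Counter
from math import comb


def purity(labels_true, labels_pred):
    """Fraction of points that carry the majority true label of their cluster."""
    if len(labels_true) != len(labels_pred) or not labels_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    # Count co-occurrences of (predicted cluster, true class).
    pairs = Counter(zip(labels_pred, labels_true))
    majority = {}
    for (cluster, _cls), count in pairs.items():
        majority[cluster] = max(majority.get(cluster, 0), count)
    return sum(majority.values()) / len(labels_true)


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected pair-counting agreement between two partitions."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must be of equal length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(v, 2) for v in contingency.values())
    sum_rows = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (sum_cells - expected) / (max_index - expected)
```

As a usage example, `purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])` counts five of six points as matching their cluster's majority class, while the ARI for the same labels is far lower because the predicted partition splits one class and mixes another. Purity alone rewards over-segmentation, which is one reason the two metrics are commonly reported together.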